Recon for Pentesting Using Python: Ethical Email Discovery for Security Assessments

October 26, 2023 | 10 min read | By Vincent Olagbemide

Reconnaissance is one of the most important stages of a penetration test. Before a security team can understand how an organisation may be exposed, they first need to understand what information is already publicly available.

One common area of reconnaissance is email discovery.

Email addresses can reveal a lot about an organisation’s external footprint. They may expose naming conventions, departments, public-facing staff accounts, third-party references, and potential targets for phishing simulation, credential-stuffing analysis, or awareness training.

This article explains how ethical email discovery works during an authorised penetration test and demonstrates a simple Python-based approach for extracting email addresses from publicly accessible web pages.

This is not about harvesting emails for spam or abuse. It is about understanding how publicly exposed information can increase organisational risk.

Why Email Discovery Matters in Penetration Testing

Attackers rarely begin with exploitation. They usually begin with information.

Before launching phishing campaigns, password spraying, business email compromise attempts, or social engineering attacks, threat actors often collect public information about a target organisation.

Email addresses are especially valuable because they can help an attacker understand:

employee naming patterns
departmental inboxes
externally exposed staff accounts
abandoned or legacy email references
third-party service relationships
possible phishing targets
possible password spraying targets

For defenders, this same information is useful because it helps answer an important question:

What can an attacker already learn about us before touching any internal system?

That is why email discovery can be useful during an authorised security assessment.

Ethical and Legal Scope Comes First

Before running any reconnaissance script, the most important step is not technical. It is authorisation.

Email discovery should only be performed when you have permission to assess the target domain or organisation.

A proper engagement should define:

the target domain or list of domains
what types of reconnaissance are allowed
whether crawling is permitted
whether third-party sites can be checked
rate limits or traffic restrictions
how collected data should be handled
how findings should be reported
how sensitive information should be stored or deleted

What This Python Script Does

The goal of this script is simple:

Start from a target URL.
Visit publicly accessible pages.
Extract email addresses found in the page source.
Follow internal links discovered on the page.
Continue crawling up to a defined limit.
Print discovered email addresses.

The script uses:

requests to fetch web pages
BeautifulSoup to parse HTML
deque to manage URLs waiting to be visited
urllib.parse to handle relative and absolute links
re to detect email address patterns

This is a basic educational example. In a real engagement, you would add stricter scope control, rate limiting, logging, error handling, robots.txt awareness, and proper reporting.
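
As one illustration of the robots.txt awareness mentioned above, the standard library's urllib.robotparser can answer whether a path is allowed for a given user-agent. This is a minimal sketch, not part of the original script: the helper name build_robots_checker and the sample robots.txt content are hypothetical, and it assumes the robots.txt text has already been downloaded.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "AuthorizedSecurityAssessmentBot/1.0"  # same string the script sends


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

robots = build_robots_checker(SAMPLE_ROBOTS)
allowed_contact = robots.can_fetch(USER_AGENT, "https://example.com/contact")
allowed_admin = robots.can_fetch(USER_AGENT, "https://example.com/admin/users")
delay_seconds = robots.crawl_delay(USER_AGENT)  # None when no Crawl-delay is set
```

In a crawl loop, a URL would be skipped when can_fetch returns False. Whether robots.txt is binding during an engagement is a scope question to settle with the client.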

Python Email Discovery Script

import re
import urllib.parse
from collections import deque

import requests
import requests.exceptions
from bs4 import BeautifulSoup


def extract_emails(html: str) -> set[str]:
    """
    Extract email addresses from HTML/text content.
    """
    email_pattern = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"
    return set(re.findall(email_pattern, html))


def normalize_link(link: str, base_url: str, current_path: str) -> str:
    """
    Convert relative links into absolute links.
    """
    if link.startswith("/"):
        return urllib.parse.urljoin(base_url, link)

    if not link.startswith("http"):
        return urllib.parse.urljoin(current_path, link)

    return link


def is_same_domain(url: str, original_domain: str) -> bool:
    """
    Keep crawling within the original target domain.
    This helps avoid accidentally crawling unrelated third-party sites.
    """
    try:
        parsed_url = urllib.parse.urlsplit(url)
        return parsed_url.netloc == original_domain
    except ValueError:
        return False


def crawl_for_emails(start_url: str, max_pages: int = 200) -> set[str]:
    """
    Crawl a website and extract publicly visible email addresses.
    """
    urls_to_visit = deque([start_url])
    visited_urls = set()
    discovered_emails = set()

    original_parts = urllib.parse.urlsplit(start_url)
    original_domain = original_parts.netloc

    page_count = 0

    while urls_to_visit and page_count < max_pages:
        url = urls_to_visit.popleft()

        if url in visited_urls:
            continue

        visited_urls.add(url)
        page_count += 1

        print(f"[{page_count}] Crawling: {url}")

        try:
            response = requests.get(
                url,
                timeout=10,
                headers={
                    "User-Agent": "AuthorizedSecurityAssessmentBot/1.0"
                },
            )
        except requests.exceptions.RequestException:
            # Base class for all requests errors (connection failures, timeouts,
            # missing or invalid schemas, redirect loops). Skip the page and
            # keep crawling instead of crashing on an unexpected error type.
            continue

        if response.status_code != 200:
            continue

        content_type = response.headers.get("Content-Type", "")

        if "text/html" not in content_type:
            continue

        discovered_emails.update(extract_emails(response.text))

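        parts = urllib.parse.urlsplit(url)
        base_url = f"{parts.scheme}://{parts.netloc}"
        # Resolve relative links against the full page URL. Splitting on "/"
        # breaks for URLs without a path, such as "https://example.com".
        current_path = url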

        # "html.parser" ships with Python; "lxml" also works if it is installed.
        soup = BeautifulSoup(response.text, features="html.parser")

        for anchor in soup.find_all("a"):
            href = anchor.get("href")

            if not href:
                continue

            normalized_url = normalize_link(href, base_url, current_path)

            if not is_same_domain(normalized_url, original_domain):
                continue

            if normalized_url not in visited_urls and normalized_url not in urls_to_visit:
                urls_to_visit.append(normalized_url)

    return discovered_emails


if __name__ == "__main__":
    target_url = input("[+] Enter target URL: ").strip()

    if not target_url.startswith(("http://", "https://")):
        print("[-] Please include http:// or https:// in the target URL.")
        raise SystemExit(1)

    emails = crawl_for_emails(target_url)

    print("\n[+] Discovered email addresses:")
    for email in sorted(emails):
        print(email)
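
The link-handling step in the script above can also be expressed with urljoin alone, since it already resolves root-relative, path-relative, and absolute links against the page they appear on. The sketch below is an alternative, not the author's function: resolve_link is a hypothetical helper that additionally drops #fragment anchors and non-web schemes such as mailto: before a URL is queued.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve_link(page_url: str, href: str) -> Optional[str]:
    """Resolve href against the page it appeared on; return None for non-web links."""
    absolute = urljoin(page_url, href.strip())
    absolute, _fragment = urldefrag(absolute)  # drop "#section" anchors
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None  # mailto:, tel:, javascript: and similar
    return absolute
```

Dropping fragments keeps the crawler from counting the same page several times against max_pages.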

Engineering Considerations for a Reliable Reconnaissance Script

A reconnaissance script used during a security assessment must be more than a quick scraper. It should be controlled, predictable, and safe to run within an agreed scope.

The goal is not to crawl the internet blindly. The goal is to collect relevant public exposure from authorised targets and turn that information into useful security findings.

A well-built email discovery script should consider scope, reliability, request handling, content parsing, and output quality.

1. Keep the Crawl Within Scope

Scope control is one of the most important parts of ethical reconnaissance.

A crawler should not follow every link it finds. Many websites link to external platforms such as social media pages, documentation portals, analytics services, CDNs, partner sites, and payment platforms.

During a penetration test, that matters because authorisation is usually limited to specific domains or assets.

The script should therefore restrict crawling to the original target domain unless the engagement explicitly allows more.

def is_same_domain(url: str, original_domain: str) -> bool:
    try:
        parsed_url = urllib.parse.urlsplit(url)
        return parsed_url.netloc == original_domain
    except ValueError:
        return False

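This keeps the assessment focused and reduces the risk of collecting data from systems outside the authorised boundary. Note that the comparison is an exact, case-sensitive match on the netloc (the host plus any port), so www.example.com and example.com are treated as different domains and subdomains are not followed.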

In professional testing, this is not just a technical decision. It is a governance decision.
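
When the engagement lists several hosts, or explicitly includes subdomains, the same idea can be written as an allowlist check. This is a sketch under assumptions: in_scope, allowed_hosts, and include_subdomains are hypothetical names, and the example hosts are placeholders.

```python
from urllib.parse import urlsplit


def in_scope(url: str, allowed_hosts: set[str], include_subdomains: bool = False) -> bool:
    """Return True when the URL's host is on the agreed allowlist."""
    try:
        host = urlsplit(url).hostname  # lower-cased, without port or credentials
    except ValueError:
        return False
    if not host:
        return False
    if host in allowed_hosts:
        return True
    if include_subdomains:
        # Require a leading dot so "example.com.evil.test" does not match.
        return any(host.endswith("." + allowed) for allowed in allowed_hosts)
    return False
```

Comparing hostname rather than netloc means https://Example.com:443/ and https://example.com/ are treated as the same host.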

2. Use Timeouts and Safe Request Handling

Reconnaissance tools should not hang indefinitely because a target server is slow, unavailable, or misconfigured.

Adding a request timeout keeps the script responsive and prevents a single bad URL from stopping the assessment.

response = requests.get(
    url,
    timeout=10,
    headers={
        "User-Agent": "AuthorizedSecurityAssessmentBot/1.0"
    },
)

Good request handling should account for:

connection errors
missing schemas
timeouts
redirect issues
non-HTML content
non-200 responses

This makes the script more stable and suitable for repeated use during controlled security assessments.
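
Safe request handling also covers how often requests are sent, because an engagement may set rate limits or traffic restrictions. The sketch below shows one way to enforce a minimum gap between requests. RequestPacer is a hypothetical helper, not part of the original script; the clock and sleep parameters exist only so the behaviour can be checked without real waiting.

```python
import time


class RequestPacer:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed, record the request time, and return the seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last_request is not None:
            pause = max(0.0, self.min_interval - (now - self._last_request))
            if pause:
                self._sleep(pause)
        self._last_request = now + pause
        return pause
```

In the crawl loop, pacer.wait() would be called immediately before requests.get, with min_interval set to whatever the rules of engagement specify.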

3. Parse Only Relevant HTML Content

Not every URL points to a web page. A crawler may encounter images, PDFs, JavaScript files, CSS files, downloadable documents, and other non-HTML resources.

A responsible script should check the response type before attempting to parse it as HTML.

content_type = response.headers.get("Content-Type", "")

if "text/html" not in content_type:
    continue

This improves performance and reduces noisy results.

It also helps keep the script focused on the primary objective: identifying publicly visible email addresses exposed through web content.
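
The substring check above is case-sensitive and matches anywhere in the header value. A slightly stricter variant compares only the media type. This is a sketch: is_html and HTML_MEDIA_TYPES are hypothetical names, and including application/xhtml+xml is an assumption about what the assessment wants to parse.

```python
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type_header: str) -> bool:
    """Compare only the media type, ignoring parameters such as charset."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES
```

In the crawl loop this would replace the substring test, for example: if not is_html(content_type): continue.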

4. Extract Emails With a Practical Pattern

Email extraction does not require a perfect validator. For reconnaissance, the objective is to identify likely exposed email addresses from public content.

A practical regular expression is enough for this stage:

email_pattern = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}"

This pattern captures common email formats while avoiding unnecessary complexity. It also produces false positives: a retina image filename such as logo@2x.png matches it.

The output should therefore be reviewed manually. Reconnaissance findings are signals, not final conclusions.
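
Part of that manual review can be automated. The sketch below reuses the same pattern, discards matches that look like asset filenames, and optionally keeps only addresses on the target domain. likely_emails and the ASSET_SUFFIXES list are hypothetical and illustrative; they are not part of the original script.

```python
import re
from typing import Optional

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")

# Extensions that often follow "@2x"-style asset names; an illustrative list.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".css", ".js")


def likely_emails(text: str, target_domain: Optional[str] = None) -> set[str]:
    """Return lower-cased matches that do not look like filenames.

    When target_domain is given, keep only addresses on that domain
    or one of its subdomains.
    """
    results = set()
    for match in EMAIL_PATTERN.findall(text):
        candidate = match.lower()
        if candidate.endswith(ASSET_SUFFIXES):
            continue  # e.g. "logo@2x.png"
        domain = candidate.rsplit("@", 1)[1]
        if target_domain and not (
            domain == target_domain or domain.endswith("." + target_domain)
        ):
            continue
        results.add(candidate)
    return results
```

Lower-casing also merges duplicates that differ only in letter case, which keeps the final list shorter.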

5. Use a Clear User-Agent

A professional assessment should avoid looking like anonymous or suspicious traffic.

A basic user-agent can help identify the purpose of the request:

"User-Agent": "AuthorizedSecurityAssessmentBot/1.0"

For internal assessments, red team exercises, or client-approved tests, the user-agent can be agreed in advance with the client or monitoring team.

This helps defenders distinguish authorised assessment traffic from malicious activity.

What the Results Reveal

The value of email discovery is not simply the list of addresses collected. The real value is in what the list reveals about the organisation’s external exposure.

Discovered email addresses can show:

how employee email addresses are structured
which departments are publicly exposed
whether old or inactive accounts are still referenced online
whether sensitive business functions are easy to identify
whether public pages expose more staff information than necessary
whether attackers can build a phishing target list from open sources

For example, a public website may expose addresses such as:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

From an attacker’s perspective, this information can support phishing, password spraying, impersonation, and business email compromise attempts.

From a defender’s perspective, it helps identify where controls, awareness, and monitoring should be improved.
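
To turn a raw list into the observations above, the local parts (the text before the @) can be grouped by naming convention. This is a rough sketch under assumptions: the categories, the ROLE_LOCAL_PARTS list, and both function names are hypothetical, and real conventions will need more rules than these.

```python
from collections import Counter
from typing import Iterable

# Common shared-mailbox names; an illustrative list, not an exhaustive one.
ROLE_LOCAL_PARTS = {"info", "support", "sales", "admin", "hr", "careers", "security", "contact"}


def classify_local_part(local: str) -> str:
    """Label a local part with a coarse naming-convention category."""
    local = local.lower()
    if local in ROLE_LOCAL_PARTS:
        return "role-based"
    pieces = local.split(".")
    if len(pieces) == 2 and all(piece.isalpha() for piece in pieces):
        return "f.last" if len(pieces[0]) == 1 else "first.last"
    if local.isalpha():
        return "single-word"
    return "other"


def summarize_conventions(emails: Iterable[str]) -> Counter:
    """Count how many addresses fall into each category."""
    return Counter(classify_local_part(email.split("@", 1)[0]) for email in emails)
```

A dominant category such as first.last suggests that an attacker could guess further addresses from a public staff list, which is worth stating explicitly in the report.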

Defensive Recommendations

Email exposure is not always a vulnerability by itself. Many organisations need public contact addresses. The risk comes from uncontrolled exposure, weak email security controls, and poor monitoring.

A security assessment should therefore convert email discovery into practical defensive actions.

Use Role-Based Public Addresses Where Appropriate

Public pages do not always need to expose individual staff email addresses.

Where possible, organisations should use controlled role-based addresses such as:

[email protected]
[email protected]
[email protected]
[email protected]

These addresses are easier to monitor, rotate, protect, and route internally.

Strengthen Email Authentication

Publicly visible email addresses increase the importance of strong email authentication.

Organisations should review:

SPF
DKIM
DMARC

A DMARC policy set to quarantine or reject, backed by SPF and DKIM records that align with the sending domain, reduces spoofing risk and improves trust in legitimate email communication.
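
As a small aid for that review, a DMARC TXT record can be split into its tags to see which policy is published. The sketch below assumes the record text has already been retrieved (for example from the _dmarc subdomain with a DNS lookup tool). parse_dmarc and dmarc_enforced are hypothetical helpers, and the check deliberately ignores the pct and sp tags.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_enforced(record: str) -> bool:
    """True when the policy asks receivers to quarantine or reject failing mail."""
    tags = parse_dmarc(record)
    return tags.get("v") == "DMARC1" and tags.get("p", "").lower() in {"quarantine", "reject"}
```

A record with p=none only requests reports, so a domain in that state still relies on receivers' own judgement about spoofed mail.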

Monitor for Credential Exposure

If an email address is public, it may eventually appear in breach datasets, phishing kits, spam lists, or credential stuffing attempts.

Security teams should monitor exposed corporate email addresses for:

leaked credentials
reused passwords
suspicious authentication attempts
password spraying indicators
unusual login geography
failed MFA attempts

Improve Phishing Awareness

Email discovery often supports phishing preparation. Defenders should use that knowledge to strengthen awareness training.

Training should cover:

suspicious links
attachment handling
invoice and payment fraud
impersonation attempts
executive spoofing
reporting procedures
MFA fatigue attacks

Reduce Unnecessary Exposure

Old web pages, PDF documents, archived announcements, event pages, and uploaded files can continue exposing email addresses long after they are needed.

A good remediation activity is to review public content and remove unnecessary personal email addresses where role-based alternatives would be safer.

Limitations of the Script

This script is intentionally simple and defensive in nature.

It does not:

bypass authentication
access private systems
evade detection
scrape search engines
use leaked databases
validate whether an email account is active
send emails
perform phishing
perform password spraying
interact with mail servers

That limitation is intentional.

The purpose is to identify public exposure from authorised web content, not to enable abuse.

For a more mature internal assessment tool, additional features may include:

domain allowlists
crawl delay
robots.txt awareness
structured CSV or JSON export
logging
rate limiting
report generation
duplicate filtering
risk scoring
integration with external attack surface monitoring
evidence capture for remediation tracking
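
As an example of the structured export and evidence capture items, findings can be stored together with the pages where each address was seen. This is a minimal sketch: FindingsLog and its methods are hypothetical names, and the JSON layout is an assumption rather than a reporting standard.

```python
import json
from collections import defaultdict


class FindingsLog:
    """Collect each address together with the pages it was seen on."""

    def __init__(self) -> None:
        self._sources = defaultdict(set)

    def record(self, email: str, page_url: str) -> None:
        """Note that an address appeared on a given page."""
        self._sources[email.lower()].add(page_url)

    def to_json(self) -> str:
        """Serialise findings as a sorted list of {email, sources} objects."""
        rows = [
            {"email": email, "sources": sorted(urls)}
            for email, urls in sorted(self._sources.items())
        ]
        return json.dumps(rows, indent=2)
```

Recording the source URL makes remediation easier, because the client can see exactly which page needs to be edited or removed.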

Responsible Use

Email reconnaissance must be handled carefully.

Only run this type of script:

on domains you own
on systems you are authorised to assess
within a defined penetration testing scope
with permission from the organisation
for defensive, educational, or research purposes

Do not use it to harvest emails for spam, phishing, credential attacks, harassment, or unauthorised testing.

A professional security assessment is not defined by the tools used. It is defined by authorisation, scope, evidence handling, technical judgement, and responsible reporting.

Watch the Video

You can also watch the walkthrough on YouTube.

https://youtube.com/watch?v=DeQjgf_f66s

Get the Code

The source code and usage instructions are available on GitHub.

Final Thoughts

Reconnaissance is where many security assessments begin.

A simple Python script can reveal how much information an attacker may already be able to collect from public sources. However, the real value is not just in extracting email addresses. The real value is in understanding what that exposure means for the organisation.

The right questions are:

What information is publicly visible?
Why is it exposed?
Could an attacker use it?
Does it increase phishing or impersonation risk?
What controls reduce that risk?
How should the organisation monitor and respond?

That is the difference between running a script and performing a meaningful security assessment.
