A technical deep dive into how link extractors work, from HTTP requests to SEO auditing workflows.
Every web page is a web of connections. Understanding those connections is fundamental to SEO, site maintenance, and competitive analysis. A link extractor automates the process of finding and cataloging every URL on a page. But how exactly do these tools work under the hood? This guide covers the full technical pipeline: from the initial HTTP request to the final SEO audit report.
Every link extraction begins with fetching the web page. The tool sends an HTTP GET request to the target URL and receives an HTML response. While this sounds straightforward, several technical considerations affect the quality of extraction.
The request must include appropriate headers. A User-Agent header identifies the client. Without it, many servers return a simplified or blocked response. Similarly, Accept and Accept-Language headers tell the server what content format and language the client prefers, which can affect the HTML returned (some sites serve different content based on these headers).
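As a sketch with Python's standard library, the headers above might be attached like this (the bot name and header values are illustrative, not a standard):

```python
import urllib.request

# Headers a link extractor typically sends. The User-Agent value below is a
# made-up example bot identifier, not a real product.
DEFAULT_HEADERS = {
    "User-Agent": "LinkExtractorBot/1.0 (+https://example.com/bot)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying the default extractor headers."""
    return urllib.request.Request(url, headers=DEFAULT_HEADERS)
```

The same idea applies with the requests library, where the headers go in a `headers=` keyword argument.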
Redirect handling is another critical factor. A URL might redirect through several hops before reaching the final page. A robust link extractor follows the redirect chain and reports the final destination URL. This is important because links pointing to redirecting URLs are often a sign of outdated content or restructured sites.
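One way to record the redirect chain using only the standard library is to subclass urllib's redirect handler; `ChainRecorder` and `summarize_chain` are illustrative names, not part of any API:

```python
import urllib.request

class ChainRecorder(urllib.request.HTTPRedirectHandler):
    """Record each redirect hop while urllib follows the chain."""
    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))  # e.g. (301, 'https://example.com/')
        return super().redirect_request(req, fp, code, msg, headers, newurl)

def follow_redirects(url: str, timeout: float = 10.0):
    """Fetch a URL and return (final_url, list of redirect hops)."""
    recorder = ChainRecorder()
    opener = urllib.request.build_opener(recorder)
    with opener.open(url, timeout=timeout) as resp:
        return resp.geturl(), recorder.hops

def summarize_chain(hops) -> str:
    """Render a chain like '301 http://a.com/ -> 200 https://a.com/'."""
    return " -> ".join(f"{status} {url}" for status, url in hops)
```

With requests, the equivalent information is available as `response.history` after a call with `allow_redirects=True`.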
HTTP status codes determine whether extraction proceeds. A 200 response means the page loaded successfully. A 301 or 302 redirect means the URL has moved. A 404 means the page does not exist, and a 500 indicates a server error. A good link extractor reports the status code for every URL it encounters, not just the initial request.
For large-scale extraction, tools implement request throttling and respect robots.txt. Sending too many requests too quickly can trigger rate limiting or IP bans. Responsible tools add delays between requests and check the target site's crawl rules before proceeding.
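A minimal sketch of the robots.txt check with the standard library's robotparser; the rules below are a made-up example, and a real tool would first fetch the site's /robots.txt:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse example rules directly; in practice, call rp.set_url(...) and rp.read()
# to fetch the live robots.txt before crawling.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

def fetch_allowed(url: str, agent: str = "*") -> bool:
    """Check whether the crawl rules permit fetching this URL."""
    return rp.can_fetch(agent, url)

# A polite crawler would also call time.sleep(rp.crawl_delay("*") or 1)
# between requests to avoid triggering rate limits.
```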
Once the HTML is received, the tool parses it to find link-containing elements. The primary target is the anchor tag (<a href="...">), but links also appear in other elements: <img src="...">, <link href="...">, <script src="...">, <iframe src="...">, and CSS url() references.
The parsing process uses an HTML parser (like Python's BeautifulSoup or a browser's DOM parser) to build a tree structure of the document. Each node is inspected for URL-containing attributes. The parser handles malformed HTML gracefully, which is essential because a significant percentage of web pages have invalid markup.
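The attribute-inspection step can be sketched with the standard library's lenient HTML parser, which tolerates malformed markup; `LinkCollector` and `URL_ATTRS` are illustrative names:

```python
from html.parser import HTMLParser

# Tag -> attribute that may carry a URL (a subset; CSS url() references
# would need separate handling).
URL_ATTRS = {"a": "href", "img": "src", "link": "href",
             "script": "src", "iframe": "src"}

class LinkCollector(HTMLParser):
    """Walk start tags and collect URL-bearing attribute values."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

BeautifulSoup offers the same capability at a higher level, e.g. `soup.find_all("a", href=True)`.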
Relative URLs require resolution. A link like /about needs to be converted to an absolute URL using the base URL of the page. The <base> tag, if present, overrides the default base URL. This resolution step ensures that every extracted URL is a fully qualified, usable link.
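The resolution step maps directly onto the standard library's urljoin; the optional `base_href` parameter below models a page-level <base> tag:

```python
from urllib.parse import urljoin

def resolve(page_url: str, href: str, base_href: str = None) -> str:
    """Resolve a (possibly relative) href against the page URL.

    If the page declares a <base href>, that overrides the page URL
    as the resolution base.
    """
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)
```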
URL encoding and decoding also play a role. Links might contain percent-encoded characters (%20 for a space, %E2%9C%93 for the check mark character ✓). A thorough extractor decodes these for consistent, human-readable reporting while preserving the original encoded form for accurate linking.
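The decoding side is a one-liner with the standard library; `display_form` is an illustrative helper name:

```python
from urllib.parse import unquote

def display_form(url: str) -> str:
    """Decode percent-escapes for human-readable reporting.

    The original encoded URL should still be kept for actual linking,
    since some servers require the escaped form.
    """
    return unquote(url)
```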
Raw extraction produces a flat list of URLs. The next step is classification. Links are categorized into several types, each serving a different analytical purpose.
Internal links point to the same domain as the source page. These include navigation links, in-content links, breadcrumbs, and footer links. Internal links are the backbone of site architecture. They distribute page authority, help search engines discover pages, and guide users through the content.
When analyzing internal links, pay attention to orphan pages (pages that exist but have no internal links pointing to them). These pages are essentially invisible to both users and search engine crawlers. A link extractor that crawls multiple pages can identify orphans by comparing the set of known pages against the set of linked-to pages.
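The orphan comparison described above is a simple set difference, assuming you have a list of known pages (from a sitemap, for instance) and the union of all internally linked URLs:

```python
def find_orphans(known_pages: set, linked_pages: set) -> set:
    """Pages that exist but receive no internal links.

    known_pages: every page URL the site contains (e.g. from the sitemap).
    linked_pages: every URL found as an internal link target during the crawl.
    """
    return known_pages - linked_pages
```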
External links point to different domains. These include references to sources, affiliate links, social media profiles, and partner sites. External links affect your site's credibility and can impact SEO. Linking to authoritative sources adds trust signals, while linking to spam or low-quality sites can harm your reputation.
External link analysis also reveals your site's relationship with other domains. A high number of external links to a single domain might indicate a partnership, sponsorship, or content scraping. Monitoring changes in your external link profile over time helps detect unwanted changes.
Resource links point to non-HTML assets: images, CSS files, JavaScript files, fonts, videos, and PDFs. These are not navigational links but they affect page performance and user experience. A page with 50 external resource links will load slower than one with 10, all else being equal.
Resource link analysis is valuable for performance optimization. Identifying large images, unused CSS files, or third-party scripts that load from slow CDNs helps prioritize optimization efforts.
Additional classifications include: mailto: links (email addresses), tel: links (phone numbers), javascript: links (inline scripts, often a code smell), fragment links (#section, for in-page navigation), and protocol-relative links (//example.com). Each type has different implications for site quality and should be handled differently in analysis.
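The classification rules above can be sketched as a single function; the category names and the `site_domain` parameter are illustrative choices, not a standard taxonomy:

```python
from urllib.parse import urlparse

def classify(url: str, site_domain: str) -> str:
    """Roughly classify an extracted URL; site_domain like 'example.com'."""
    if url.startswith("mailto:"):
        return "email"
    if url.startswith("tel:"):
        return "phone"
    if url.startswith("javascript:"):
        return "javascript"
    if url.startswith("#"):
        return "fragment"
    # Protocol-relative links ('//cdn.example.com/x') still yield a netloc here.
    host = urlparse(url).netloc
    if not host or host == site_domain:
        return "internal"
    return "external"
```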
One of the most valuable features of a link extractor is the ability to check whether extracted links are still valid. Dead link detection (also called broken link checking) sends HTTP requests to each extracted URL and reports the status.
The process uses HTTP HEAD requests rather than GET requests for efficiency. A HEAD request retrieves only the response headers, not the body, making it significantly faster. However, some servers do not support HEAD requests and return 405 (Method Not Allowed). In these cases, the tool falls back to a GET request.
Status codes are interpreted as follows: 2xx responses mean the link is alive; 3xx means it redirects (not broken, but worth updating to point at the final destination); 4xx means the target is missing or inaccessible and the link is broken; and 5xx indicates a server error, which may be temporary and is worth rechecking before flagging. Connection failures, DNS errors, and timeouts are also reported as broken.
Dead links harm user experience (clicking a broken link is frustrating), waste crawl budget (search engines spending time on non-existent pages), and signal neglect (a site with many broken links appears unmaintained). Regular dead link detection should be part of every site maintenance routine.
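The check-and-interpret logic above can be sketched with the standard library; the status buckets in `interpret` reflect a common convention rather than a fixed standard:

```python
import urllib.error
import urllib.request

def status_of(url: str, timeout: float = 10.0) -> int:
    """Try a HEAD request first; fall back to GET if the server returns 405."""
    for method in ("HEAD", "GET"):
        try:
            req = urllib.request.Request(url, method=method)
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            if method == "HEAD" and err.code == 405:
                continue  # server rejects HEAD; retry with GET
            return err.code
        except urllib.error.URLError:
            return 0  # DNS failure, refused connection, or timeout
    return 0

def interpret(status: int) -> str:
    """Bucket a status code for dead-link reporting."""
    if 200 <= status < 300:
        return "alive"
    if 300 <= status < 400:
        return "redirect"
    if status == 0:
        return "unreachable"
    return "broken"  # 4xx and 5xx
```

Note that urlopen follows redirects by default, so 3xx codes only surface here if redirect handling is disabled.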
Link extraction data feeds directly into SEO auditing workflows. Here are the key analyses that link data enables:
Internal link structure analysis: Map how pages connect to each other. A healthy site has a logical hierarchy where important pages receive many internal links and sit only a few clicks from the homepage. Tools like Screaming Frog and Ahrefs use link extraction data to build site architecture visualizations.
Page authority distribution: Pages with many internal links pointing to them tend to rank higher. By analyzing link counts, you can identify pages that deserve more internal links and pages that are over-linked relative to their importance.
Anchor text analysis: The text inside an anchor tag (<a href="...">anchor text</a>) tells search engines what the linked page is about. Extracting and analyzing anchor text distribution reveals whether your internal links use descriptive, keyword-rich text or generic phrases like "click here."
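A simple sketch of anchor text distribution analysis; the list of generic phrases is an illustrative starting point, not an exhaustive one:

```python
from collections import Counter

# Phrases that carry no information about the linked page (partial list).
GENERIC_PHRASES = {"click here", "read more", "here", "learn more", "this"}

def anchor_text_report(anchors: list) -> dict:
    """Count anchor texts and compute the share of generic phrases."""
    counts = Counter(a.strip().lower() for a in anchors)
    generic = sum(n for text, n in counts.items() if text in GENERIC_PHRASES)
    return {
        "counts": counts,
        "generic_ratio": generic / max(len(anchors), 1),
    }
```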
Nofollow vs. dofollow: Links with the rel="nofollow" attribute ask search engines not to follow the link or pass authority (modern engines treat this as a hint rather than a strict directive). Auditing the ratio of nofollow to dofollow links on your pages helps ensure you are not accidentally nofollowing important internal links.
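One detail worth handling: rel can hold several space-separated tokens, so a substring check is not enough. A minimal sketch, assuming the rel attribute arrives as its raw string value:

```python
def is_nofollow(rel_attr) -> bool:
    """True if the rel attribute contains the 'nofollow' token.

    rel may carry multiple tokens, e.g. rel="noopener nofollow",
    so we split on whitespace rather than matching substrings.
    """
    return bool(rel_attr) and "nofollow" in rel_attr.lower().split()
```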
External link quality assessment: Evaluate the domains you link to. Links to high-authority educational, government, or industry-leading sites add credibility. Links to low-quality or irrelevant sites should be removed or nofollowed.
If you are a developer, building a basic link extractor is straightforward. Python with the requests and BeautifulSoup libraries can extract all links from a page in under 20 lines of code. However, production-quality link extraction requires handling redirects, respecting robots.txt, managing rate limits, parsing JavaScript-rendered content, and providing a useful UI for the results.
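For illustration, a minimal extractor along those lines, assuming the third-party requests and beautifulsoup4 packages are installed:

```python
import requests  # third-party: pip install requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
from urllib.parse import urljoin

def parse_links(html: str, base_url: str) -> list:
    """Extract all anchor hrefs from HTML, resolved to absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def extract(url: str) -> list:
    """Fetch a page and return every anchor link it contains."""
    resp = requests.get(url, timeout=10,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    # resp.url is the final URL after redirects, the correct resolution base.
    return parse_links(resp.text, resp.url)
```

This covers the happy path only; none of the robustness concerns listed above (robots.txt, rate limits, JavaScript rendering) are handled here.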
For most users, an online tool is the better choice. RiseTop's Link Extractor handles all the technical complexity: it fetches the page, parses the HTML, classifies links by type, checks for broken links, and presents the results in a clean, filterable interface. No coding required, no setup, and it works from any device.
Whether you are auditing your own site, analyzing a competitor's link profile, or debugging a broken navigation menu, a link extractor gives you the data you need to make informed decisions. The key is understanding what the tool does under the hood so you can interpret the results correctly and take meaningful action.