What is Robots.txt?
Robots.txt is a plain text file that webmasters create to instruct search engine crawlers (also called robots, bots, or spiders) about which pages or sections of their website should or should not be crawled. The file uses the Robots Exclusion Protocol (REP), a convention that has governed crawler behavior since 1994 and was formalized as an Internet standard (RFC 9309) in 2022. When a search engine crawler arrives at your website, it checks for a robots.txt file at the root URL path (/robots.txt) before crawling any other pages. Based on the directives in this file, the crawler decides which URLs to request and which to skip.
Robots.txt is one of the most fundamental files in technical SEO because it directly controls how search engines discover and interact with your website's content. A misconfigured robots.txt file can prevent search engines from crawling important pages (causing them to disappear from search results), waste crawl budget on low-value pages, or publicly advertise sensitive paths — robots.txt itself is readable by anyone, so listing a directory there does not hide it. Our Robots.txt Tester lets you analyze any robots.txt file, test URLs against its rules, and identify potential issues before they impact your search rankings.
Robots.txt Syntax and Directives
User-agent Directive
The User-agent directive specifies which crawler the subsequent rules apply to. Each rule block starts with a User-agent line followed by one or more Allow or Disallow directives. To target all crawlers, use the wildcard value: User-agent: *. To target a specific crawler like Googlebot, use its exact name: User-agent: Googlebot. You can create separate rule blocks for different crawlers to apply different restrictions. For example, you might allow Googlebot full access while blocking a less important crawler from certain directories.
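For example, a file with two rule blocks might give Googlebot unrestricted access while keeping a hypothetical crawler (here called SlowBot — the name and paths are illustrative) out of resource-heavy directories:

```text
User-agent: Googlebot
Disallow:

User-agent: SlowBot
Disallow: /archive/
Disallow: /media/
```

An empty Disallow: value means "block nothing," so the Googlebot block explicitly permits everything.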
Allow and Disallow Directives
The Allow directive explicitly permits crawling of a URL path, while Disallow blocks it. If a path is not mentioned in any rule, it is allowed by default. The path matching is prefix-based, meaning Disallow: /admin/ blocks all URLs starting with "/admin/", including /admin/dashboard, /admin/users, and /admin/settings/edit. To block the entire site, use Disallow: /. To allow everything, either omit Disallow directives entirely or use an empty directive (Disallow: with no value).
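Python's standard library ships a parser that implements this original prefix-based matching (it does not support Google-style wildcards), which makes it handy for quickly sanity-checking Disallow rules:

```python
from urllib import robotparser

# Parse a small rule set in memory — no network fetch needed.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Prefix matching: everything under /admin/ is blocked...
print(parser.can_fetch("*", "/admin/dashboard"))  # False
# ...while paths matched by no rule are allowed by default.
print(parser.can_fetch("*", "/blog/post-1"))      # True
```

Because matching is purely prefix-based here, "/admin" (without the trailing slash) would not be caught by Disallow: /admin/ — a subtle edge case worth testing on real sites.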
Sitemap Directive
The Sitemap directive tells crawlers where to find your XML sitemap. This is not a rule but rather a hint that helps crawlers discover all your important pages efficiently. Place the full URL to your sitemap file: Sitemap: https://example.com/sitemap.xml. You can include multiple Sitemap directives if you have multiple sitemap files. Search engines don't guarantee they'll crawl every URL in the sitemap, but providing it significantly improves crawl coverage, especially for large websites with deep page hierarchies.
Wildcards and Patterns
Modern robots.txt supports pattern matching using the asterisk (*) wildcard, which matches any sequence of characters, and the dollar sign ($) anchor, which matches the end of a URL. For example, Disallow: /*?session= blocks all URLs containing a session parameter, and Disallow: /*.pdf$ blocks all PDF files. An Allow rule such as Allow: /blog/*/post- can re-permit matching blog posts inside an otherwise disallowed section. These pattern-matching capabilities make robots.txt rules significantly more flexible and precise than simple path matching.
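A minimal sketch of how this Google-style matching works: translate the pattern into a regular expression where * becomes ".*" and a trailing $ becomes an end-of-string anchor. (This is an illustrative implementation, not the behavior of Python's stdlib parser, which ignores wildcards.)

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    # A trailing '$' in robots.txt anchors the match to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which maps to ".*" (any char sequence).
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    # Robots rules match from the start of the path, hence re.match.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(matches("/*?session=", "/cart?session=abc"))  # True
print(matches("/*.pdf$", "/files/report.pdf"))      # True
print(matches("/*.pdf$", "/files/report.pdfx"))     # False: '$' anchor fails
```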
Complete Robots.txt Examples
Basic Robots.txt
A basic robots.txt file for a typical website looks like this:
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /tmp/
Disallow: /internal/
Sitemap: https://example.com/sitemap.xml
This allows all crawlers to access the entire site except for the /admin/, /tmp/, and /internal/ directories, and points them to the XML sitemap for efficient page discovery. This is a good starting template for most websites.
WordPress Robots.txt
WordPress sites have specific directories and paths that should typically be blocked from crawling:
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /trackback/
Disallow: /?s=
Disallow: /*?replytocom=
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
This configuration blocks WordPress admin and system files while allowing the admin-ajax.php endpoint that many themes and plugins depend on. It also blocks internal search results and trackback URLs, which can cause duplicate content issues.
E-commerce Robots.txt
E-commerce sites need to manage crawl budget carefully because they often have thousands of product pages with filter and sort parameters that create massive URL spaces:
User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/?*
Disallow: /*?sort=
Disallow: /*?price=
Disallow: /*?page=100
Disallow: /api/
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-categories.xml
Robots.txt vs. Meta Robots: Key Differences
One of the most common SEO misconceptions is confusing robots.txt with the meta robots tag. They serve different purposes and behave differently. Robots.txt controls whether crawlers can access and crawl a URL, but it does not prevent the URL from being indexed. If a blocked URL is linked from other pages (either on your site or external sites), search engines may still index it and display it in search results, though typically without a page description since they couldn't crawl it.
The meta robots tag with the noindex directive, placed in the HTML <head> of a page, tells search engines not to index that page even if they crawl it. This is the correct way to prevent a page from appearing in search results. Crucially, noindex only works if crawlers can reach the page: if robots.txt blocks a URL, crawlers never see its meta robots tag, and the URL can remain indexed based on links alone. The recommended approach is therefore to allow crawling of pages you want deindexed (so crawlers can read the noindex tag), and to reserve robots.txt blocking for pages you simply don't want crawlers to waste resources on, regardless of indexing.
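The noindex directive itself is a standard meta tag placed in the page's <head>:

```html
<!-- Keeps this page out of search results. Do NOT also block the page
     in robots.txt, or crawlers will never see this tag. -->
<meta name="robots" content="noindex">
```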
Crawl Budget Management
Crawl budget is the number of URLs Googlebot (and other crawlers) is willing to crawl on your site within a given timeframe. For small sites with fewer than a few thousand pages, crawl budget is rarely a concern. But for large sites with hundreds of thousands or millions of pages, crawl budget management becomes critical. If Google can only crawl 10,000 pages per day on your site but you have 100,000 product pages, you need to ensure Google is crawling your most important pages rather than wasting budget on low-value URLs.
Robots.txt is the primary tool for managing crawl budget. Block internal search result pages, faceted navigation URLs with filter parameters, session ID URLs, printer-friendly versions of pages, auto-generated tag or category archive pages with thin content, and API endpoints that produce machine-readable responses. By directing crawlers away from these low-value URLs, you ensure more of your crawl budget is spent on your most important, content-rich pages that actually drive search traffic and conversions.
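As a sketch, a crawl-budget-focused rule block covering those categories might look like this (all paths and parameter names are illustrative — adapt them to your site's actual URL structure):

```text
User-agent: *
Disallow: /search/          # internal search result pages
Disallow: /*?filter=        # faceted navigation parameters
Disallow: /*?sessionid=     # session ID URLs
Disallow: /print/           # printer-friendly duplicates
Disallow: /api/             # machine-readable endpoints
```

Comments after # are valid robots.txt syntax and are ignored by crawlers.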
Common Robots.txt Mistakes
Blocking CSS and JavaScript Files
One of the most damaging robots.txt mistakes is blocking crawlers from accessing CSS and JavaScript files. Google renders pages using a web rendering service (like a headless Chrome browser) to see the page as users do. If CSS and JS files are blocked, Google cannot fully render the page, which means it cannot see content that is loaded dynamically via JavaScript, cannot understand the page's visual layout and styling, and may assign lower quality scores to pages it cannot fully render. Always allow access to CSS and JS files, especially the ones loaded by your critical pages.
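If a broad rule such as Disallow: /assets/ is unavoidable, explicit Allow rules with Google-style wildcards can carve the CSS and JS back out — an illustrative sketch (directory name assumed):

```text
User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$
```

Google resolves conflicts by applying the most specific (longest) matching rule, so the Allow rules win for stylesheets and scripts here.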
Wildcards Blocking Too Much
Overly broad Disallow rules can accidentally block important pages. For example, Disallow: /blog/ blocks all blog content, which is usually the opposite of what you want. Disallow: /*? blocks all URLs with query parameters, which can take out legitimate pages such as campaign-tagged landing pages (e.g., ?utm_source=...). Always test your robots.txt rules carefully using a testing tool to verify that important pages are not being blocked. The specificity of your rules matters: use the most specific path possible to avoid collateral blocking.
Forgetting to Update After Site Changes
When you restructure your website, add new sections, change URL patterns, or migrate to a new CMS, your robots.txt file needs to be updated accordingly. A robots.txt file that was correct for your old site structure may block or allow the wrong paths on the new structure. Make robots.txt review a standard part of any website migration or major restructuring checklist. Test the updated robots.txt file thoroughly before and after the migration to ensure continuous proper crawl behavior.
How to Use a Robots.txt Tester
Analyzing Your Robots.txt File
A robots.txt tester tool fetches your robots.txt file and parses its rules, displaying each directive in a structured format that's easy to review. Our Robots.txt Tester goes beyond simple display by simulating how different user-agents would interpret your rules, testing specific URLs against your rules to see whether they would be allowed or blocked, highlighting syntax errors or ambiguous directives, checking for common mistakes like blocking CSS/JS files, and verifying that your sitemap directive points to a valid, accessible sitemap file.
Testing URLs Against Rules
The most valuable feature of a robots.txt tester is the ability to test specific URLs against your rules. Enter any URL path from your site, and the tool will tell you whether the specified user-agent would be allowed or blocked from crawling that URL. This is essential for verifying that your rules work as intended. Test your most important pages (homepage, key landing pages, product pages) to ensure they are allowed. Test your blocked paths (admin areas, internal search) to confirm they are properly blocked. Test edge cases like URLs with query parameters, pagination, and special characters to ensure your rules handle them correctly.
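This kind of URL checking can also be scripted. The sketch below uses Python's stdlib parser (prefix matching only, no wildcards) to run a hypothetical list of (user-agent, path, expected) cases against a rule set — the agents, paths, and expectations are illustrative:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /cart/
Disallow: /account/

User-agent: BadBot
Disallow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# (user-agent, URL path, expected crawlability) — hypothetical test cases.
cases = [
    ("Googlebot", "/", True),               # homepage must stay crawlable
    ("Googlebot", "/cart/checkout", False), # cart should be blocked
    ("BadBot", "/products/widget", False),  # BadBot is blocked site-wide
]

for agent, path, expected in cases:
    allowed = parser.can_fetch(agent, path)
    status = "OK" if allowed == expected else "MISMATCH"
    print(f"{status}: {agent} -> {path} allowed={allowed}")
```

Running a list like this after every robots.txt change turns "did I break crawling?" into a repeatable regression check.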
Ready to test your robots.txt file? Use our free Robots.txt Tester to analyze your rules and verify that search engines can crawl your important pages correctly.
Frequently Asked Questions
What is a robots.txt file?
A robots.txt file is a plain text file placed in the root directory of a website that tells search engine crawlers which pages or sections they should or should not crawl. It uses the Robots Exclusion Protocol and must be accessible at yoursite.com/robots.txt. Search engines check this file before crawling any page on your site.
Can robots.txt hide pages from search results?
No, robots.txt only controls crawling, not indexing. If a URL is blocked by robots.txt but linked from other pages, search engines may still index it and display it in results (without a description). To truly prevent a page from appearing in search results, use the meta robots tag with the 'noindex' directive — and make sure the page is not blocked by robots.txt, so crawlers can actually see that tag.
How do I test my robots.txt file?
You can test your robots.txt file using Google Search Console's robots.txt report (which replaced the standalone robots.txt Tester), our online Robots.txt Tester tool which simulates how crawlers interpret your rules, or by using curl to fetch the file directly (curl yoursite.com/robots.txt). Always test after making changes to ensure you haven't accidentally blocked important pages.
Where should the robots.txt file be placed?
The robots.txt file must be placed in the root directory of your website and accessible at the exact path /robots.txt. For example, on example.com, it must be at example.com/robots.txt. It cannot be placed in a subdirectory like example.com/docs/robots.txt. If your site serves from both www and non-www versions, both should have identical robots.txt files.
What happens if I don't have a robots.txt file?
If no robots.txt file exists, search engine crawlers assume full access and will crawl everything they discover. This is generally fine for most websites, but having a robots.txt file is still recommended because it lets you control crawl budget, block access to admin areas and internal search pages, point crawlers to your XML sitemap, and reduce server load from unnecessary crawling.