Learn robots.txt syntax, common patterns, and best practices to optimize your site's crawl efficiency and protect sensitive content.
The robots.txt file is one of the first files search engine crawlers request when visiting your website. It acts as a gatekeeper, telling bots which pages they can and cannot access. A well-configured robots.txt file improves crawl efficiency, protects sensitive resources, and ensures Googlebot focuses on your most important content.
In this guide, you'll learn everything about robots.txt — from basic syntax to advanced patterns — and how to generate the perfect robots.txt file using our free Robots.txt Generator.
Robots.txt is a plain text file stored in the root directory of your website (https://example.com/robots.txt). It follows the Robots Exclusion Protocol (REP) and provides instructions to web crawlers (also called "user agents") about which parts of your site they are allowed to crawl.
Think of robots.txt as a set of traffic rules for search engine bots. It doesn't prevent indexing directly — that requires a noindex meta tag — but it prevents crawling, which effectively keeps pages out of the index in most cases.
To explicitly block indexing, place `<meta name="robots" content="noindex">` on the page itself.
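You can see the crawling side of this in action with Python's standard `urllib.robotparser`. The rules below are illustrative; for a live site you would fetch the real file with `set_url()` and `read()` instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; for a real site, call
# rp.set_url("https://example.com/robots.txt") and rp.read() instead.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/private/report.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post.html"))       # True
```

A `False` result means a compliant crawler would skip the URL; it says nothing about whether the URL is already indexed.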
The robots.txt file uses a simple, line-based syntax. Each instruction consists of a directive followed by a value. Here are the core directives:
The `User-agent` directive specifies which crawler the rules that follow apply to. You can target specific bots or use the `*` wildcard to cover all of them.

```
# Apply to all crawlers
User-agent: *

# Apply only to Googlebot
User-agent: Googlebot

# Apply to Bingbot
User-agent: Bingbot

# Apply to multiple specific bots (stacked lines share one rule group)
User-agent: Googlebot
User-agent: Bingbot
```

The `Disallow` directive specifies which paths the crawler should not access. A blank `Disallow` line means everything is allowed.
```
# Block a specific page
Disallow: /private-page.html

# Block an entire directory
Disallow: /admin/

# Block all pages with a certain parameter
Disallow: /*?session=

# Block everything (complete crawl block)
Disallow: /

# Allow everything (explicit)
Disallow:
```

The `Allow` directive explicitly permits access to a path, even if a broader `Disallow` rule would block it. This is especially useful for allowing specific files within a blocked directory.
```
User-agent: *
Disallow: /api/
Allow: /api/public/
```

The `Sitemap` directive tells crawlers where to find your XML sitemap. It should be placed at the end of the file.
```
Sitemap: https://example.com/sitemap.xml
```

The `Crawl-delay` directive specifies a delay (in seconds) between successive requests. Note: Googlebot ignores this directive, but Bingbot respects it.
```
User-agent: Bingbot
Crawl-delay: 10
```

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```

This is the simplest configuration — all crawlers can access everything. Suitable for most small websites.
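Configurations like these can be sanity-checked locally with Python's standard `urllib.robotparser`. Note one caveat: CPython's parser evaluates `Allow`/`Disallow` lines in file order (first match wins), while Googlebot prefers the most specific match, so the `Allow` line is listed first in this illustrative rule set:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules. Allow comes before Disallow because Python's
# parser applies rules in file order (first match wins); Googlebot
# instead prefers the most specific (longest) matching rule.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /api/

User-agent: Bingbot
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("SomeBot", "https://example.com/api/public/docs"))  # True
print(rp.can_fetch("SomeBot", "https://example.com/api/internal"))     # False
print(rp.crawl_delay("Bingbot"))                                       # 10
```

`crawl_delay()` returns the parsed value for the matching user-agent group, which is useful when writing a polite crawler of your own.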
```
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login/
Disallow: /private/
Disallow: /tmp/
Disallow: /*?utm_*
Disallow: /*?ref=
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml
```

This pattern blocks common administrative paths while keeping the front-end fully crawlable. It also blocks tracking parameters to prevent duplicate URLs.
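Boilerplate like this can also be assembled programmatically. The sketch below is a hypothetical minimal generator, not a real library; the `render_robots` helper and its group structure are illustrative:

```python
def render_robots(groups, sitemaps=()):
    """Render robots.txt text from (user_agent, rules) pairs.

    `rules` is a list of (directive, path) tuples,
    e.g. ("Disallow", "/admin/").
    """
    lines = []
    for user_agent, rules in groups:
        lines.append(f"User-agent: {user_agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line separates rule groups
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines) + "\n"

text = render_robots(
    groups=[
        ("*", [("Disallow", "/admin/"),
               ("Disallow", "/login/"),
               ("Allow", "/wp-admin/admin-ajax.php")]),
    ],
    sitemaps=["https://example.com/sitemap.xml"],
)
print(text)
```

Generating the file from data makes it easy to keep one source of truth for blocked paths across staging and production.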
```
User-agent: *
Disallow: /admin/

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: facebookexternalhit
Allow: /

Sitemap: https://example.com/sitemap.xml
```

In 2026, many websites are blocking AI training crawlers to protect their content. Note that this doesn't prevent these companies from using previously crawled data — it only stops future crawling.
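Before deploying a multi-group file like this, it's worth verifying which bots are actually blocked. Here is a quick local check with the standard library, using a trimmed version of the rules above:

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the policy above, pasted as a string for testing.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("Googlebot", "https://example.com/article"))     # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/panel")) # False
```

Each user-agent group is evaluated independently, so a bot blocked site-wide in its own group is unaffected by the more permissive rules for other crawlers.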
```
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/?*
Disallow: /category/*?price*
Disallow: /category/*?color*
Disallow: /category/*?sort*

Sitemap: https://example.com/sitemap.xml
```

E-commerce sites generate thousands of URLs through faceted navigation (filters, sorting, pagination). Blocking these in robots.txt prevents crawl waste and helps Google focus on your actual product pages.
Robots.txt supports limited pattern matching through the * wildcard and the $ end-of-string anchor.
| Pattern | Matches | Example |
|---|---|---|
| `*` | Any sequence of characters | `Disallow: /*.pdf$` |
| `$` | End of URL string | `Disallow: /print/$` |
| `/*?` | Any URL with query parameters | `Disallow: /*?session=` |
| `/dir/` | Directory and all contents | `Disallow: /api/` |
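Python's built-in `urllib.robotparser` does not implement these wildcards (it matches paths as literal prefixes), so the sketch below emulates Google-style semantics with a regex translation. It is a simplification of the matching rules, not a full implementation:

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Simplified Google-style semantics: '*' matches any run of
    characters, a trailing '$' anchors the end of the URL path,
    and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    parts = [re.escape(p) for p in pattern.split("*")]
    regex = ".*".join(parts)
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start, mirroring prefix matching.
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*.pdf$", "/files/report.pdf"))            # True
print(matches("/*.pdf$", "/files/report.pdf?v=2"))        # False
print(matches("/*?sort=", "/category/shoes?sort=price"))  # True
```

A translator like this is handy for auditing which of your URLs a proposed wildcard rule would actually block before you publish it.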
```
# Block all PDF files
Disallow: /*.pdf$

# Block all URLs ending in /print/
Disallow: /*/print/$

# Block all URLs containing "sort="
Disallow: /*?sort=
```

Crawl budget refers to the number of URLs Googlebot will crawl on your site within a given timeframe. For large sites, inefficient crawl budget allocation means Google may not discover important new pages, or may waste time crawling low-value URLs.
Robots.txt helps optimize crawl budget by:

- Blocking low-value URLs such as admin paths, tracking-parameter duplicates, and faceted-navigation variants
- Keeping Googlebot focused on the pages you actually want discovered and indexed
- Pointing crawlers to your XML sitemap through the `Sitemap` directive
You can monitor your crawl budget usage in Google Search Console under Crawl Stats.
Never rely on robots.txt alone to hide sensitive content; the file is publicly visible, so use `noindex` meta tags and password protection instead. After deploying, open `https://yourdomain.com/robots.txt` in a browser to confirm it's accessible.

| Feature | Robots.txt | Meta Robots |
|---|---|---|
| Scope | Entire site or directory | Individual page |
| Controls crawling | ✅ Yes | ❌ No |
| Controls indexing | ⚠️ Indirect | ✅ Direct |
| Blocks link equity | ✅ Yes (no crawl = no link passing) | ⚠️ Depends on directive |
| Public visibility | ✅ Visible to everyone | ❌ Hidden in HTML |
| Server-level config | ✅ Single file | ❌ Per-page HTML |
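Because meta robots lives in per-page HTML rather than one server-level file, auditing it means parsing pages. Here is a minimal sketch using the standard `html.parser` module (the class name is illustrative):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the content of <meta name="robots"> tags on a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if attr_map.get("name", "").lower() == "robots":
                self.directives.append(attr_map.get("content", ""))

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="noindex, follow">'
       '</head><body>Hello</body></html>')
print(p.directives)  # ['noindex, follow']
```

A crawler has to fetch and parse the page to see these directives, which is exactly why a page must remain crawlable for its `noindex` tag to take effect.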
A properly configured robots.txt file is essential for SEO success. It helps search engines crawl your site efficiently, protects sensitive resources, and ensures your crawl budget is spent on pages that matter. Whether you're running a small blog or a large e-commerce platform, understanding robots.txt syntax and best practices gives you direct control over how search engines interact with your content.
Don't leave your crawl strategy to chance — generate a clean, validated robots.txt file in seconds.
Create a perfectly configured robots.txt file with our free generator. Choose from common presets or customize rules for your site.
Try Robots.txt Generator →