The Complete Guide to Robots.txt: Everything You Need to Know in 2026
Last updated: April 2026 · 13 min read
A robots.txt file is one of the first things search engine crawlers look for when they visit your website. It's a simple text file that tells crawlers which pages they can and cannot access, making it a foundational element of technical SEO and website management.
Despite its simplicity, a poorly configured robots.txt can cause serious SEO problems — from blocking important pages from being indexed, to wasting crawl budget on low-value URLs. This guide covers everything you need to know about robots.txt files, including syntax rules, best practices, common mistakes, and how to generate one correctly.
What Is a Robots.txt File?
A robots.txt file is a plain text file placed in the root directory of your website (e.g., https://example.com/robots.txt). It follows the Robots Exclusion Protocol (REP) and provides instructions to web robots — primarily search engine crawlers like Googlebot, Bingbot, and others — about which URLs they should or shouldn't crawl.
Key characteristics of robots.txt:
- Must be located at the root of your domain: yourdomain.com/robots.txt
- Must be a plain text file (UTF-8 encoding)
- Is publicly accessible to anyone
- File size limit: Google processes only the first 500 KiB of a robots.txt file; content beyond that limit is ignored
- Only applies to crawl behavior — it does not prevent pages from being indexed if they're linked from elsewhere
Important: robots.txt controls crawling, not indexing. If you want to prevent a page from appearing in search results, use the noindex meta tag or response header instead.
Robots.txt Syntax and Directives
The robots.txt file consists of one or more rules, each containing a user-agent line (identifying the crawler) and one or more directive lines (allow or disallow rules).
Basic Syntax
User-agent: [crawler-name]
Disallow: [URL-path]
Allow: [URL-path]
Sitemap: [sitemap-URL]
User-Agent
The User-agent line specifies which crawler the rule applies to. Common user agents include:
| User-Agent | Crawler |
|---|---|
| Googlebot | Google's web crawler |
| Bingbot | Microsoft Bing's crawler |
| Slurp | Yahoo's crawler (Yahoo Search results are now powered by Bing) |
| DuckDuckBot | DuckDuckGo's crawler |
| Baiduspider | Baidu's crawler |
| YandexBot | Yandex's crawler |
| * | Matches all crawlers (wildcard) |
Disallow
The Disallow directive specifies URL paths that the crawler should not access:
# Block a specific directory
Disallow: /private/
# Block a specific file
Disallow: /admin/login.html
# Allow everything (an empty value blocks nothing)
Disallow:
# Block everything
Disallow: /
Allow
The Allow directive explicitly permits access to a URL, even if it falls under a broader Disallow rule. This is useful when you want to block a directory but allow a specific file within it:
User-agent: Googlebot
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
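Google resolves conflicts between Allow and Disallow by specificity: the longest matching pattern wins, and Allow wins ties. The Python sketch below illustrates that precedence under a simplifying assumption — it handles only prefix matching, not * or $ wildcards, and the function and rule names are illustrative, not part of any library:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether a path may be crawled, given (directive, pattern) rules.

    Simplified sketch of Google's documented precedence: the most
    specific (longest) matching pattern wins, and Allow wins a tie.
    Only prefix matching is handled here, not * or $ wildcards.
    """
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            more_specific = len(pattern) > best_len
            tie_break = len(pattern) == best_len and directive == "Allow"
            if more_specific or tie_break:
                best_len, allowed = len(pattern), directive == "Allow"
    return allowed

rules = [("Disallow", "/wp-admin/"), ("Allow", "/wp-admin/admin-ajax.php")]
print(is_allowed("/wp-admin/admin-ajax.php", rules))  # longer Allow rule wins: True
print(is_allowed("/wp-admin/options.php", rules))     # only Disallow matches: False
```

This is why the admin-ajax.php example above works: the Allow pattern is longer, so it overrides the broader Disallow regardless of the order the rules appear in.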
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. This is not technically part of the REP but is universally supported:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Crawl-Delay
The Crawl-delay directive specifies a delay (in seconds) between successive crawler requests. Google ignores this directive (use Search Console to manage crawl rate instead), but Bing and other crawlers respect it:
User-agent: Bingbot
Crawl-delay: 10
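Python's standard library can read this directive for you. A minimal sketch using urllib.robotparser — the robots.txt content below is a hypothetical example:

```python
from urllib import robotparser

# A hypothetical robots.txt declaring a crawl delay for Bingbot.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler would sleep this many seconds between requests.
print(rp.crawl_delay("Bingbot"))        # -> 10
print(rp.crawl_delay("SomeOtherBot"))   # no matching group, no default -> None
```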
Path Matching Rules
Understanding how robots.txt matches paths is essential for writing correct rules:
- Exact match — Disallow: /page.html blocks that specific file (and, because matching works by prefix, any longer URL beginning with that path).
- Prefix match — Disallow: /admin blocks /admin, /admin/, /admin-panel, and any URL starting with /admin.
- Trailing slash matters — Disallow: /admin/ blocks /admin/ and /admin/anything but NOT /admin (without the slash).
- Wildcard (*) — * matches any sequence of characters, so Disallow: /*.pdf$ blocks all URLs ending in .pdf.
- End anchor ($) — Disallow: /page$ blocks /page but not /page.html.
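These matching rules can be expressed in a few lines of Python. The sketch below translates a robots.txt path pattern into a regular expression (* becomes "any characters", a trailing $ anchors the end of the path); the function names are illustrative, not from any library:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match to the end of the URL path. Plain patterns are prefix matches.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(path: str, disallow_pattern: str) -> bool:
    return pattern_to_regex(disallow_pattern).match(path) is not None

print(is_blocked("/admin-panel", "/admin"))        # prefix match: True
print(is_blocked("/admin", "/admin/"))             # trailing slash: False
print(is_blocked("/files/report.pdf", "/*.pdf$"))  # wildcard + end anchor: True
print(is_blocked("/page.html", "/page$"))          # end anchor: False
```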
Complete Robots.txt Examples
Basic Website
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
WordPress Site
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /?s=
Disallow: /search/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
E-Commerce Site
User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /*.pdf$
User-agent: Bingbot
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-categories.xml
SEO Best Practices for Robots.txt
1. Block Admin and Private Areas
Prevent crawlers from wasting crawl budget on admin pages, login forms, user accounts, and staging areas. These pages have no SEO value and crawling them reduces the time crawlers spend on your important content pages.
2. Block Duplicate and Low-Value Content
Common targets for blocking include: search result pages (/?s=), faceted navigation URLs with sort/filter parameters, print versions of pages, tag pages with thin content, and pagination beyond the first few pages. Use the * wildcard to efficiently block URL patterns.
3. Always Include Your Sitemap
The Sitemap directive helps crawlers discover your content faster. Include all sitemaps at the bottom of your robots.txt file. This is especially important for new websites that don't have many inbound links yet.
4. Don't Block CSS, JS, or Image Files
In 2026, search engines need access to CSS, JavaScript, and image files to properly render and understand your pages. Blocking these resources can lead to poor indexing, as Google won't be able to see the fully rendered version of your pages. This was a common mistake years ago that still causes problems today.
5. Test Before Deploying
Always test your robots.txt before deploying. Google retired its standalone robots.txt Tester; use the robots.txt report in Google Search Console or a third-party validator instead. A single typo can accidentally block your entire site from being crawled.
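One way to automate part of this check is Python's built-in urllib.robotparser. Note its limits: it uses simple prefix matching (no * or $ wildcards) and applies the first matching rule rather than the most specific one, so treat it as a sanity check, not a full Googlebot simulation. The file content and URLs below are examples:

```python
from urllib import robotparser

# Candidate robots.txt content you are about to deploy (example).
CANDIDATE = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
"""

# Pages that must stay crawlable (example URLs).
MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/blog/my-post/",
]

rp = robotparser.RobotFileParser()
rp.parse(CANDIDATE.splitlines())

for url in MUST_CRAWL:
    if not rp.can_fetch("*", url):
        raise SystemExit(f"Blocked important URL: {url}")
print("All important URLs are crawlable")
```

Running a script like this in CI before each deploy catches the "Disallow: /" class of accident before it reaches production.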
6. Use robots.txt for Crawl Budget Management
For large websites (100,000+ pages), crawl budget is a real concern. Use robots.txt to block low-value URLs and ensure crawlers spend their time on your most important content. Track crawl stats in Google Search Console to monitor how efficiently Google is crawling your site.
Common Robots.txt Mistakes
Mistake 1: Using robots.txt to Deindex Pages
This is the most dangerous mistake. robots.txt prevents crawling, but if a page is linked from another site, Google may still index it — showing the URL in search results without any snippet or title. To truly prevent indexing, use the noindex meta tag instead.
<!-- Use this to prevent indexing -->
<meta name="robots" content="noindex, follow">
Mistake 2: Disallowing Everything
# ❌ WRONG — blocks the entire site
User-agent: *
Disallow: /
This is often caused by a missing newline or an accidental edit. Always verify your robots.txt after making changes.
Mistake 3: Wildcard Misuse
# ❌ WRONG — this blocks ALL URLs because * matches everything
Disallow: *
# ✅ CORRECT — to block specific patterns
Disallow: /*?session=
Disallow: /*/temp/
Mistake 4: Multiple Sitemap Directives Scattered Throughout the File
While this technically works, it's cleaner and less error-prone to place all Sitemap directives at the end of the file after all user-agent rules.
Mistake 5: Not Updating After Site Changes
After a site migration, CMS change, or URL restructuring, your robots.txt may contain outdated rules that block new pages or fail to block old ones. Audit your robots.txt after every major site change.
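A lightweight way to catch accidental drift is to keep the intended robots.txt in version control and compare a normalized view of the deployed copy against it. A sketch, with illustrative helper names:

```python
def normalize(robots_txt: str) -> list[str]:
    """Strip comments and blank lines so cosmetic edits don't trigger alerts."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    return rules

def robots_drifted(deployed: str, expected: str) -> bool:
    """True if the deployed file's rules differ from the version-controlled ones."""
    return normalize(deployed) != normalize(expected)

expected = "User-agent: *\nDisallow: /wp-admin/\n"
deployed = "# edited by hand\nUser-agent: *\nDisallow: /wp-admin/\nDisallow: /\n"
print(robots_drifted(deployed, expected))  # stray "Disallow: /" is caught: True
```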
Robots.txt vs Meta Robots Tags
These are complementary tools with different purposes:
| Feature | robots.txt | Meta Robots Tag |
|---|---|---|
| Scope | Directory or pattern-level | Page-level |
| Controls crawling | Yes | No |
| Controls indexing | No (indirect only) | Yes |
| Location | Root directory file | HTML <head> or HTTP header |
| Applies to | All URLs matching the pattern | Individual pages |
Best practice: Use robots.txt to control what crawlers can access, and use meta robots tags to control whether specific pages are indexed.
How to Generate a Robots.txt File
You can create a robots.txt file manually in any text editor, but using a generator tool ensures correct syntax and helps you avoid common mistakes. Risetop's Robots.txt Generator lets you configure rules through a simple interface, handles wildcards and sitemap directives automatically, and generates a ready-to-deploy file.
Deployment Checklist
- Generate or write your robots.txt file
- Test it with the robots.txt report in Google Search Console or a third-party validator
- Upload it to your site's root directory (/robots.txt)
- Verify it's accessible at https://yourdomain.com/robots.txt
- Monitor crawl stats in Search Console for the next few days
Frequently Asked Questions
Where should I place my robots.txt file?
It must be in the root directory of your host, accessible at https://yourdomain.com/robots.txt. A robots.txt file placed in a subdirectory is ignored; each subdomain needs its own robots.txt at its own root.
Can I have multiple robots.txt files?
Only one robots.txt file per host (domain or subdomain). However, you can have separate robots.txt files on different subdomains (e.g., blog.example.com/robots.txt and shop.example.com/robots.txt).
How long does it take for robots.txt changes to take effect?
Google typically re-reads robots.txt within 24 hours, but it can take up to a few days for changes to fully propagate. The robots.txt report in Search Console shows when Google last fetched your file and lets you request a recrawl.
Does robots.txt prevent pages from appearing in search results?
No. robots.txt only controls crawling. If a URL is linked from another site, Google may still index it. To prevent a page from appearing in search results, use a noindex meta tag or HTTP response header.
What happens if I don't have a robots.txt file?
If no robots.txt file exists, crawlers assume they have full access to all pages. This is fine for most websites, but you should still create one to declare your sitemap and explicitly manage access to admin and low-value areas.
Conclusion
A well-configured robots.txt file is a cornerstone of technical SEO. It helps search engines crawl your site efficiently, prevents them from wasting time on low-value pages, and provides a clear entry point to your sitemaps. By understanding the syntax, following best practices, and avoiding common mistakes, you can ensure that crawlers focus on your most important content.
Remember: robots.txt controls crawling, not indexing. Combine it with meta robots tags for complete control over how search engines interact with your content. Use a generator tool to create your file correctly, test it thoroughly, and monitor the results in Search Console.