The Complete Guide to Robots.txt: Everything You Need to Know in 2026
Last updated: April 2026 · 13 min read
A robots.txt file is one of the first things search engine crawlers look for when they visit your website. It's a simple text file that tells crawlers which pages they can and cannot access, making it a foundational element of technical SEO and website management.
Despite its simplicity, a poorly configured robots.txt can cause serious SEO problems — from blocking important pages from being indexed, to wasting crawl budget on low-value URLs. This guide covers everything you need to know about robots.txt files, including syntax rules, best practices, common mistakes, and how to generate one correctly.
What Is a Robots.txt File?
A robots.txt file is a plain text file placed in the root directory of your website (e.g., https://example.com/robots.txt). It follows the Robots Exclusion Protocol (REP) and provides instructions to web robots — primarily search engine crawlers like Googlebot, Bingbot, and others — about which URLs they should or shouldn't crawl.
Key characteristics of robots.txt:
- Must be located at the root of your domain: yourdomain.com/robots.txt
- Must be a plain text file (UTF-8 encoding)
- Is publicly accessible to anyone
- File size limit: Google processes only the first 500 KiB of a robots.txt file; content beyond that limit is ignored
- Only applies to crawl behavior — it does not prevent pages from being indexed if they're linked from elsewhere
Important: robots.txt controls crawling, not indexing. If you want to prevent a page from appearing in search results, use the noindex meta tag or response header instead.
Robots.txt Syntax and Directives
The robots.txt file consists of one or more rules, each containing a user-agent line (identifying the crawler) and one or more directive lines (allow or disallow rules).
Basic Syntax
User-agent: [crawler-name]
Disallow: [URL-path]
Allow: [URL-path]
Sitemap: [sitemap-URL]
User-Agent
The User-agent line specifies which crawler the rule applies to. Common user agents include:
| User-Agent | Crawler |
|---|---|
| Googlebot | Google's web crawler |
| Bingbot | Microsoft Bing's crawler |
| Slurp | Yahoo's crawler (Yahoo Search results are now powered by Bing) |
| DuckDuckBot | DuckDuckGo's crawler |
| Baiduspider | Baidu's crawler |
| YandexBot | Yandex's crawler |
| * | Matches all crawlers (wildcard) |
Disallow
The Disallow directive specifies URL paths that the crawler should not access:
# Block a specific directory
Disallow: /private/
# Block a specific file
Disallow: /admin/login.html
# Allow everything (an empty value blocks nothing)
Disallow:
# Block everything
Disallow: /
Allow
The Allow directive explicitly permits access to a URL, even if it falls under a broader Disallow rule. This is useful when you want to block a directory but allow a specific file within it:
User-agent: Googlebot
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
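Google resolves conflicts between Allow and Disallow by specificity: the longest matching pattern wins, and Allow wins ties. The Python sketch below illustrates that precedence under a simplifying assumption — it handles only prefix matching, not * or $ wildcards, and the function and rule names are illustrative, not part of any library:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether a path may be crawled, given (directive, pattern) rules.

    Simplified sketch of Google's documented precedence: the most
    specific (longest) matching pattern wins, and Allow wins a tie.
    Only prefix matching is handled here, not * or $ wildcards.
    """
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            more_specific = len(pattern) > best_len
            tie_break = len(pattern) == best_len and directive == "Allow"
            if more_specific or tie_break:
                best_len, allowed = len(pattern), directive == "Allow"
    return allowed

rules = [("Disallow", "/wp-admin/"), ("Allow", "/wp-admin/admin-ajax.php")]
print(is_allowed("/wp-admin/admin-ajax.php", rules))  # longer Allow rule wins: True
print(is_allowed("/wp-admin/options.php", rules))     # only Disallow matches: False
```

This is why the admin-ajax.php example above works: the Allow pattern is longer, so it overrides the broader Disallow regardless of the order the rules appear in.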
Sitemap
The Sitemap directive tells crawlers where to find your XML sitemap. This is not technically part of the REP but is universally supported:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml
Crawl-Delay
The Crawl-delay directive specifies a delay (in seconds) between successive crawler requests. Google ignores this directive (use Search Console to manage crawl rate instead), but Bing and other crawlers respect it:
User-agent: Bingbot
Crawl-delay: 10
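Python's standard library can read this directive for you. A minimal sketch using urllib.robotparser — the robots.txt content below is a hypothetical example:

```python
from urllib import robotparser

# A hypothetical robots.txt declaring a crawl delay for Bingbot.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler would sleep this many seconds between requests.
print(rp.crawl_delay("Bingbot"))        # -> 10
print(rp.crawl_delay("SomeOtherBot"))   # no matching group, no default -> None
```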
Path Matching Rules
Understanding how robots.txt matches paths is essential for writing correct rules:
- Exact match — Disallow: /page.html blocks that specific file (and, because matching works by prefix, any longer URL beginning with that path).
- Prefix match — Disallow: /admin blocks /admin, /admin/, /admin-panel, and any URL starting with /admin.
- Trailing slash matters — Disallow: /admin/ blocks /admin/ and /admin/anything but NOT /admin (without the slash).
- Wildcard (*) — * matches any sequence of characters, so Disallow: /*.pdf$ blocks all URLs ending in .pdf.
- End anchor ($) — Disallow: /page$ blocks /page but not /page.html.
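These matching rules can be expressed in a few lines of Python. The sketch below translates a robots.txt path pattern into a regular expression (* becomes "any characters", a trailing $ anchors the end of the path); the function names are illustrative, not from any library:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    match to the end of the URL path. Plain patterns are prefix matches.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(path: str, disallow_pattern: str) -> bool:
    return pattern_to_regex(disallow_pattern).match(path) is not None

print(is_blocked("/admin-panel", "/admin"))        # prefix match: True
print(is_blocked("/admin", "/admin/"))             # trailing slash: False
print(is_blocked("/files/report.pdf", "/*.pdf$"))  # wildcard + end anchor: True
print(is_blocked("/page.html", "/page$"))          # end anchor: False
```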
Complete Robots.txt Examples
Basic Website
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
WordPress Site
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /?s=
Disallow: /search/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml
E-Commerce Site
User-agent: *
Allow: /
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /wishlist/
Disallow: /compare/
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /*.pdf$
User-agent: Bingbot
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-categories.xml
SEO Best Practices for Robots.txt
1. Block Admin and Private Areas
Prevent crawlers from wasting crawl budget on admin pages, login forms, user accounts, and staging areas. These pages have no SEO value and crawling them reduces the time crawlers spend on your important content pages.
2. Block Duplicate and Low-Value Content
Common targets for blocking include: search result pages (/?s=), faceted navigation URLs with sort/filter parameters, print versions of pages, tag pages with thin content, and pagination beyond the first few pages. Use the * wildcard to efficiently block URL patterns.
3. Always Include Your Sitemap
The Sitemap directive helps crawlers discover your content faster. Include all sitemaps at the bottom of your robots.txt file. This is especially important for new websites that don't have many inbound links yet.
4. Don't Block CSS, JS, or Image Files
In 2026, search engines need access to CSS, JavaScript, and image files to properly render and understand your pages. Blocking these resources can lead to poor indexing, as Google won't be able to see the fully rendered version of your pages. This was a common mistake years ago that still causes problems today.
5. Test Before Deploying
Always test your robots.txt before deploying. Google retired its standalone robots.txt Tester; use the robots.txt report in Google Search Console or a third-party validator instead. A single typo can accidentally block your entire site from being crawled.
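One way to automate part of this check is Python's built-in urllib.robotparser. Note its limits: it uses simple prefix matching (no * or $ wildcards) and applies the first matching rule rather than the most specific one, so treat it as a sanity check, not a full Googlebot simulation. The file content and URLs below are examples:

```python
from urllib import robotparser

# Candidate robots.txt content you are about to deploy (example).
CANDIDATE = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
"""

# Pages that must stay crawlable (example URLs).
MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/blog/my-post/",
]

rp = robotparser.RobotFileParser()
rp.parse(CANDIDATE.splitlines())

for url in MUST_CRAWL:
    if not rp.can_fetch("*", url):
        raise SystemExit(f"Blocked important URL: {url}")
print("All important URLs are crawlable")
```

Running a script like this in CI before each deploy catches the "Disallow: /" class of accident before it reaches production.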
6. Use robots.txt for Crawl Budget Management
For large websites (100,000+ pages), crawl budget is a real concern. Use robots.txt to block low-value URLs and ensure crawlers spend their time on your most important content. Track crawl stats in Google Search Console to monitor how efficiently Google is crawling your site.
Common Robots.txt Mistakes
Mistake 1: Using robots.txt to Deindex Pages
This is the most dangerous mistake. robots.txt prevents crawling, but if a page is linked from another site, Google may still index it — showing the URL in search results without any snippet or title. To truly prevent indexing, use the noindex meta tag instead.
<!-- Use this to prevent indexing -->
<meta name="robots" content="noindex, follow">
Mistake 2: Disallowing Everything
# ❌ WRONG — blocks the entire site
User-agent: *
Disallow: /
This is often caused by a missing newline or an accidental edit. Always verify your robots.txt after making changes.
Mistake 3: Wildcard Misuse
# ❌ WRONG — this blocks ALL URLs because * matches everything
Disallow: *
# ✅ CORRECT — to block specific patterns
Disallow: /*?session=
Disallow: /*/temp/
Mistake 4: Multiple Sitemap Directives Scattered Throughout the File
While this technically works, it's cleaner and less error-prone to place all Sitemap directives at the end of the file after all user-agent rules.
Mistake 5: Not Updating After Site Changes
After a site migration, CMS change, or URL restructuring, your robots.txt may contain outdated rules that block new pages or fail to block old ones. Audit your robots.txt after every major site change.
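A lightweight way to catch accidental drift is to keep the intended robots.txt in version control and compare a normalized view of the deployed copy against it. A sketch, with illustrative helper names:

```python
def normalize(robots_txt: str) -> list[str]:
    """Strip comments and blank lines so cosmetic edits don't trigger alerts."""
    rules = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    return rules

def robots_drifted(deployed: str, expected: str) -> bool:
    """True if the deployed file's rules differ from the version-controlled ones."""
    return normalize(deployed) != normalize(expected)

expected = "User-agent: *\nDisallow: /wp-admin/\n"
deployed = "# edited by hand\nUser-agent: *\nDisallow: /wp-admin/\nDisallow: /\n"
print(robots_drifted(deployed, expected))  # stray "Disallow: /" is caught: True
```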
Robots.txt vs Meta Robots Tags
These are complementary tools with different purposes:
| Feature | robots.txt | Meta Robots Tag |
|---|---|---|
| Scope | Directory or pattern-level | Page-level |
| Controls crawling | Yes | No |
| Controls indexing | No (indirect only) | Yes |
| Location | Root directory file | HTML <head> or HTTP header |
| Applies to | All URLs matching the pattern | Individual pages |
Best practice: Use robots.txt to control what crawlers can access, and use meta robots tags to control whether specific pages are indexed.
How to Generate a Robots.txt File
You can create a robots.txt file manually in any text editor, but using a generator tool ensures correct syntax and helps you avoid common mistakes. Risetop's Robots.txt Generator lets you configure rules through a simple interface, handles wildcards and sitemap directives automatically, and generates a ready-to-deploy file.
Deployment Checklist
- Generate or write your robots.txt file
- Test it with the robots.txt report in Google Search Console or a third-party validator
- Upload it to your site's root directory (/robots.txt)
- Verify it's accessible at https://yourdomain.com/robots.txt
- Monitor crawl stats in Search Console for the next few days
Frequently Asked Questions
Where should I place my robots.txt file?
It must be in the root directory of your host, accessible at https://yourdomain.com/robots.txt. A robots.txt file placed in a subdirectory is ignored; each subdomain needs its own robots.txt at its own root.
Can I have multiple robots.txt files?
Only one robots.txt file per host (domain or subdomain). However, you can have separate robots.txt files on different subdomains (e.g., blog.example.com/robots.txt and shop.example.com/robots.txt).
How long does it take for robots.txt changes to take effect?
Google typically re-reads robots.txt within 24 hours, but it can take up to a few days for changes to fully propagate. The robots.txt report in Search Console shows when Google last fetched your file and lets you request a recrawl.
Does robots.txt prevent pages from appearing in search results?
No. robots.txt only controls crawling. If a URL is linked from another site, Google may still index it. To prevent a page from appearing in search results, use a noindex meta tag or HTTP response header.
What happens if I don't have a robots.txt file?
If no robots.txt file exists, crawlers assume they have full access to all pages. This is fine for most websites, but you should still create one to declare your sitemap and explicitly manage access to admin and low-value areas.
Conclusion
A well-configured robots.txt file is a cornerstone of technical SEO. It helps search engines crawl your site efficiently, prevents them from wasting time on low-value pages, and provides a clear entry point to your sitemaps. By understanding the syntax, following best practices, and avoiding common mistakes, you can ensure that crawlers focus on your most important content.
Remember: robots.txt controls crawling, not indexing. Combine it with meta robots tags for complete control over how search engines interact with your content. Use a generator tool to create your file correctly, test it thoroughly, and monitor the results in Search Console.