What Is Robots.txt?
The robots.txt file is a plain text file placed at the root of your website (e.g., https://risetop.top/robots.txt) that instructs web crawlers — including Googlebot, Bingbot, and others — which pages or sections of your site they can or cannot access. It's one of the first files search engine crawlers look for when they visit your domain.
Think of robots.txt as an implementation of the Robots Exclusion Protocol — it tells bots where they're welcome and where they should stay away. It's not a security mechanism (anyone can still access blocked URLs directly), but it's a powerful tool for managing crawl budget, keeping crawlers out of sensitive sections, and directing crawler traffic efficiently.
Basic Syntax and Structure
A robots.txt file uses a simple structure with user-agent declarations followed by allow or disallow directives. Here's the fundamental format:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://risetop.top/sitemap.xml
```
The User-agent line specifies which crawler the rules apply to. Using * applies rules to all crawlers. Disallow tells crawlers not to access specific paths, while Allow explicitly permits access (useful for overriding a broader Disallow rule). The Sitemap directive points crawlers to your XML sitemap.
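To see how a crawler applies these rules, you can evaluate a robots.txt file against sample URLs with Python's standard urllib.robotparser module. A minimal sketch (the URLs are illustrative; note that robotparser uses simple first-match prefix rules and doesn't support Google's `*` and `$` wildcards, so it only approximates Googlebot's behavior):

```python
from urllib import robotparser

# The example file from above, parsed in memory rather than fetched.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://risetop.top/sitemap.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# /admin/ and /private/ are blocked for every crawler...
print(rp.can_fetch("*", "https://risetop.top/admin/login"))  # False
print(rp.can_fetch("*", "https://risetop.top/private/doc"))  # False
# ...while everything else falls through to Allow: /.
print(rp.can_fetch("*", "https://risetop.top/blog/post"))    # True
```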
Essential Robots.txt Directives
User-Agent
This identifies the crawler. Common values include Googlebot, Bingbot, Slurp (Yahoo), and * (all crawlers). You can create separate rule blocks for different crawlers, which is useful when you want to block certain bots while allowing Google full access.
Disallow
This specifies paths that crawlers should not access. Leaving Disallow empty means all pages are allowed. Using a single forward slash (Disallow: /) blocks the entire site. You can block specific directories, file types, or URL patterns.
```
User-agent: *
Disallow: /api/
Disallow: /tmp/
Disallow: /*.json$
Disallow: /search?q=
```
Allow
The Allow directive overrides a Disallow rule. This is particularly useful when you want to block a directory but allow specific files within it.
```
User-agent: Googlebot
Disallow: /resources/
Allow: /resources/public/
```
Crawl-Delay
Some crawlers support Crawl-delay, which requests a minimum wait between fetches, specified in seconds. This can be useful for servers with limited resources, but Googlebot ignores the directive entirely — Google adjusts its crawl rate automatically based on how your server responds, and the old Search Console crawl rate limiter has been retired.
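For crawlers that do honor it, the directive sits inside a user-agent block like any other rule. A sketch (Bingbot and the 10-second value are just illustrative):

```
User-agent: Bingbot
Crawl-delay: 10
```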
Sitemap
Including your sitemap URL in robots.txt helps crawlers discover it faster. This is one of the most important directives and should be present in every robots.txt file.
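The directive can appear anywhere in the file, and multiple Sitemap lines are valid if you split your sitemap into several files. A sketch (the sitemap-posts.xml and sitemap-pages.xml names are hypothetical):

```
Sitemap: https://risetop.top/sitemap.xml
Sitemap: https://risetop.top/sitemap-posts.xml
Sitemap: https://risetop.top/sitemap-pages.xml
```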
Robots.txt Best Practices for 2026
1. Always Include Your Sitemap
This is the single most impactful thing you can do with robots.txt. Pointing crawlers to your sitemap ensures they can discover all your important pages efficiently.
2. Block Admin and Internal Pages
Prevent crawlers from wasting crawl budget on login pages, admin panels, staging environments, internal search results, and duplicate content. For e-commerce sites, block cart, checkout, and account pages.
```
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?page=
```
3. Don't Block CSS and JavaScript
Google needs access to your CSS and JS files to render your pages properly. Blocking these resources can lead to poor indexing, as Googlebot won't see the fully rendered version of your site. If you previously blocked /*.css$ or /*.js$, remove those rules immediately.
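If a broad Disallow rule makes outright removal impractical, one option is to add explicit Allow rules for these assets — Google resolves conflicts in favor of the most specific (longest) matching rule. A sketch, assuming a hypothetical /assets/ directory:

```
User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$
```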
4. Use Separate Rules for Different Crawlers
If you want to block aggressive crawlers while keeping Google happy, create separate user-agent blocks. A crawler follows only the most specific user-agent group that matches it, wherever that group appears in the file, so Googlebot will obey its own block and ignore the wildcard rules.
```
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /api/
Disallow: /internal/
```
5. Test Before Deploying
Google Search Console's robots.txt report shows you how Googlebot fetched and parsed your file (it replaced the older standalone robots.txt Tester tool). Always test changes before deploying, as a misconfigured Disallow rule can accidentally block crawling of your entire site.
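In addition to checking in Search Console, a quick pre-deployment test can be scripted: parse the draft file and assert that your must-crawl URLs stay allowed and your blocked paths stay blocked. A minimal sketch with urllib.robotparser (the rules and URL lists are placeholders, and robotparser doesn't understand Google's wildcard syntax):

```python
from urllib import robotparser

DRAFT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Allow: /
"""

# URLs that must stay crawlable / must stay blocked (placeholders).
MUST_ALLOW = ["https://risetop.top/", "https://risetop.top/blog/seo-guide"]
MUST_BLOCK = ["https://risetop.top/wp-admin/options.php", "https://risetop.top/cart/"]

rp = robotparser.RobotFileParser()
rp.parse(DRAFT.splitlines())

for url in MUST_ALLOW:
    assert rp.can_fetch("Googlebot", url), f"unexpectedly blocked: {url}"
for url in MUST_BLOCK:
    assert not rp.can_fetch("Googlebot", url), f"unexpectedly allowed: {url}"
print("robots.txt draft passed all checks")
```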
Common Robots.txt Mistakes
- Blocking important pages — a typo or missing trailing slash can block far more than intended, since paths are matched as prefixes (Disallow: /admin also blocks /admin-panel/)
- No sitemap reference — Missing the Sitemap directive means crawlers have to discover pages through links alone
- Blocking CSS/JS — This prevents Google from rendering your pages properly for indexing
- Using robots.txt instead of noindex — if a page is already indexed and you want it removed, use a noindex meta tag and keep the page crawlable; blocking it in robots.txt stops Google from ever seeing the tag
- Case sensitivity — path matching in robots.txt is case-sensitive, so Disallow: /Admin/ does not block /admin/
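Two of these mistakes are easy to demonstrate with urllib.robotparser (its prefix matching approximates how real crawlers treat paths; the URLs are illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin
Disallow: /Private/
""".splitlines())

# Missing trailing slash: /admin is a prefix match, so it also
# blocks /admin-panel/, not just the /admin/ directory.
print(rp.can_fetch("*", "https://risetop.top/admin-panel/"))  # False

# Case sensitivity: /Private/ does not match lowercase /private/.
print(rp.can_fetch("*", "https://risetop.top/private/"))      # True
```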
Robots.txt vs Meta Robots Tag
Robots.txt controls whether crawlers can access a URL, while the meta robots tag controls whether they can index it. A page blocked by robots.txt can still appear in search results if other pages link to it — it just won't have a description. To truly prevent indexing, put a noindex meta tag on the page itself and leave the page crawlable, because crawlers can only see the tag if they're allowed to fetch the page. See our ranking factors guide for more on how crawl efficiency impacts rankings.
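The noindex directive itself lives in the page's HTML head (or, for non-HTML files, in an X-Robots-Tag HTTP response header). For example:

```html
<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex">
```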
Conclusion
A well-configured robots.txt file is essential for SEO. It helps search engines crawl your site efficiently, prevents waste of crawl budget on unimportant pages, and ensures your most valuable content gets indexed quickly. Review your robots.txt today, test it in Google Search Console, and make sure your sitemap is properly referenced. For a complete overview of technical SEO, check out our blog.