Robots.txt Generator: Control Search Engine Crawling

Compare, configure, and optimize robots.txt across platforms — from syntax to strategy

SEO · April 13, 2026 · 10 min read

What Is robots.txt and Why It Matters

Robots.txt is a plain text file stored at the root of your website (/robots.txt) that tells web crawlers which pages and resources they may or may not access. It's the first file a well-behaved search engine bot requests when visiting your site, and it serves as your primary tool for managing crawl behavior at scale.

While robots.txt cannot control indexing directly (only crawling), it has a profound impact on SEO through crawl budget management. Every time Google crawls a low-value page — a tag archive, a filtered product view, an admin panel — it spends crawl budget that could be used on your important content pages. Properly configured robots.txt ensures crawlers focus their limited resources on the pages that drive your business.

Default robots.txt Files Across Major Platforms

Different CMS platforms generate different default robots.txt files. Understanding these defaults helps you identify what's already handled and what needs customization.

WordPress (Default)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/wp-sitemap.xml

WordPress blocks the admin directory but allows AJAX endpoints. Modern WordPress also auto-generates a sitemap reference. The default is reasonably SEO-friendly but doesn't block plugin directories, trackback URLs, or date-based archives.

Shopify (Default)

User-agent: *
Disallow: /a/downloads/*
Disallow: /a/t/*
Disallow: /a/recommendations/*
Allow: /a/downloads/*/download
Disallow: /cart
Disallow: /orders
Disallow: /checkout
Disallow: /account
Disallow: /search
Disallow: /collections/*sort_by*
Disallow: /*/collections/*sort_by*
Disallow: /collections/*+*
Disallow: /collections/*%2B*
Disallow: /collections/*%2b*
Disallow: /collections/*-*
Disallow: /collections/*%2D*
Disallow: /collections/*%2d*
Disallow: /products/*-*
Disallow: /products/*%2D*
Sitemap: https://example.com/sitemap.xml

Shopify's default is impressively comprehensive. It blocks sorting parameters, filtered collections, cart/checkout/account pages, and various pattern-based duplicate URLs. This is one of the best default configurations among major platforms.

Wix (Default)

User-agent: *
Disallow: /v2/
Disallow: /paid-plans/
Disallow: /my-account/
Disallow: /search
Disallow: /account/
Disallow: /notifications/
Disallow: /blog-rss.php
Disallow: /blog-external-rss.php
Disallow: /store-rss.php
Sitemap: https://example.com/sitemap.xml

Wix blocks internal system paths and user-specific pages. It's a clean default but doesn't address dynamic URL parameters or paginated archive pages, which may need manual attention for larger sites.

Joomla (Default)

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

Joomla blocks all system directories by default, which is thorough for security but lacks SEO-specific optimizations. There's no sitemap reference, and no handling of common SEO issues like duplicate content from component URLs.

Robots.txt Directive Syntax Reference

The robots.txt protocol uses a simple key-value syntax. Understanding each directive gives you precise control over crawler behavior.

Directive     | Purpose                                                              | Example
User-agent    | Specifies which crawler the rules apply to                           | User-agent: Googlebot
Disallow      | Blocks access to matching paths                                      | Disallow: /private/
Allow         | Explicitly permits access (overrides Disallow)                       | Allow: /private/public/
Sitemap       | Declares the location of XML sitemaps                                | Sitemap: https://example.com/sitemap.xml
Crawl-delay   | Sets minimum delay between requests, in seconds (ignored by Google)  | Crawl-delay: 10
Request-rate  | Limits request frequency (non-standard)                              | Request-rate: 1/5
Host          | Specifies the preferred domain (Yandex only)                         | Host: https://example.com
Clean-param   | Strips session/tracking parameters from URLs (Yandex only)           | Clean-param: sid /pages/

Within a User-agent group, the order of rules does not matter to Google: the most specific rule, meaning the one with the longest matching path, wins. If both an Allow and a Disallow rule match a URL, the longer rule takes precedence, which is how Allow: /wp-admin/admin-ajax.php overrides Disallow: /wp-admin/.
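The longest-match precedence rule can be sketched in a few lines of Python. This is a simplified model for illustration only: it handles plain path prefixes, not wildcards, and the function name is ours, not part of any robots.txt library.

```python
def is_allowed(rules, url_path):
    """Google-style precedence: the longest matching rule wins,
    and on a tie Allow beats Disallow. rules is a list of
    (directive, path) pairs, e.g. ("disallow", "/wp-admin/")."""
    best_len, best_allow = -1, True  # no matching rule at all means allowed
    for directive, path in rules:
        if path and url_path.startswith(path):
            allow = directive == "allow"
            # longer rule wins; on equal length, Allow wins the tie
            if len(path) > best_len or (len(path) == best_len and allow):
                best_len, best_allow = len(path), allow
    return best_allow

rules = [("disallow", "/wp-admin/"),
         ("allow", "/wp-admin/admin-ajax.php")]
print(is_allowed(rules, "/wp-admin/admin-ajax.php"))  # True  (Allow rule is longer)
print(is_allowed(rules, "/wp-admin/options.php"))     # False (only Disallow matches)
```

This is exactly why the WordPress default works: the Allow rule is longer than the Disallow rule it carves an exception out of.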

Advanced robots.txt Patterns

Pattern Matching with Wildcards

The asterisk (*) matches any sequence of characters within a path. The dollar sign ($) matches the end of a URL. These patterns enable powerful blocking rules:

# Block all URLs containing a question mark (dynamic parameters)
Disallow: /*?

# Block all PDF files
Disallow: /*.pdf$

# Block paginated archive URLs
# (robots.txt supports only * and $, not regular expressions,
# so patterns like /page/[0-9]+$ are invalid)
Disallow: /*/page/

# Block all product sorting URLs
Disallow: /*?sort=

# Block WordPress date-based archives by listing years explicitly
# (there is no numeric-range syntax)
Disallow: /2024/
Disallow: /2025/
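Since * and $ are the only metacharacters, you can test what a pattern matches by translating it into a regular expression. A small sketch (the helper name is illustrative, not from any standard library API):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.
    Only two metacharacters exist: * (any character sequence)
    and a trailing $ (anchors the end of the URL)."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)     # re.match() anchors at the path start

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))      # True
print(bool(pdf_rule.match("/files/report.pdf?v=2")))  # False: $ anchors the end

param_rule = robots_pattern_to_regex("/*?")
print(bool(param_rule.match("/shop?sort=price")))     # True
```

The second case shows why a trailing $ matters: without it, /*.pdf would also block PDF URLs carrying query strings, which may or may not be what you want.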

Managing Crawl Budget

For sites with more than a few thousand pages, crawl budget becomes a real concern. Google allocates crawl budget based on crawl demand for your content and on how much load your server can handle. Every disallowed path saves crawl budget that gets redistributed to allowed paths.

Priority candidates for disallowing include:

- Internal site search result pages (e.g. /search)
- Faceted and sorted views of category pages (?sort=, ?filter=)
- Tag, author, and date-based archives that duplicate your main content
- Cart, checkout, and account pages
- Admin panels, API endpoints, and other system paths

XML Sitemap Integration

The Sitemap directive in robots.txt helps crawlers discover your sitemaps faster. You can declare multiple sitemaps, which is useful for large sites that split sitemaps by content type:

User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-images.xml
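Python's standard library can parse these declarations back out, which is handy for quickly verifying a generated file; note that RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
Sitemap: https://example.com/sitemap-images.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Sitemap lines are collected no matter where they appear in the file
print(rp.site_maps())
# ['https://example.com/sitemap-posts.xml',
#  'https://example.com/sitemap-products.xml',
#  'https://example.com/sitemap-images.xml']
```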

Platform-Specific User-Agent Rules

Different search engines can receive different instructions. This is useful for blocking aggressive bots while allowing Google full access:

# Allow Google full access
User-agent: Googlebot
Disallow:

# Allow Bing full access
User-agent: Bingbot
Disallow:

# Block aggressive or unknown crawlers
User-agent: *
Disallow: /api/
Disallow: /internal/
Crawl-delay: 10

Common search engine user agents: Googlebot (Google web), Googlebot-Image (Google images), Bingbot (Bing), Slurp (Yahoo/Bing), DuckDuckBot (DuckDuckGo), Baiduspider (Baidu), YandexBot (Yandex).
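A quick way to sanity-check per-agent rules is Python's built-in urllib.robotparser. It uses simple prefix matching rather than Google's longest-match semantics, so treat it as a rough check rather than a Google simulator; for plain prefix rules like these, the results agree.

```python
from urllib import robotparser

robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /api/
Disallow: /internal/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own group; an empty Disallow means full access
print(rp.can_fetch("Googlebot", "https://example.com/api/data"))     # True
# Any other agent falls through to the * group
print(rp.can_fetch("SomeScraper", "https://example.com/api/data"))   # False
print(rp.can_fetch("SomeScraper", "https://example.com/blog/post"))  # True
```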

Common Robots.txt Mistakes

The errors that show up most often in real-world files:

- Blocking CSS and JavaScript files, which prevents Google from rendering pages correctly
- Expecting Disallow to remove a page from search results (it only stops crawling, not indexing)
- Placing the file anywhere other than the site root
- Using regex syntax such as [0-9] when only * and $ are supported
- Forgetting the Sitemap directive
- Carrying a staging-environment Disallow: / over to production

Using the RiseTop Robots.txt Generator

Writing a correct robots.txt file requires understanding the syntax rules, path matching behavior, and interaction between directives. The RiseTop Robots.txt Generator handles all of this through a simple interface — no syntax knowledge required.

Generate a production-ready robots.txt file in seconds — with preset templates for popular CMS platforms.

Try Robots.txt Generator →

The tool provides pre-configured templates for WordPress, Shopify, Wix, and custom sites. You can add, edit, and remove rules through checkboxes and input fields, and the tool generates valid robots.txt syntax in real time. It includes a validation check that warns you about common errors like blocking CSS/JS files or forgetting the sitemap directive.

Frequently Asked Questions

Where should robots.txt be placed on my website?

The robots.txt file must be placed in the root directory of your website, accessible at https://yourdomain.com/robots.txt. It cannot be placed in a subdirectory — https://yourdomain.com/blog/robots.txt will be ignored. If the file doesn't exist at the root, crawlers assume no restrictions exist and will attempt to crawl everything.
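The root-only rule mirrors how URL resolution works: joining the root-relative path /robots.txt onto any page URL of a site always resolves to the same root location, regardless of how deep the page is. A quick illustration:

```python
from urllib.parse import urljoin

pages = [
    "https://example.com/blog/post-1",
    "https://example.com/shop/products/widget?ref=nav",
]
for page in pages:
    # a root-relative path discards the page's subdirectories
    print(urljoin(page, "/robots.txt"))
# both lines print: https://example.com/robots.txt
```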

Does robots.txt prevent a page from appearing in Google search results?

No. robots.txt prevents crawlers from accessing a URL, but it does not prevent the URL from appearing in search results. If other pages link to the blocked URL, Google may show it as a listing without a description (just the URL). To completely remove a page from search results, use the noindex meta tag (<meta name="robots" content="noindex">) or remove the page entirely.

What is the difference between robots.txt disallow and noindex?

Disallow in robots.txt tells crawlers not to crawl the page, but Google can still index and display the URL in search results based on external links pointing to it. Noindex in a meta tag tells Google not to show the page in results, but the page must be crawlable for Google to see the noindex directive. For complete removal, some SEOs use both: disallow to stop crawling and noindex as a fallback for any page that gets crawled through a loophole.

How often do search engines check robots.txt?

Google caches robots.txt for up to 24 hours, refreshing it sooner if the file's HTTP caching headers indicate a change. Major changes to robots.txt are usually reflected in Google's crawling behavior within 1-2 days. Bing follows a similar schedule. Use the robots.txt report in Google Search Console to verify that your file is accessible and that Googlebot can parse it correctly.

Can I have different rules for different search engines?

Yes. You can create separate User-agent groups for different crawlers. Googlebot, Bingbot, and other crawlers each respect their own User-agent section. This is useful for allowing full crawl access to Google while blocking aggressive or low-quality crawlers, or for testing different crawl strategies with different engines. Rules are processed independently for each User-agent block.