Every website needs a robots.txt file. It is one of the first things search engine crawlers look for when they visit your domain, and it serves as a set of instructions telling bots which pages they can and cannot access. Yet many website owners either ignore this file entirely or create one with errors that inadvertently block important content from being indexed.
In this comprehensive guide, you will learn everything about the robots.txt protocol, how to write a proper file using our free robots.txt generator, and how to avoid the most common mistakes that can hurt your search rankings.
Try Our Free Robots.txt Generator
Create a perfectly optimized robots.txt file in seconds, no coding required.
Generate robots.txt Now
What Is a robots.txt File?
A robots.txt file is a plain text file stored in the root directory of your website (e.g., https://example.com/robots.txt) that communicates with web crawlers and other automated bots. It uses the Robots Exclusion Protocol (REP) to specify which areas of your site should or should not be crawled.
Think of it as a digital "do not disturb" sign for certain parts of your website. While it cannot force a bot to comply (malicious bots will simply ignore it), all legitimate search engines, including Google, Bing, and Yandex, respect the directives in your robots.txt file.
Key Concepts
- User-agent: Identifies which crawler the rule applies to (e.g., Googlebot, Bingbot, or * for all crawlers).
- Disallow: Specifies which paths the crawler should not access.
- Allow: Explicitly permits access to a path within a disallowed parent directory.
- Sitemap: Points to the location of your XML sitemap(s).
- Crawl-delay: Requests a delay (in seconds) between successive crawler requests. Honored by some crawlers such as Bingbot, but ignored by Googlebot.
How to Create a robots.txt File
Creating a robots.txt file is straightforward, but getting it right requires understanding the syntax and common patterns. Here is a step-by-step walkthrough using our generator and manual creation methods.
Step 1: Using the RiseTop Robots.txt Generator
Our online robots.txt generator simplifies the process. Instead of memorizing syntax rules, you can configure your directives through an intuitive interface:
- Navigate to the robots.txt generator tool.
- Select which user-agents you want to target (Googlebot, Bingbot, or all bots).
- Add directories or files you want to disallow (e.g., /admin/, /wp-admin/, /private/).
- Optionally add Allow rules for specific paths within disallowed directories.
- Enter your sitemap URL(s).
- Click "Generate" to get your optimized robots.txt file.
- Copy the output and upload it to your website's root directory.
Step 2: Manual Creation
If you prefer to write the file by hand, create a plain text file named robots.txt and use the following syntax:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 10
This example blocks all crawlers from accessing the /admin/, /private/, and /tmp/ directories, while explicitly allowing access to /admin/public/. It also declares the sitemap location and requests a 10-second delay between requests.
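Before uploading a file like this, you can sanity-check it locally with Python's standard-library urllib.robotparser. A minimal sketch using the example above (the URLs are placeholders):

```python
import urllib.robotparser

# The example robots.txt from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 10
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether the generic user-agent "*" may fetch given paths.
print(parser.can_fetch("*", "https://example.com/admin/settings"))  # False
print(parser.can_fetch("*", "https://example.com/blog/post-1"))     # True
print(parser.crawl_delay("*"))                                      # 10
```

One caveat: Python's parser applies rules in file order, so for /admin/public/ it stops at the earlier Disallow, whereas Google uses longest-match precedence and would honor the Allow. Tools differ on this edge case, which is a good reason to test with more than one validator.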
Robots.txt Examples for Common Platforms
WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /trackback/
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/sitemap.xml
This configuration blocks crawlers from WordPress administrative and plugin directories while allowing the AJAX endpoint that many themes and plugins rely on. It also blocks internal search results to prevent duplicate content issues. Note, however, that /wp-includes/ and /wp-content/plugins/ serve CSS and JavaScript on many sites, so verify that blocking them does not break page rendering for Google.
E-Commerce (Shopify / WooCommerce)
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /collections/*?*
Sitemap: https://example.com/sitemap.xml
For e-commerce sites, it is important to block cart, checkout, and account pages: these provide no SEO value and can waste crawl budget. Blocking sorted and paginated URLs prevents duplicate content from cluttering search results.
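The * and $ patterns above are matched against the URL path plus query string. As an illustration of how a Google-style matcher interprets them, here is a small simplified sketch (our own helper, not any engine's actual implementation) that translates a robots.txt pattern into a regular expression:

```python
import re

def robots_pattern(pattern: str) -> "re.Pattern[str]":
    """Compile a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the
    pattern to the end of the URL. Matching is prefix-based, as in
    the Robots Exclusion Protocol.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # restore the end-of-URL anchor
    return re.compile(regex)

# "Disallow: /*?sort=" catches any URL with a sort parameter.
sort_rule = robots_pattern("/*?sort=")
print(bool(sort_rule.match("/collections/shoes?sort=price")))  # True
print(bool(sort_rule.match("/collections/shoes")))             # False

# "Disallow: /*.json$" blocks only URLs that end in .json.
json_rule = robots_pattern("/*.json$")
print(bool(json_rule.match("/api/data.json")))      # True
print(bool(json_rule.match("/api/data.json?v=2")))  # False
```

The last case shows why the $ anchor matters: without it, the rule would also match .json URLs that carry query strings.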
Single Page Application (SPA)
User-agent: *
Disallow: /api/
Disallow: /assets/temp/
Disallow: /*.json$
Allow: /
Sitemap: https://example.com/sitemap.xml
SPAs often serve JSON data through API endpoints that should not be indexed. This configuration blocks API routes and temporary asset files while allowing everything else.
Common Mistakes to Avoid
- Blocking CSS and JavaScript files: Google needs access to your CSS and JS to properly render and index your pages. Blocking these resources can lead to poor indexing of your content.
- Using robots.txt to hide sensitive content: robots.txt is public; anyone can view it. If a URL is disallowed in robots.txt, Google may still show it in search results (without a snippet). Use authentication or noindex meta tags instead.
- Wildcards and pattern matching errors: The * wildcard and $ end-of-URL anchor are supported by major crawlers such as Googlebot and Bingbot (and are defined in RFC 9309), but not by every bot. Be careful with broad patterns like Disallow: /*?, which blocks every URL containing a query string.
- Combining multiple sitemaps on one line: Each Sitemap directive must appear on its own line. Some generators incorrectly merge them.
- Incorrect file location: The file must be at /robots.txt, not in a subdirectory like /public/robots.txt.
Use Cases for robots.txt
1. Managing Crawl Budget
Larger websites with thousands of pages need to be strategic about which pages search engines crawl. By blocking low-value pages (print versions, duplicate category pages, session IDs), you direct crawlers toward your most important content. This is especially critical for large e-commerce sites with millions of product URL variations.
2. Preventing Indexing of Staging or Dev Sites
Development and staging environments should never appear in search results. A simple robots.txt file with Disallow: / tells all crawlers to stay away. Combine this with HTTP authentication for actual security.
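A staging environment's entire robots.txt can be as short as two lines:

```
User-agent: *
Disallow: /
```

This asks every well-behaved crawler to skip the whole site; pair it with HTTP authentication, since the file itself provides no access control.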
3. Blocking Internal Search Results
Internal search result pages create near-infinite URL combinations that consume crawl budget and create duplicate content. Blocking /search?q= or /?s= keeps crawlers focused on your actual content pages.
4. Reducing Server Load
Aggressive crawlers can put significant load on your server. Using Crawl-delay or disallowing heavy resource directories helps manage traffic from bots while ensuring they still access your important pages.
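For example, to slow down a specific crawler without restricting others, give it its own group (Bingbot honors Crawl-delay; Googlebot ignores the directive):

```
# Ask Bingbot to wait 5 seconds between requests
User-agent: Bingbot
Crawl-delay: 5

# All other crawlers: no restrictions
User-agent: *
Disallow:
```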
Frequently Asked Questions
Where should I place my robots.txt file?
Your robots.txt file must be placed in the root directory of your website. For example, if your domain is example.com, the file should be accessible at example.com/robots.txt. Placing it in a subdirectory will not work; search engine crawlers only look for it at the root level.
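Crawlers derive the robots.txt location from the scheme and host alone, discarding any path. This illustrative Python helper (the function name is our own) shows the rule, including the fact that each subdomain is a separate host with its own robots.txt:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_location(url: str) -> str:
    """Return the root-level robots.txt URL for any page URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_location("https://example.com/blog/some-post"))
# https://example.com/robots.txt
print(robots_location("https://shop.example.com/cart/"))
# https://shop.example.com/robots.txt  (subdomains need their own file)
```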
Can robots.txt completely hide my pages from Google?
No. robots.txt tells crawlers not to crawl a page, but it does not prevent the page from appearing in search results if other sites link to it. To fully prevent a page from being indexed, use a noindex meta tag or an X-Robots-Tag HTTP header, and leave the page crawlable so that search engines can actually see the noindex directive.
What happens if I don't have a robots.txt file?
If no robots.txt file exists, search engine crawlers assume they have full permission to crawl all pages on your site. This is fine for most websites, but you may want a robots.txt file to block access to admin areas, internal search results, or resource-heavy pages.
How do I test my robots.txt file?
You can test your robots.txt file using the robots.txt report in Google Search Console (the standalone robots.txt Tester has been retired), or use RiseTop's free robots.txt generator, which includes a validation feature. Simply paste your robots.txt content and the tool will check for syntax errors and potential issues.
What is the difference between Disallow and Noindex?
Disallow in robots.txt prevents crawlers from fetching a URL at all. Noindex (as a meta tag or HTTP header) tells search engines not to include the page in results, even though they may crawl it. The two do not combine well: if a page is disallowed, crawlers never fetch it and so never see its noindex directive. To keep a page out of search results, leave it crawlable and apply noindex.