
Robots.txt Guide: What to Block, What to Allow & Common Mistakes

By: Irina Shvaya | June 4, 2026

A single misplaced line in your robots.txt file can hide your entire website from Google. It sounds dramatic, but we’ve seen it happen — a staging directive left in place after launch, a broad Disallow rule that blocks critical pages, or a forgotten wildcard that shuts out Googlebot entirely. According to Google, the robots.txt file is one of the first things their crawler checks before indexing a single page on your site. In this robots.txt guide, we break down exactly what this file does, walk through every directive you need to know, and show you the most common mistakes that silently sabotage your SEO.

Key Takeaways

Robots.txt tells search engine crawlers which URLs they can and cannot access on your site.
Always block admin pages, staging environments, internal search results, and duplicate URL parameters.
Never block CSS, JavaScript, or image files — Google needs them to render and understand your pages.
Test your robots.txt before pushing changes live, then confirm Google has fetched the new version using the robots.txt report in Google Search Console.
A misconfigured robots.txt is one of the most common technical SEO issues we find during an SEO audit.

What Is Robots.txt and Why Does It Matter for SEO?

Robots.txt is a plain text file that lives at the root of your domain (e.g., yoursite.com/robots.txt). It follows the Robots Exclusion Protocol, a standard that tells web crawlers which parts of your site they’re allowed to access and which parts they should skip. Think of it as a bouncer at the door of your website. It doesn’t physically lock anyone out: badly behaved bots can ignore it, and a blocked URL can still be indexed if other sites link to it. But well-behaved bots like Googlebot respect the rules you set.

Why it matters for robots.txt SEO:

Crawl budget management: Every site gets a limited number of pages Googlebot will crawl in a given session. Robots.txt helps you direct that budget toward the pages that actually matter.
Preventing index bloat: Blocking low-value pages like internal search results or admin panels keeps your index lean and focused.
Protecting sensitive areas: While it’s not a security measure, it keeps bots from wasting time on staging environments, login pages, and backend scripts.

If your robots.txt is misconfigured, you might be wasting crawl budget on junk pages — or worse, blocking Google from your most important content. This is one of the first things we check when diagnosing crawl errors and indexing issues on client sites.

Robots.txt Syntax Explained: The Core Directives

The robots.txt file uses a simple, line-based syntax. Here are the four directives you’ll use most often:

User-agent

Specifies which crawler the following rules apply to.

    User-agent: Googlebot
    User-agent: *

The asterisk (*) targets all crawlers. You can also set rules for specific bots like Bingbot, Googlebot-Image, or AdsBot-Google.

Disallow

Tells the specified crawler not to access a particular URL path.

    Disallow: /admin/
    Disallow: /staging/

An empty Disallow: directive means “allow everything”, which is a common source of confusion.

Allow

Overrides a Disallow rule for a specific path within a blocked directory.

    Disallow: /resources/
    Allow: /resources/guides/

This blocks everything in /resources/ except the /resources/guides/ subdirectory.

Sitemap

Points crawlers to your XML sitemap so they can discover all your important pages.

    Sitemap: https://yoursite.com/sitemap.xml

Always include this directive. Your sitemap and your robots.txt should work together: your sitemap lists the pages you want indexed, and your robots.txt blocks the ones you don’t. For a deeper dive, see our guide on sitemaps and how they boost crawl efficiency.

What to Block in Robots.txt

Not every page on your site deserves to be crawled. Here’s what you should typically block:

1. Admin and Backend Pages

    Disallow: /wp-admin/
    Disallow: /admin/
    Disallow: /backend/

These pages offer zero value to search engines and waste crawl budget.

2. Staging and Development Environments

    Disallow: /staging/
    Disallow: /dev/

If your staging site is on a subdomain (e.g., staging.yoursite.com), make sure that subdomain has its own robots.txt blocking all access. We regularly find staging sites that are fully indexed by Google, which is a duplicate content nightmare.

3. Internal Search Results Pages

    Disallow: /search/
    Disallow: /?s=

Internal search result pages create near-infinite URL combinations and offer thin, duplicated content. Google specifically recommends blocking these.

4. Duplicate URL Parameters

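    Disallow: /*?sort=
    Disallow: /*?filter=
    Disallow: /*?ref=

Faceted navigation and tracking parameters generate thousands of duplicate pages. Block the parameters that don’t produce unique content. Note that /*?sort= only matches when sort is the first parameter in the query string; add a rule such as /*&sort= to catch it in later positions.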
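
To see which URLs a wildcard rule actually catches, the pattern can be translated into a regular expression. The following is a minimal Python sketch, assuming the wildcard semantics Google documents (* matches any run of characters, a trailing $ anchors the end of the URL). The helper names pattern_to_regex and matches, and the sample URLs, are illustrative and not part of any library.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end
    of the URL. Everything else is treated as literal text.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """Return True if the rule pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    for url in ("/shoes?sort=price", "/shoes?color=red&sort=price"):
        print(url, matches("/*?sort=", url), matches("/*&sort=", url))
```

Running it on the two sample URLs shows that /*?sort= only catches the first one, while /*&sort= only catches the second, which is why parameter rules often come in pairs.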

5. Thank-You and Confirmation Pages

    Disallow: /thank-you/
    Disallow: /order-confirmation/

These pages have no search value and can expose conversion flow details.

What NOT to Block in Robots.txt

This is where most of the damage happens. Here’s what you should never block:

CSS and JavaScript Files

    # WRONG: do not do this
    Disallow: /assets/wp-content/themes/
    Disallow: /assets/wp-includes/
    Disallow: /*.css$
    Disallow: /*.js$

Google needs access to your CSS and JavaScript to render your pages properly. If Googlebot can’t load your stylesheets or scripts, it sees a broken, unstyled version of your page and ranks it accordingly. Google has explicitly stated that blocking CSS and JS can negatively impact indexing.

Images

Don’t block image directories unless you have a specific legal reason to do so. Images contribute to your overall SEO through Google Image Search, and they help Google understand page context.

Key Content Pages

This sounds obvious, but overly broad Disallow rules catch important pages more often than you’d think. A rule like Disallow: /blog/draft will also block /blog/drafting-contracts-guide/ if you’re not careful with trailing slashes.

Your Sitemap

Never block the path to your XML sitemap. Crawlers need to access it to find your pages efficiently.

How to Test Your Robots.txt File

Writing your robots.txt is only half the job. You need to verify it works the way you intend.

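Google Search Console Robots.txt Report

Google has retired the standalone robots.txt Tester. Search Console now provides a robots.txt report, and the URL Inspection tool covers individual URLs:

Open Google Search Console.
Navigate to Settings > Crawling > robots.txt.
Review when each robots.txt file was last fetched and whether Google reported any warnings or errors.
To check a specific page, enter its URL in the URL Inspection tool, which reports whether crawling is blocked by robots.txt.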

Manual Testing Checklist

Before pushing any robots.txt changes to production, verify these:

☐ Is your homepage allowed? (/)
☐ Are your main category and service pages allowed?
☐ Can Googlebot access your CSS and JS files?
☐ Is your sitemap URL included and accessible?
☐ Are staging-specific rules removed from the production file?
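
Parts of this checklist can be automated before deployment. The sketch below uses Python’s standard urllib.robotparser module to check a robots.txt body against a list of paths that must stay crawlable, and to list any Sitemap entries. The function name check_robots and the sample paths are assumptions for illustration. Keep in mind that this parser uses plain prefix matching and applies the first matching rule, so it does not reproduce Google’s wildcard or longest-match handling; treat it as a sanity check, not a substitute for Search Console.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, must_allow: list[str],
                 agent: str = "Googlebot") -> dict:
    """Report must-allow paths that are blocked, plus declared sitemaps."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [path for path in must_allow
               if not parser.can_fetch(agent, path)]
    # site_maps() returns None when no Sitemap line is present.
    return {"blocked": blocked, "sitemaps": parser.site_maps() or []}


if __name__ == "__main__":
    candidate = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /search/
Sitemap: https://yoursite.com/sitemap.xml
"""
    # Hypothetical paths that should never be blocked on this site.
    report = check_robots(candidate, ["/", "/services/", "/blog/"])
    print(report)
```

An empty blocked list and a non-empty sitemaps list cover the homepage, key page, and sitemap items on the checklist; CSS and JS paths can be added to must_allow in the same way.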

Third-Party Crawl Tools

Tools like Screaming Frog and Sitebulb let you crawl your own site while respecting (or ignoring) robots.txt rules. This is a great way to see exactly which pages are blocked before Google encounters them. A comprehensive SEO audit should always include a robots.txt review.

Common Robots.txt Mistakes That Hurt Your SEO

We’ve audited hundreds of websites, and these are the robots.txt mistakes we see over and over:

1. Blocking the Entire Site After Launch

    # Left over from staging: disastrous in production
    User-agent: *
    Disallow: /

This single line tells every crawler to stay away from every page. It’s the nuclear option, and it’s usually a leftover from a pre-launch staging environment. Always review your robots.txt on launch day.

2. Using Robots.txt Instead of Noindex

Robots.txt prevents crawling, but it doesn’t prevent indexing. If other sites link to a page you’ve blocked in robots.txt, Google may still index the URL; it just won’t know what’s on it. If you want to remove a page from search results, use a noindex meta tag instead, and make sure the page is not blocked in robots.txt, because Google has to crawl the page to see the tag. For more on how crawling and indexing work together, see our complete technical SEO guide.

3. Conflicting Rules Without Priority Understanding

    User-agent: *
    Disallow: /products/
    Allow: /products/featured/

Google uses the most specific rule that matches the URL, meaning the one with the longest path. In this case, /products/featured/ would be allowed because the Allow rule is longer than the Disallow rule. But not all crawlers handle conflicts the same way, so always test.
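
The precedence rule can be sketched in a few lines of Python. This is a simplified model, assuming the behavior Google documents: the longest matching path wins, and Allow wins a tie. It uses plain prefix matching and ignores wildcards for brevity; the function name is_allowed and the rule format are illustrative choices, not a real crawler’s API.

```python
def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """Decide whether url_path may be crawled under one user-agent group.

    rules is a list of (kind, path) pairs where kind is "allow" or
    "disallow". The longest matching path wins; on a tie, allow wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the URL is allowed
    for kind, path in rules:
        if not path or not url_path.startswith(path):
            continue  # empty paths match nothing
        is_allow = kind == "allow"
        if len(path) > best_len or (len(path) == best_len and is_allow):
            best_len, allowed = len(path), is_allow
    return allowed


if __name__ == "__main__":
    rules = [("disallow", "/products/"), ("allow", "/products/featured/")]
    for path in ("/products/featured/boots", "/products/clearance/", "/about/"):
        print(path, is_allowed(rules, path))
```

Because the decision depends on path length rather than on the order of the lines, reordering the two rules in the file does not change the outcome under this model. A crawler that applies the first matching rule instead would behave differently, which is the reason to test.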

4. Forgetting the Trailing Slash

    Disallow: /blog/tags     # Blocks /blog/tags but also /blog/tags-explained/
    Disallow: /blog/tags/    # Only blocks URLs under /blog/tags/

A missing trailing slash can accidentally block pages that share a similar URL prefix. Be precise.
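
The difference is easy to confirm with Python’s standard urllib.robotparser, which applies the same prefix matching for simple paths. The helper name allowed below is an assumption for illustration; it wraps a single rule in a catch-all User-agent group and asks whether a path may be fetched.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule: str, path: str) -> bool:
    """Check one Disallow/Allow rule against a path for all crawlers."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", rule])
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    for rule in ("Disallow: /blog/tags", "Disallow: /blog/tags/"):
        for path in ("/blog/tags/seo/", "/blog/tags-explained/"):
            print(f"{rule!r:28} {path:24} allowed={allowed(rule, path)}")
```

Only the rule without the trailing slash blocks /blog/tags-explained/; both rules block URLs under /blog/tags/.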

5. Not Including a Sitemap Reference

Roughly 20% of the sites we audit have no Sitemap: directive in their robots.txt. It’s a missed opportunity to help crawlers discover your content faster.

A Clean Robots.txt Template

Here’s a solid starting point for most websites:

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /search/
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /thank-you/
    Disallow: /?s=
    Disallow: /*?sort=
    Disallow: /*?filter=
    Allow: /wp-admin/admin-ajax.php
    Sitemap: https://yoursite.com/sitemap.xml

Adapt this to your CMS, site structure, and business needs. There’s no one-size-fits-all robots.txt; what matters is that you’re intentional about every rule.

Frequently Asked Questions

Does robots.txt stop pages from appearing in Google search results?

No. Robots.txt prevents crawling, not indexing. If Google finds links to a blocked page from other sources, it may still show the URL in search results — just without a description. To keep a page out of search results entirely, use a noindex meta tag or an X-Robots-Tag HTTP header instead.
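
For non-HTML files or whole URL prefixes, the X-Robots-Tag header is usually set at the server or application level. As a rough sketch, assuming a Python WSGI application, it could look like the following. The /internal-preview/ prefix is a hypothetical path, and the URLs it covers must remain crawlable in robots.txt or the header will never be seen.

```python
# Hypothetical URL prefixes that should stay out of search results.
NOINDEX_PREFIXES = ("/internal-preview/",)


def app(environ, start_response):
    """Minimal WSGI app that adds X-Robots-Tag: noindex for chosen paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # Equivalent to <meta name="robots" content="noindex"> in the HTML.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>OK</title>"]
```

The app can be served locally with wsgiref.simple_server.make_server from the standard library to inspect the response headers.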

How often does Google check my robots.txt file?

Google generally caches your robots.txt file for up to 24 hours, and it may keep using a cached copy for longer if it can’t re-fetch the file (for example, because of timeouts or server errors). If you make urgent changes, you can use the robots.txt report in Google Search Console to request a recrawl and verify that the updated rules are being read.

Can I use robots.txt to block AI crawlers and scrapers?

Yes. You can add rules for specific AI-related user agents like GPTBot, CCBot, or Google-Extended. For example, User-agent: GPTBot followed by Disallow: / will block OpenAI’s crawler from accessing your site. Keep in mind that only crawlers that respect the Robots Exclusion Protocol will honor these rules.
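
A per-crawler rule like this can be checked locally with Python’s standard urllib.robotparser. The sketch below is illustrative: the robots.txt body and the helper name can_crawl are assumptions, and the check only models crawlers that follow the protocol.

```python
from urllib.robotparser import RobotFileParser

# Example file: block GPTBot everywhere, keep other crawlers out of wp-admin.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""


def can_crawl(agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the named crawler may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_crawl(agent, "/blog/"))
```

A crawler with its own User-agent group follows only that group, so GPTBot is blocked from /blog/ while Googlebot falls through to the catch-all rules and stays allowed.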

Where should the robots.txt file be located?

Your robots.txt file must be at the root of your domain, for example https://yoursite.com/robots.txt. It won’t work if placed in a subdirectory. Each subdomain also needs its own robots.txt file if you want to control crawling on those subdomains separately.

Not sure if your robots.txt is set up right? A single misconfigured rule could be hiding your best content from Google. Our team reviews robots.txt files, sitemaps, crawl errors, and every other technical SEO factor as part of our SEO audit. Or explore our SEO packages for ongoing technical SEO management. Ready to fix what’s holding your site back? Contact eSEOspace today.

Meet the Authors

Irina Shvaya

Benjamin Gunther
