    Robots.txt Guide: What to Block, What to Allow & Common Mistakes

    By: Irina Shvaya | June 4, 2026
    A single misplaced line in your robots.txt file can hide your entire website from Google. It sounds dramatic, but we’ve seen it happen: a staging directive left in place after launch, a broad Disallow rule that blocks critical pages, or a forgotten wildcard that shuts out Googlebot entirely. According to Google, the robots.txt file is one of the first things its crawler checks before crawling a single page on your site. In this robots.txt guide, we break down exactly what this file does, walk through every directive you need to know, and show you the most common mistakes that silently sabotage your SEO.

    Key Takeaways

    • Robots.txt tells search engine crawlers which URLs they can and cannot access on your site.
    • Always block admin pages, staging environments, internal search results, and duplicate URL parameters.
    • Never block CSS, JavaScript, or image files — Google needs them to render and understand your pages.
    • Check your robots.txt with Google Search Console’s robots.txt report and the URL Inspection tool before pushing changes live.
    • A misconfigured robots.txt is one of the most common technical SEO issues we find during an SEO audit.

    What Is Robots.txt and Why Does It Matter for SEO?

    Robots.txt is a plain text file that lives at the root of your domain (e.g., yoursite.com/robots.txt). It follows the Robots Exclusion Protocol, a standard that tells web crawlers which parts of your site they’re allowed to access and which parts they should skip. Think of it as a bouncer at the door of your website. It doesn’t lock anyone out physically (badly behaved bots can ignore it, and a blocked URL can still end up indexed if other sites link to it), but well-behaved bots like Googlebot respect the rules you set.

    Why it matters for robots.txt SEO:
    • Crawl budget management: Googlebot will only crawl a limited number of URLs on your site in a given period. Robots.txt helps you direct that budget toward the pages that actually matter.
    • Preventing index bloat: Blocking low-value pages like internal search results or admin panels keeps your index lean and focused.
    • Protecting sensitive areas: While it’s not a security measure, it keeps bots from wasting time on staging environments, login pages, and backend scripts.
    If your robots.txt is misconfigured, you might be wasting crawl budget on junk pages — or worse, blocking Google from your most important content. This is one of the first things we check when diagnosing crawl errors and indexing issues on client sites.

    Robots.txt Syntax Explained: The Core Directives

    The robots.txt file uses a simple, line-based syntax. Here are the four directives you’ll use most often:

    User-agent

    Specifies which crawler the following rules apply to.

        User-agent: Googlebot
        User-agent: *

    The asterisk (*) targets all crawlers. You can also set rules for specific bots like Bingbot, Googlebot-Image, or AdsBot-Google.

    Disallow

    Tells the specified crawler not to access a particular URL path.

        Disallow: /admin/
        Disallow: /staging/

    An empty Disallow: directive means “allow everything”, which is a common source of confusion.

    Allow

    Overrides a Disallow rule for a specific path within a blocked directory.

        Disallow: /resources/
        Allow: /resources/guides/

    This blocks everything in /resources/ except the /resources/guides/ subdirectory.

    Sitemap

    Points crawlers to your XML sitemap so they can discover all your important pages.

        Sitemap: https://yoursite.com/sitemap.xml

    Always include this directive. Your sitemap and your robots.txt should work together: your sitemap lists the pages you want indexed, and your robots.txt blocks the ones you don’t. For a deeper dive, see our guide on sitemaps and how they boost crawl efficiency.
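
    The line-based syntax above can be read with a few lines of Python. The following is a minimal sketch, not a full Robots Exclusion Protocol parser: the function name parse_robots and the sample file are our own illustrations, and the code ignores wildcards and any directive other than the four covered here.

```python
EXAMPLE = """\
User-agent: Googlebot
User-agent: *
Disallow: /admin/
Disallow: /resources/
Allow: /resources/guides/
Sitemap: https://yoursite.com/sitemap.xml
"""

def parse_robots(text):
    """Split robots.txt text into per-agent rule lists and sitemap URLs."""
    groups = {}              # user-agent (lowercased) -> [(directive, path), ...]
    sitemaps = []
    current_agents = []      # agents named by the group currently being read
    reading_agents = False   # True while consecutive User-agent lines are seen
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:            # first User-agent line of a new group
                current_agents = []
                reading_agents = True
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        elif field == "sitemap":              # applies to the whole file, not a group
            sitemaps.append(value)
    return groups, sitemaps

groups, sitemaps = parse_robots(EXAMPLE)
print(groups["*"])
print(sitemaps)
```

    Consecutive User-agent lines share the rules that follow them, which is why both Googlebot and * end up with the same list in this sample.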

    What to Block in Robots.txt

    Not every page on your site deserves to be crawled. Here’s what you should typically block:

    1. Admin and Backend Pages

        Disallow: /wp-admin/
        Disallow: /admin/
        Disallow: /backend/

    These pages offer zero value to search engines and waste crawl budget.

    2. Staging and Development Environments

        Disallow: /staging/
        Disallow: /dev/

    If your staging site is on a subdomain (e.g., staging.yoursite.com), make sure that subdomain has its own robots.txt blocking all access. We regularly find staging sites that are fully indexed by Google, which is a duplicate content nightmare.

    3. Internal Search Results Pages

        Disallow: /search/
        Disallow: /?s=

    Internal search result pages create near-infinite URL combinations and offer thin, duplicated content. Google specifically recommends blocking these.

    4. Duplicate URL Parameters

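        Disallow: /*?sort=
        Disallow: /*?filter=
        Disallow: /*?ref=

    Faceted navigation and tracking parameters generate thousands of duplicate pages. Block the parameters that don’t produce unique content. Note that a pattern like /*?sort= only matches when sort is the first parameter in the query string; add a second rule such as Disallow: /*&sort= if the parameter can appear later.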
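
    To see how wildcard rules like these behave, here is a small sketch that translates a robots.txt path pattern into a regular expression. It assumes Google-style pattern semantics (* matches any run of characters, a trailing $ anchors the end of the URL); the helper names and sample URLs are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is treated as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def matches(pattern, path):
    # re.match anchors at the start, which mirrors prefix matching
    return pattern_to_regex(pattern).match(path) is not None

for pattern, path in [
    ("/*?sort=", "/shoes?sort=price"),
    ("/*?sort=", "/shoes?color=red&sort=price"),
    ("/*&sort=", "/shoes?color=red&sort=price"),
    ("/*.css$", "/wp-content/themes/main.css"),
]:
    print(pattern, path, matches(pattern, path))
```

    Running a handful of real URLs from your own site through a check like this before deploying a wildcard rule is a cheap way to catch patterns that match more, or less, than intended.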

    5. Thank-You and Confirmation Pages

        Disallow: /thank-you/
        Disallow: /order-confirmation/

    These pages have no search value and can expose conversion flow details.

    What NOT to Block in Robots.txt

    This is where most of the damage happens. Here’s what you should never block:

    CSS and JavaScript Files

        # WRONG: do not do this
        Disallow: /wp-content/themes/
        Disallow: /wp-includes/
        Disallow: /*.css$
        Disallow: /*.js$

    Google needs access to your CSS and JavaScript to render your pages properly. If Googlebot can’t load your stylesheets or scripts, it sees a broken, unstyled version of your page and ranks it accordingly. Google has explicitly stated that blocking CSS and JS can negatively impact indexing.

    Images

    Don’t block image directories unless you have a specific legal reason to do so. Images contribute to your overall SEO through Google Image Search, and they help Google understand page context.

    Key Content Pages

    This sounds obvious, but overly broad Disallow rules catch important pages more often than you’d think. A rule like Disallow: /blog/draft will also block /blog/drafting-contracts-guide/ if you’re not careful with trailing slashes.

    Your Sitemap

    Never block the path to your XML sitemap. Crawlers need to access it to find your pages efficiently.

    How to Test Your Robots.txt File

    Writing your robots.txt is only half the job. You need to verify it works the way you intend.

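    Google Search Console Robots.txt Report

    Google retired its standalone robots.txt Tester in late 2023; the robots.txt report in Search Console replaces it.

    1. Open Google Search Console.
    2. Navigate to Settings > Crawling > robots.txt to open the report.
    3. Check that your file was fetched successfully and review any warnings or errors it lists.
    4. To check whether a specific URL is blocked or allowed under your current rules, run it through the URL Inspection tool.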

    Manual Testing Checklist

    Before pushing any robots.txt changes to production, verify these:
    • ☐ Is your homepage allowed? (/)
    • ☐ Are your main category and service pages allowed?
    • ☐ Can Googlebot access your CSS and JS files?
    • ☐ Is your sitemap URL included and accessible?
    • ☐ Are staging-specific rules removed from the production file?
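
    Several of the checks above can be scripted before a deploy. The sketch below uses Python’s standard-library urllib.robotparser; the function name blocked_urls, the sample files, and the URL list are illustrative assumptions. To our knowledge that parser matches rules by plain path prefix and does not implement Google’s wildcard handling, so treat this as a rough pre-check rather than a substitute for Search Console.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_lines, must_allow, user_agent="Googlebot"):
    """Return the URLs from must_allow that the given rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_lines)   # parse from memory, no network request
    return [url for url in must_allow if not parser.can_fetch(user_agent, url)]

# Hypothetical production file and a leftover staging file
PRODUCTION = [
    "User-agent: *",
    "Disallow: /wp-admin/",
    "Disallow: /search/",
    "Sitemap: https://yoursite.com/sitemap.xml",
]
STAGING_LEFTOVER = ["User-agent: *", "Disallow: /"]

# Hypothetical URLs that must stay crawlable: homepage, a service page,
# a stylesheet, a script, and the sitemap
MUST_ALLOW = [
    "https://yoursite.com/",
    "https://yoursite.com/services/",
    "https://yoursite.com/wp-content/themes/site/style.css",
    "https://yoursite.com/wp-includes/js/app.js",
    "https://yoursite.com/sitemap.xml",
]

print("production blocks:", blocked_urls(PRODUCTION, MUST_ALLOW))
print("staging leftover blocks:", len(blocked_urls(STAGING_LEFTOVER, MUST_ALLOW)), "URLs")
```

    An empty list for the production file and a full list for the staging leftover is the result you want to see; anything else deserves a look before launch.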

    Third-Party Crawl Tools

    Tools like Screaming Frog and Sitebulb let you crawl your own site while respecting (or ignoring) robots.txt rules. This is a great way to see exactly which pages are blocked before Google encounters them. A comprehensive SEO audit should always include a robots.txt review.

    Common Robots.txt Mistakes That Hurt Your SEO

    We’ve audited hundreds of websites, and these are the robots.txt mistakes we see over and over:

    1. Blocking the Entire Site After Launch

        # Left over from staging: disastrous in production
        User-agent: *
        Disallow: /

    This single rule tells every crawler to stay away from every page. It’s the nuclear option, and it’s usually a leftover from a pre-launch staging environment. Always review your robots.txt on launch day.

    2. Using Robots.txt Instead of Noindex

    Robots.txt prevents crawling, but it doesn’t prevent indexing. If other sites link to a page you’ve blocked in robots.txt, Google may still index the URL; it just won’t know what’s on it. If you want to remove a page from search results, use a noindex meta tag instead, and leave the page crawlable so Google can actually see the tag. For more on how crawling and indexing work together, see our complete technical SEO guide.

    3. Conflicting Rules Without Priority Understanding

        User-agent: *
        Disallow: /products/
        Allow: /products/featured/

    Google uses the most specific rule that matches the URL, meaning the one with the longest path. In this case, /products/featured/ would be allowed because it’s more specific. But not all crawlers handle conflicts the same way, so always test.
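
    The difference between precedence strategies is easy to demonstrate. The sketch below compares a longest-match rule (the behavior Google documents, with Allow winning a tie) against a first-match rule, which we assume here as a stand-in for simpler parsers; it is not a claim about any particular bot. Function names are hypothetical, and wildcards are ignored to keep the example short.

```python
RULES = [("disallow", "/products/"), ("allow", "/products/featured/")]

def longest_match_allowed(rules, path):
    """Longest matching rule wins; on equal length, Allow wins."""
    best_len, allowed = -1, True   # no matching rule means the path is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed

def first_match_allowed(rules, path):
    """The first rule in file order that matches decides the outcome."""
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            return directive == "allow"
    return True

for path in ("/products/featured/", "/products/widgets/", "/about/"):
    print(path, longest_match_allowed(RULES, path), first_match_allowed(RULES, path))
```

    The two strategies disagree only on /products/featured/. Listing the Allow line before the broader Disallow makes both strategies agree, which is a low-cost way to stay safe with parsers you can’t test.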

    4. Forgetting the Trailing Slash

        Disallow: /blog/tags    # Blocks /blog/tags but also /blog/tags-explained/
        Disallow: /blog/tags/   # Only blocks URLs under /blog/tags/

    A missing trailing slash can accidentally block pages that share a similar URL prefix. Be precise.
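
    Because a rule without wildcards is a plain prefix match, the effect of the trailing slash can be checked in a couple of lines. This is a simplified sketch; blocked_by and the sample paths are our own.

```python
def blocked_by(rule_path, url_path):
    """Plain prefix match, as used for rules without wildcards."""
    return url_path.startswith(rule_path)

PATHS = ["/blog/tags", "/blog/tags/seo/", "/blog/tags-explained/"]

for rule in ("/blog/tags", "/blog/tags/"):
    print(rule, "blocks", [p for p in PATHS if blocked_by(rule, p)])
```

    With the trailing slash, only URLs inside the /blog/tags/ directory match; the bare /blog/tags URL itself is no longer covered, so decide which behavior you actually want.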

    5. Not Including a Sitemap Reference

    Roughly 20% of the sites we audit have no Sitemap: directive in their robots.txt. It’s a missed opportunity to help crawlers discover your content faster.

    A Clean Robots.txt Template

    Here’s a solid starting point for most websites:

        User-agent: *
        Disallow: /wp-admin/
        Disallow: /search/
        Disallow: /cart/
        Disallow: /checkout/
        Disallow: /thank-you/
        Disallow: /?s=
        Disallow: /*?sort=
        Disallow: /*?filter=
        Allow: /wp-admin/admin-ajax.php

        Sitemap: https://yoursite.com/sitemap.xml

    Adapt this to your CMS, site structure, and business needs. There’s no one-size-fits-all robots.txt; what matters is that you’re intentional about every rule.

    Frequently Asked Questions

    Does robots.txt stop pages from appearing in Google search results?

    No. Robots.txt prevents crawling, not indexing. If Google finds links to a blocked page from other sources, it may still show the URL in search results — just without a description. To keep a page out of search results entirely, use a noindex meta tag or an X-Robots-Tag HTTP header instead.
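
    As an illustration of the header approach, here is a minimal WSGI sketch that adds X-Robots-Tag: noindex to selected paths. The app, the path prefixes, and the response body are hypothetical; in practice this header is often set in the web server or CMS configuration instead.

```python
# Hypothetical paths that should stay out of search results
NOINDEX_PREFIXES = ("/thank-you/", "/order-confirmation/")

def app(environ, start_response):
    """Tiny WSGI app that marks selected paths as noindex via a header."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # The page must remain crawlable, or crawlers never see this header
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]
```

    The same paths should not also be disallowed in robots.txt, since a crawler that never requests the page never receives the header.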

    How often does Google check my robots.txt file?

    Google generally caches your robots.txt file for up to 24 hours, though it may keep a cached copy longer if it cannot re-fetch the file (for example, during timeouts or server errors). If you make urgent changes, you can open the robots.txt report in Google Search Console to request a recrawl and confirm that the updated rules are being read.

    Can I use robots.txt to block AI crawlers and scrapers?

    Yes. You can add rules for specific AI-related user agents like GPTBot, CCBot, or Google-Extended. For example, User-agent: GPTBot followed by Disallow: / will block OpenAI’s crawler from accessing your site. Keep in mind that only crawlers that respect the Robots Exclusion Protocol will honor these rules.
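
    A per-bot rule like this can be sanity-checked locally with Python’s standard-library urllib.robotparser. The rules and URLs below are a hypothetical example, and the result only reflects how this particular parser reads the file, not how any given crawler will behave.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())   # parse from memory, no network request

for agent in ("GPTBot", "Googlebot"):
    for url in ("https://yoursite.com/blog/", "https://yoursite.com/wp-admin/"):
        print(agent, url, parser.can_fetch(agent, url))
```

    GPTBot falls under its own group and is blocked everywhere, while other agents fall through to the * group and are only kept out of /wp-admin/.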

    Where should the robots.txt file be located?

    Your robots.txt file must be at the root of your domain, for example https://yoursite.com/robots.txt. It won’t work if placed in a subdirectory. Each subdomain also needs its own robots.txt file if you want to control crawling on those subdomains separately.

    Not sure if your robots.txt is set up right? A single misconfigured rule could be hiding your best content from Google. Our team reviews robots.txt files, sitemaps, crawl errors, and every other technical SEO factor as part of our SEO audit. Or explore our SEO packages for ongoing technical SEO management. Ready to fix what’s holding your site back? Contact eSEOspace today.
