Robots.txt explained: a simple guide to web crawler rules
A robots.txt file tells web crawlers which parts of a site they can or can't visit. This post covers the basics, the most common directives, and best practices.
What is robots.txt?
Robots.txt is a plain text file placed at the root of a domain (for example, https://example.com/robots.txt) that tells well-behaved web crawlers which parts of the site to crawl or ignore. It is a voluntary convention; not every bot uses it, and it is not a security barrier.
How robots.txt works
Robots.txt uses a simple, group-based syntax. Each group starts with a User-agent line, followed by one or more Disallow or Allow rules. Groups are separated by blank lines. Lines such as Sitemap can appear in the file to point crawlers to your sitemap.
The basics of syntax
A group looks like this:
User-agent: <name-or-wildcard>
Disallow: /blocked-path/
Allow: /blocked-path/exception/
You can have multiple groups in a single file, and paths are relative to the site root.
Scope, and a caveat about security
Robots.txt controls crawling, not access control. A page can still be served to anyone; blocking it in robots.txt will not stop someone from fetching the URL directly if they know it.
Common directives
User-agent
Specifies which crawlers a group applies to. Use * for all bots or name a specific bot.
Disallow
Tells a crawler not to visit a path. An empty value means "no restriction" in that group.
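To make the empty value concrete, "Disallow: /" blocks a crawler from the entire site, while an empty Disallow allows it everywhere (the bot names below are hypothetical):
User-agent: BadBot
Disallow: /

User-agent: GoodBot
Disallow: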
Allow
Overrides a Disallow for a more specific sub-path. Major search engine crawlers such as Googlebot support it, but it is not universally honored.
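For instance, a group like the following (the paths are made up for illustration) blocks a folder but still permits one file inside it, for crawlers that support Allow:
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html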
Crawl-delay
Requests a minimum delay, in seconds, between successive requests from a crawler. Not all bots honor this.
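For example, the following asks a hypothetical crawler named SlowBot to wait 10 seconds between requests; bots that do not support Crawl-delay (Googlebot among them) simply ignore the line:
User-agent: SlowBot
Crawl-delay: 10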
Sitemap
Points to the sitemap file. The directive is independent of any group and can appear anywhere in the file, though it is commonly placed at the end.
Examples
Example 1: Block all bots from /private/
User-agent: *
Disallow: /private/
Example 2: Block /private/ for all bots, but give Googlebot full access
User-agent: *
Disallow: /private/
User-agent: Googlebot
Disallow:
(A crawler follows only the most specific group that matches it, so Googlebot obeys this empty rule set and ignores the * group.)
Example 3: Include a sitemap and block a folder
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Limitations and best practices
- Place robots.txt at the site root so it is discoverable by crawlers.
- Use it to guide crawling behavior, not to hide sensitive data securely.
- Do not rely on robots.txt to protect confidential information; use authentication for that.
- Keep the file up-to-date and consistent with any per-page noindex strategies (see noindex notes below).
Testing and verification
- Fetch your robots.txt directly in a browser to confirm the content; you can also check specific URLs programmatically (see the sketch after this list).
- Use search engine tools like the webmaster console’s robots.txt tester if available.
- Check server logs to verify which sections are being crawled and which are not.
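As a quick sketch, Python's standard urllib.robotparser module can fetch a robots.txt file and report whether a given user agent is allowed to crawl a given URL (the domain and paths below are placeholders):

from urllib import robotparser

# Download and parse the live robots.txt file
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether specific crawlers may fetch a given URL
print(rp.can_fetch("*", "https://example.com/private/page.html"))          # False if /private/ is disallowed for *
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))

# Crawl-delay declared for a user agent, if any (None when absent)
print(rp.crawl_delay("SlowBot"))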
SEO and indexing considerations
- Disallowing pages prevents Google and other engines from crawling them, which often means they won’t be indexed.
- If you want a page to stay out of search results entirely, robots.txt alone won't suffice; a blocked URL can still be indexed (without its content) if other pages link to it. Use a noindex meta tag or an X-Robots-Tag header instead (see the snippet after this list), and note that noindex only works if crawlers can fetch the page, so don't block it in robots.txt at the same time.
- For assets like images, blocking them in robots.txt keeps them from being crawled and indexed, but it won't necessarily stop direct access.
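For reference, a minimal noindex signal looks like one of the following, either as a tag in the page's <head> or as an HTTP response header set by the server:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex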
Security and privacy notes
- Robots.txt is public; anyone can read it. Do not put secrets in this file.
- It is not a security boundary. If a page is sensitive, require authentication or proper access controls.
- Some bots ignore robots.txt, so do not rely on it to enforce privacy.