Tampa Tech Wire - News and Technology From Around The Bay                  

Robots.txt – When and Why to use it

Facebook
Twitter
LinkedIn
Pinterest
Pocket
WhatsApp
A robots.txt file is a text file read by search engines (and other systems). Also called the Robots Exclusion Protocol, the robots.txt file results from a consensus among early search engine developers.

Warning!

Any mistakes you make in your robots.txt can seriously harm your site, so make sure you read and understand the whole of this article before you dive in.

Table of contents

What is a robots.txt file?

A robots.txt file is a text file read by search engines (and other systems). Also called the Robots Exclusion Protocol, the robots.txt file results from a consensus among early search engine developers. It’s not an official standard set by any standards organization, although all major search engines adhere to it.

A basic robots.txt file might look something like this:

User-Agent: *
Disallow:

Sitemap: https://www.example.com/sitemap_index.xml

What does the robots.txt file do?

Search engines discover and index the web by crawling pages. As they crawl, they discover and follow links. This takes them from site A to site B to site C, and so on. But before a search engine visits any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. That lets them know which URLs on that site they’re allowed to visit (and which ones they’re not).

Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, the crawler should find it at https://www.example.com/robots.txt.

It’s also essential that your robots.txt file is called robots.txt. The name is case sensitive, so get that right, or it won’t work.

Pros and cons of using robots.txt

Pro: managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or, how much resource/time it’ll spend, based on a site’s authority/size/reputation, and how efficiently the server responds). SEOs call this the crawl budget.

If you think your website has problems with crawl budget, then blocking search engines from ‘wasting’ energy on unimportant parts of your site might mean that they focus instead on the sections that do matter.

It can sometimes be beneficial to block the search engines from crawling problematic sections of your site, especially on sites where a lot of SEO clean-up has to be done. Once you’ve tidied things up, you can let them back in.

A note on blocking query parameters

One situation where crawl budget is crucial is when your site uses a lot of query string parameters to filter or sort lists. Let’s say you have ten different query parameters, each with different values that can be used in any combination (like t-shirts in multiple colors and sizes). This leads to many possible valid URLs, all of which might get crawled. Blocking query parameters from being crawled will help ensure the search engine only spiders your site’s main URLs and won’t go into the enormous trap you’d otherwise create.

Con: not removing a page from search results

Even though you can use the robots.txt file to tell a crawler where it can’t go on your site, you can’t use it to say to a search engine which URLs not to show in the search results – in other words, blocking it won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it; it will just not know what’s on that page. So your result will look like this:

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means that to find the noindex tag, the search engine has to be able to access that page, so don’t block it with robots.txt.

 Noindex directives

It used to be possible to add ‘noindex’ directives in your robots.txt, to remove URLs from Google’s search results, and to avoid these ‘fragments’ showing up. This is no longer supported (and technically, never was).

Con: not spreading link value

If a search engine can’t crawl a page, it can’t spread the link value across the links on that page. It’s a dead-end when you’ve blocked a page in robots.txt. Any link value which might have flowed to (and through) that page is lost.

Robots.txt syntax

 WordPress robots.txt

Yoast has an entire article on how best to setup your robots.txt for WordPress. Don’t forget you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or particular blocks for particular search engines. A search engine spider will always pick the block that best matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: * 
Disallow: / 

User-agent: Googlebot 
Disallow: 

User-agent: bingbot 
Disallow: /not-for-bing/

Directives like Allow and Disallow should not be case sensitive, so it’s up to you to write them lowercase or capitalize them. The values are case sensitive, however, /photo/ is not the same as /Photo/. We like to capitalize directives because it makes the file easier (for humans) to read.

The user-agent directive

The first bit of every block of directives is the user-agent, which identifies a specific spider. The user-agent field matches with that specific spider’s (usually longer) user-agent, so, for instance, the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

If you want to tell this crawler what to do, a relatively simple User-agent: Googlebot line will do the trick.

Most search engines have multiple spiders. They will use a specific spider for their normal index, ad programs, images, videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have three sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.

The most common user agents for search engine spiders

Here’s a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engineFieldUser-agent
BaiduGeneralbaiduspider
BaiduImagesbaiduspider-image
BaiduMobilebaiduspider-mobile
BaiduNewsbaiduspider-news
BaiduVideobaiduspider-video
BingGeneralbingbot
BingGeneralmsnbot
BingImages & Videomsnbot-media
BingAdsadidxbot
GoogleGeneralGooglebot
GoogleImagesGooglebot-Image
GoogleMobileGooglebot-Mobile
GoogleNewsGooglebot-News
GoogleVideoGooglebot-Video
GoogleAdSenseMediapartners-Google
GoogleAdWordsAdsBot-Google
Yahoo!Generalslurp
YandexGeneralyandex

The disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying which parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything so that a spider can access all sections of your site.

Facebook
Twitter
LinkedIn
Pinterest
Pocket
WhatsApp
Your subscription could not be saved. Please try again.
Thanks for subscribing!

Newsletter

Never miss any important news. Subscribe to our newsletter.

Leave a Reply

Your subscription could not be saved. Please try again.
Thanks for subscribing!

Newsletter

Never miss any important news. Subscribe to our newsletter.

Latest Jobs

Recent News

Popular

Blog Subscriber Form