Plannedscape Postings

Robots.txt File
You Have One On Your Website, Right?

Posted by Charlie Recksieck on 2022-01-06
This article is here to make sure you understand the role of the robots.txt file on a website. If you have anything to do with servers, running your own website, or SEO and marketing, then you've really got to know this stuff. If those roles have nothing to do with you and you have no native curiosity about robots.txt files, then you're free to go. See you soon.


What Is A Robots.txt File?

It's a simple text file in the root folder of your website that tells crawlers (Google and other search engines, plus applications like Twitter) which folders and files on your site they may crawl. In practice, it tells them what is for public consumption and what should be listed on Google.

Along with the folders you want crawled, you declare the specific files or subfolders you do NOT want crawled. This gives you some control over the files and pages you have on your domain/site. The "disallowed" areas are still there on your site for anybody who knows the URL. That's unlikely, but it's good to remember that we are concerned here with what Google crawls and indexes, not actual data security. Robots.txt isn't a tool to keep people out; it's more a way to keep Google from telling people the pages exist.
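
To see what "disallowed" means to a well-behaved crawler, here's a minimal Python sketch using the standard library's urllib.robotparser module. The rules and paths (/drafts/, /private-page.html) and the is_crawlable helper are made up for illustration, and real crawlers like Googlebot use their own parsers, so treat this as an approximation of how a polite crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /private-page.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path, agent="*"):
    """Return True if the parsed rules let `agent` crawl `path`."""
    return parser.can_fetch(agent, path)


for path in ("/index.html", "/drafts/next-week.html", "/private-page.html"):
    print(path, "->", "crawl" if is_crawlable(path) else "skip")
```

Note that nothing here blocks a browser: the disallowed pages still load for anyone who types in the URL. The file only tells crawlers that choose to obey it what to skip.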


Creating The File

This really needs to be a "flat" (plain) text file. Basically, plain UTF-8 characters and no hidden formatting. In other words, do not create it in Microsoft Word or another word processor. Your PC should already have Notepad or WordPad on it, and that's great; Mac users should create it in the TextEdit application (in plain-text mode). (If you want an even better text editor for daily use, we recommend Textpad or Notepad++.) But if you have to edit it in a word processor like MS Word, make sure to do a Save As to make it a UTF-8 text file.

This file has to be named robots.txt, no wiggle room there.
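
If you'd rather generate the file from a script than from a text editor, a minimal Python sketch might look like the following. The write_robots_txt function, the sample rules, and the folder you pass in are all hypothetical; the only fixed parts are the robots.txt file name and the UTF-8 encoding.

```python
from pathlib import Path

# Hypothetical rules, for illustration only.
ROBOTS_TXT = "User-agent: *\nDisallow: /drafts/\n"


def write_robots_txt(site_root, content=ROBOTS_TXT):
    """Write `content` to <site_root>/robots.txt as plain UTF-8 text."""
    # The file name is fixed: crawlers only ask for "robots.txt".
    target = Path(site_root) / "robots.txt"
    target.write_text(content, encoding="utf-8")
    return target
```

Calling write_robots_txt("/path/to/your/site/root") would then drop the file in that folder, assuming the folder exists and you have write access to it.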

Keep in mind that folder references are case-sensitive. (That's a whole other topic for another day about case-sensitive URLs on your site and in your .htaccess file.)
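
Here's a quick sketch of why the case matters, again using Python's standard urllib.robotparser as a stand-in for a real crawler. The photo.png file name is hypothetical, and other parsers may differ in details, but path matching is case-sensitive in this one.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /BlogImages/",
])

# Same folder name, different capitalization.
exact_case_allowed = parser.can_fetch("*", "/BlogImages/photo.png")
lower_case_allowed = parser.can_fetch("*", "/blogimages/photo.png")

print(exact_case_allowed, lower_case_allowed)  # False True
```

The rule only catches the path written exactly as it appears in the file, so a typo in capitalization quietly leaves the folder open to crawling.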

Where To Place It: It goes in the root folder of the domain; that's the only place crawlers look for it.
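
Put another way, whatever page a crawler is about to visit, it works out the robots.txt address from the domain alone. A small sketch of that rule in Python (robots_txt_url is a hypothetical helper name and example.com is a placeholder domain):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/2022/post.html?ref=home"))
# https://www.example.com/robots.txt
```

So a robots.txt sitting in a subfolder such as /blog/robots.txt is simply never requested.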


What It Looks Like



User Agents and Disallow

As you can see from our example above, we are disallowing crawling of both folders and individual files. Since the robots.txt file is in the root folder of this domain, the URL references are all relative to the root folder.

Also, you can see that we can control access paths for specific web crawlers (e.g. Twitterbot) vs. all crawlers (the * asterisk specification you see). In this case, we didn't want Google to index some images in the BlogImages folder, but we did want Twitter's crawler, Twitterbot, to be able to access that folder, so we granted access with a specific "allow" command. (The reason here is to allow Twitter to show preview images from that directory.)
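
A hypothetical file along the lines described above can be checked with Python's standard urllib.robotparser. The rules below are an illustration, not a copy of our actual file: only the BlogImages folder and the Twitterbot name come from the description, and the other paths are invented.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; paths other than /BlogImages/ are made up.
ROBOTS_TXT = """\
User-agent: *
Disallow: /BlogImages/
Disallow: /drafts/
Disallow: /old-page.html

User-agent: Twitterbot
Allow: /BlogImages/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image = "/BlogImages/robots-post.png"
google_ok = parser.can_fetch("Googlebot", image)    # falls under the * group
twitter_ok = parser.can_fetch("Twitterbot", image)  # has its own group

print("Googlebot:", google_ok)    # Googlebot: False
print("Twitterbot:", twitter_ok)  # Twitterbot: True
```

One thing to watch: a crawler that has its own User-agent group generally follows only that group and ignores the * rules. In this sketch that means Twitterbot would also be free to crawl /drafts/, so repeat any disallows you still want inside the bot-specific group.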


Possible Uses For Disallows In Robots.txt


  • Blog posts that haven't been posted yet (they are on the server, but have a specified go-live date)

  • A copy of the entire site in a subfolder, which gives you a full testing sandbox or staging version that everybody internally can test but that won't be visible to Google

  • Avoiding "Duplicated Content" ... the same page sometimes can be reached with different URLs on a site; we only want it to be indexed once. For instance, in WordPress we would want to make sure that Google only indexes the presented version of WordPress pages and does NOT index the underlying location, which may be the www.yoursite.com/wp-admin/ directory

  • Protecting image directories

  • Disallowing entire file types (images, videos - e.g. "Disallow: /*.mp4$")

  • Supporting folders, like JavaScript and CSS folders
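
The file-type rule in the list above uses two pattern characters that Google and most major crawlers understand: * matches any run of characters, and a trailing $ anchors the end of the URL. As far as we know, Python's built-in urllib.robotparser only does plain prefix matching and doesn't interpret them, so the sketch below rolls its own tiny matcher. The function names are hypothetical, and this is a simplification of what real crawlers do, not their actual code.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let every '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of `path`."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


rules = ["/wp-admin/", "/*.mp4$"]
for path in ("/videos/intro.mp4", "/wp-admin/options.php", "/blog/post.html"):
    print(path, "->", "blocked" if is_blocked(path, rules) else "allowed")
```

With the $ in place, /videos/intro.mp4 is blocked but /videos/intro.mp4?autoplay=1 is not, because the URL no longer ends in .mp4. Drop the $ if you want to catch both.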




Sitemaps

Though it might seem like a dated concept, "sitemaps" are still a good thing. Here's a great definition. If you have a large amount of content on your site, XML sitemaps give you more ability to tell crawlers where to look.

You just place the sitemap location in your robots.txt file like this:

Sitemap: http://www.bigfellas.net/sitemap.xml
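
To double-check that a crawler can pick that line up, here's a small sketch using Python's standard urllib.robotparser, whose site_maps() method (Python 3.8 and later) returns the Sitemap entries it found. The Disallow rule is a made-up placeholder; only the Sitemap line comes from the example above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

Sitemap: http://www.bigfellas.net/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://www.bigfellas.net/sitemap.xml']
```

The Sitemap line isn't tied to any User-agent group, so it can sit anywhere in the file, and you can list more than one.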


To read more on placement of sitemap XML files, check out this solid primer.


If Things Have Been Indexed That You Don't Want

If you messed up the placement of your files previously and some pages are publicly indexed that you don't want indexed (and you don't want to change your whole folder hierarchy), there is a procedure to tell Google to remove certain pages from its public index. Here: https://pearanalytics.com/how-to-remove-pages-from-googles-index/


Levels Of Privacy For Pages

I want to reiterate: disallowing something in robots.txt does not prevent anybody from seeing the disallowed folders and files; it just keeps crawlers out. But with these pages not indexed, it's less likely that people will know they are there and make you a target. Here's a quick rundown of the levels of security on your site, from most open to most locked down:


  • Pages you're allowing for everybody

  • Areas disallowed from crawling

  • Password-protected folders / Password-protected pages

  • Log-in required pages

  • Pages/folders that can only be viewed from "white-listed" IP addresses (a great precaution for sensitive data or testing sites)



We strongly discourage keeping any part of your "cloud" of centralized files or folders inside the folders that make up your website. As we described in our article about Google dorking, some people make the mistake of leaving private spreadsheets in a public folder. Don't be that guy. Also, read that article if you want to learn how to take advantage of others leaving things like budgets or contact lists up on their site.