Dec 24, 2011

How to Create a WordPress Friendly robots.txt File

How to Create a WordPress Friendly robots.txt File  阅读原文»

Let's get one thing clear. Robots.txt isn't just a fancy file for webmaster-purists and professional SEOs. In fact, every WordPress developer should know a thing or two about the file and why it's so important for every blog's SEO.

robot

So first, here's the big question:

What is robots.txt and why is it important?

Speaking as the captain obvious: it's simply a file. But there is one interesting thing about it. It isn't displayed to the actual visitors anywhere on the blog itself.

Instead, it sits in the root directory of the blog and serves only one purpose. It is the file that search engines look at before they start crawling the contents of a blog. And the reason for looking at it is to find information on what they should and shouldn't be crawling.

So in essence, by using this file you can inform search engines what you want them to index and rank, and what you DON'T want them to index and rank.

The truth is that not every page (or area) of a blog is worth ranking. As a webmaster or a person working with WordPress you have to be able to identify those areas and use robots.txt as a place where you can speak to search engines directly, and let them know what's going on.

Creating robots.txt for WordPress

First of all, let me tackle the actual guidelines which you can find at codex.wordpress.org � this page in particular: Robots.txt Optimization. There's an example file. Here's the thing … don't use it as a template!

I'm not saying that it's completely bad, but it can create a lot of problems for some WP blogs. It all depends on your settings. Things like permalinks, category and tag bases. That's why you need to create robots.txt for each individual blog and be careful when you're dealing with a template of any kind.

Things you should always block

There are some parts of every WP blog that should always be blocked: the "cgi-bin" directory and the standard WP directories.

The "cgi-bin" directory is present on every web server, and it's the place where CGI scripts can be installed and then ran. Nowadays, some servers don't even allow access to this directory, but it surely won't do you any harm to include it in the Disallow directives inside the robots.txt file.

There are 3 standard WP directories (wp-admin, wp-content, wp-includes). You should block them because, essentially, there's nothing there that search engines might consider being interesting.

But there's one exception. The wp-content directory has a subdirectory called "uploads". It's the place where everything you upload using WP media upload feature gets put. The standard approach here is to leave it unblocked.

Here are the directives to get the above done:

Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

Notice the small difference between the template at WP codex. They tell you to block “/wp-admin" (without the trailing "/" character). This can be problematic if you have your permalinks set to "/%postname%/" only. In this case every post with a slug beginning with "wp-admin-" won't get indexed.

I know that there's only a small group of bloggers that could have created such posts (the "blogging about WordPress" group), but as a WP developer you can't make any assumptions about what's going to happen on the blog you're working on after it takes off. That's why it's better to remember about the trailing "/" character here.

Things to block depending on your WP configuration

Every blog has a set of settings that are unique and need to be handled individually when creating the robots.txt file.

First thing is whether the blog uses categories or tags to structure the content … or both… or none.

In case you're using categories to structure your blog make sure that tag archives are blocked from search engines. To get it done first check what's the "tag base" for tag archives (Admin panel > Settings > Permalinks). If the field is blank then the base is "tag". Use this base and place it in a Disallow directive:

Disallow: /tag/

In case you're using tags to structure your blog make sure that category archives are blocked from search engines. Again, check the category base in the same place and then block it:

Disallow: /category/

In case you're using both categories and tags then don't do anything here.

In case you're using neither categories nor tags then block both of them by using their bases:

Disallow: /tag/
Disallow: /category/

Why should you bother? An honest question. The main reason here is the duplicate content issue. For example, if you're not using categories then your category archive looks exactly the same as your home page, i.e. there are two sites that are exactly the same but have different URLs:

yourdomain.com/
yourdomain.com/category/uncategorized

I'm sure I don't need to explain why that's bad. You have to make sure that such situation doesn't happen.

Next up is the authors' archive. If you're dealing with a single author blog then there's no point in keeping the authors' archive available to the search engines. It creates the same duplicate content issue as the tag-category thing. You can block author's archive by using:

Disallow: /author/

Files to block separately

WordPress uses a number of different files to display the content. Most of these don't need to be accessible via the search engines.

The list most often includes: PHP files, JS files, INC files, CSS files. You can block them by using:

Disallow: /index.php # separate directive for the main script file of WP
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$

(The "$" character matches the end of an URL string.)

However, be careful with this. It's not advised to block any other files (images, text files, etc.). That's because even if such a file is not placed in the uploads directory you probably still want it to be recognized by the search engines.

Note. If you used the "Allow: /wp-content/uploads/" line earlier on, then all PHP, JS, INC, and CSS files that are inside the uploads directory would still be visible to the search engines � nature of the Allow directive.

Things not to block

The final choice is of course up to you, but I would not block any images from Google image search. It can be done by a separate record:

User-agent: Googlebot-Image
Disallow:
Allow: / # not a standard use of this directive but Google prefers it this way here

Another robot to handle individually would by the Google AdSense robot, of course, only when you are a part of their program. In this case you need to make sure that it can see all the pages that your users can see. The easies way of doing this is by using a very similar record:

User-agent: Mediapartners-Google
Disallow:
Allow: /

Of course, the issue doesn't end with just these two examples. There are probably many more of them because every blog is different. Feel free to comment and point out some additional areas of a WP blog that shouldn't be blocked.

How to handle duplicate content

No matter what you do your blog will always have some duplicate content. It's just how WP is constructed, you can't really prevent it. But you can still use robots.txt to prevent search engines from accessing it.

There's a number of duplicate content areas on every blog, for instance:

Search results

This is what a search result page URL usually looks like for a WP blog:

yourdomain.com/?s=phrase

(Sometimes there're also some additional parameters after the search phrase.)

This is both duplicate content and content generated automatically � something Google really doesn't like. That's why it's good to block this by using:

Disallow: /*?

Apart from blocking the search results this directive blocks access to all URLs that include a question mark, but this shouldn't cause any problems when it comes to WordPress.

Trackback URLs

Some blogs use trackback URLs that are essentially duplicate content of the original post. Here's an example of a normal post's URL and its trackback URL:

yourdomain.com/some-post/
yourdomain.com/some-post/trackback/

To prevent search engines from accessing such content you can use:

Disallow: /trackback/
Disallow: */trackback/

Now why the duplicate statements? The fact is that the implementation of the Robot Exclusion Standard can vary for different robots. By using these two lines you can be sure that it's understandable for all of them.

RSS feeds

RSS feeds are just another example of content that's purely duplicate. You can eliminate it from search eng

阅读更多内容

该邮件由 QQ邮件列表 推送。
如果您不想继续收到该邮件,可点此 退订

No comments:

Post a Comment