Thursday 16 February 2012

Basic SEO Troubleshooting With XML Sitemaps

Despite the fact that sitemaps are simply lists of canonical URLs submitted to search engines, it's amazing how rarely you come across a perfect one. Issues often arise when the owners of large sites use sitemap auto-generation tools that aren't configured properly. These sites typically present challenges for search engine crawlers, such as pagination and URLs generated by faceted navigation.

Spiders decide which pages to crawl based on URLs placed in a queue from previous crawls, and that list is augmented with URLs from XML sitemaps. Sitemaps can therefore be a key factor in ensuring search crawlers access and assess the content most eligible to appear in search engine results.

The following is a quick overview of search engine sitemap guidelines and limitations, followed by a technique to help identify crawling and indexation issues using multiple sitemaps in Google Webmaster Tools.

Bing & Google Guideline & Limitation Overview

The sitemap protocol was adopted as a standard by the major search engines in 2006. Since then, Bing and Google have developed useful Webmaster Tools dashboards to help site owners identify and fix errors.

Of the two search engines, Bing has a particularly low threshold, or at least they state outright that they begin devaluing sitemaps if 1 percent of the URLs result in an error (return anything other than a 200 status code).

Google provides clear guidelines, limitations, and a more robust error-reporting system in its webmaster dashboard. In addition to submitting quality sitemaps, ensure that files stay within the following hard limits, which apply to Google:
  • Limit sitemaps to 50,000 URLs
  • File size should be under 50MB
  • 500 sitemaps per account
Both search engines support sitemap index files, which make it easier to submit several sitemap files of any type at once rather than submitting each one individually.
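To make that structure concrete, here is a minimal Python sketch (not from the article; the domain, file names, and URL list are illustrative assumptions) that splits a list of canonical URLs into sitemap files under the 50,000-URL limit and writes a sitemap index referencing each one:

```python
# Minimal sketch: split a list of canonical URLs into sitemap files of at
# most 50,000 entries and write a sitemap index that references each file.
# The domain and file names are hypothetical examples.
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit noted above

def write_sitemaps(urls, base="http://www.example.com"):
    sitemap_files = []
    for start in range(0, len(urls), MAX_URLS_PER_SITEMAP):
        chunk = urls[start:start + MAX_URLS_PER_SITEMAP]
        filename = f"sitemap-{len(sitemap_files) + 1}.xml"
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
        sitemap_files.append(filename)

    # The index lists every sitemap file so they can all be submitted
    # at once instead of one by one.
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for filename in sitemap_files:
            f.write(f"  <sitemap><loc>{base}/{filename}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")
```

The resulting sitemap-index.xml is then the single file submitted to the search engines.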

Basic Sitemap Optimization

Basic sitemap optimization should include checking for pages that are:
  • Duplicated (the same URL appearing in multiple sitemaps is OK)
  • Returning status code errors - 3XX, 4XX, and 5XX
And any pages that specify:
  • Rel=canonical tags that are not self-referential
  • Noindex meta robots tags
Tools like the Screaming Frog SEO crawler can quickly parse the URLs contained within XML sitemap files and surface this information.
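If you would rather script this check yourself, the sketch below is one rough way to do it, assuming the requests library and crude regex-based HTML checks (a real audit would use a proper HTML parser and handle edge cases such as attribute order):

```python
# Minimal audit sketch: pull URLs out of a sitemap file and flag the
# issues listed above. Regexes stand in for a real HTML parser, so treat
# this as a starting point rather than a production tool.
import re
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(sitemap_path):
    tree = ET.parse(sitemap_path)
    urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", NS)]
    seen = set()

    for url in urls:
        if url in seen:
            print(f"DUPLICATE in sitemap: {url}")
            continue
        seen.add(url)

        resp = requests.get(url, allow_redirects=False, timeout=10)
        if resp.status_code != 200:
            # 3XX, 4XX, and 5XX responses all count against the sitemap
            print(f"{resp.status_code}: {url}")
            continue

        html = resp.text
        # Assumes rel comes before href; a parser would be more robust.
        canonical = re.search(
            r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', html, re.I)
        if canonical and canonical.group(1).rstrip("/") != url.rstrip("/"):
            print(f"NON-SELF-REFERENTIAL CANONICAL: {url} -> {canonical.group(1)}")

        robots = re.search(
            r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)', html, re.I)
        if robots and "noindex" in robots.group(1).lower():
            print(f"NOINDEX: {url}")
```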

Using Google Webmaster Tools

Once comprehensive and quality XML sitemaps have been submitted to Google and Bing, breaking up sitemaps into categories can provide further insight into crawling and indexation issues.
[Image: multiple sitemaps in Google Webmaster Tools]
A great place to start is by breaking up sitemaps by page type. Sitemaps can be diced up in any way that makes sense for the feedback you want; the main goal is to expose any areas of a site with a low indexation rate.
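One simple way to produce those page-type sitemaps is to partition the master URL list by path. In the sketch below, the path prefixes (/products/, /category/, /blog/) are hypothetical; substitute whatever page types your site actually has:

```python
# Minimal sketch: partition a master list of canonical URLs by page type
# (guessed here from the URL path) so each group can be written to its own
# sitemap and submitted separately. The path patterns are hypothetical.
from urllib.parse import urlparse

PAGE_TYPES = {
    "products": "/products/",
    "categories": "/category/",
    "blog": "/blog/",
}

def partition_by_page_type(urls):
    groups = {name: [] for name in PAGE_TYPES}
    groups["other"] = []
    for url in urls:
        path = urlparse(url).path
        for name, prefix in PAGE_TYPES.items():
            if path.startswith(prefix):
                groups[name].append(url)
                break
        else:
            groups["other"].append(url)
    return groups

# Each group can then be written out with write_sitemaps() from the earlier
# sketch, one sitemap per page type, and the per-sitemap indexation rate
# compared in the Webmaster Tools dashboard.
```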

Once an area has been identified, you can begin tracking down the source of the issue. Using Fetch as Googlebot to identify uncrawlable content and links is often very helpful.