Optimise Your URLs for Web Crawlers and Indexing

by Simon. Average Reading Time: almost 3 minutes.

Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue: How easy is it for search engines to crawl your site?

The Internet is not simply a big place, it is a huge place, and new content is being created all the time. Google, Yahoo and Microsoft each have finite resources, so when faced with the nearly infinite quantity of content that’s available online, their crawlers are only able to find and crawl a percentage of that content. Then, of all the content they’ve crawled, they’re only able to index a portion. Of course, as storage gets cheaper the search engines are able to index more and more content each day, but not at the pace the Web is growing.

URLs are like the bridges between your website and a search engine’s crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site’s content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organised and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.

So, what can you do as a website developer or owner to reduce that labyrinth of URLs and help crawlers find more of your content faster? Below are a few ideas:

  • Remove unnecessary query string details from the URL.
    Parameters in the URL that don’t change the content of the page–like session IDs or list sort orders–can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a clean URL, you retain the information and reduce the number of URLs pointing to that same content.
  • Stop infinite pagination in, for example, lists and calendars.
    If you have a calendar with infinite past and future dates, or a list with infinite pagination, you have what is described as an infinite crawl space, which is a huge burden on crawlers. To resolve the calendar issue, you can add rel="nofollow" attributes to links to dynamically created future calendar pages. When creating pagination links, disable the previous and next links when the first and last pages are reached. If someone edits the page number in the query string to a page that doesn’t exist, redirect them to an appropriate page (this may be a static “page not found” page).
  • Utilise the robots.txt file to prevent actions the web crawlers can’t or shouldn’t perform.
    Using a robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can’t and shouldn’t perform. This lets crawlers spend more of their time crawling content that they can actually do something with.
  • Prevent duplicate content.
    An ideal scenario for crawlers is a one-to-one mapping between content and URLs. Each URL leads to a unique bit of content and each piece of content can be accessed by a unique URL. The closer your site can get to this scenario, the more streamlined your site will be for crawling and indexing. If your CMS makes this difficult to achieve, you can use the canonical tag (a link element with rel="canonical") to indicate a preferred URL for duplicate content.
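To make the first idea more concrete, here is a minimal Python sketch of the URL clean-up step. It only uses the standard library; the parameter names in NON_CONTENT_PARAMS and the function names are my own assumptions for illustration, and storing the removed values in a cookie is left to whatever framework you use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names: list whichever ones never change your page content.
NON_CONTENT_PARAMS = {"sessionid", "sort"}


def clean_url(url: str) -> str:
    """Return the URL without parameters that do not change the page content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def redirect_target(url: str):
    """Return the clean URL to 301 redirect to, or None if the URL is already clean."""
    cleaned = clean_url(url)
    return cleaned if cleaned != url else None
```

A request handler could call redirect_target() on the incoming URL, copy the stripped values into a cookie, and answer with a 301 to the returned URL whenever it is not None.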
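For the pagination idea, the check can be as small as the sketch below. It assumes the page number arrives as a raw string from the query string and that you know the total number of items; resolve_page and the default of 20 items per page are hypothetical choices, not part of any particular CMS.

```python
import math


def resolve_page(raw_page: str, total_items: int, per_page: int = 20):
    """Validate a page number taken from the query string.

    Returns (page, prev_page, next_page). page is None when the value is
    not a page that exists, so the caller can redirect or show a
    "page not found" page. prev_page and next_page are None when there is
    nothing to link to, so those links can be disabled.
    """
    last_page = max(1, math.ceil(total_items / per_page))
    try:
        page = int(raw_page)
    except (TypeError, ValueError):
        return None, None, None
    if page < 1 or page > last_page:
        return None, None, None
    prev_page = page - 1 if page > 1 else None
    next_page = page + 1 if page < last_page else None
    return page, prev_page, next_page
```

Because the previous and next links are only rendered when they are not None, a crawler that follows links never leaves the range of real pages, and a hand-edited page number falls through to the not-found branch.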
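For the robots.txt idea, it helps to test your rules before publishing them. The sketch below embeds an example robots.txt (the paths are hypothetical, so substitute your own) and checks it with Python's standard urllib.robotparser module; is_crawlable and the ExampleBot user agent are names made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Example rules only; the paths are assumptions and should match your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /contact
Disallow: /cart
"""


def is_crawlable(path: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the example rules allow user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Note that Disallow rules match by prefix, so "/cart" also blocks "/cart/checkout" (and anything else that starts with "/cart"). As with nofollow, this only helps with crawlers that choose to respect the file.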

More information on this topic can be found on the Google Webmaster Central Blog.

  • Regarding ‘Stop infinite pagination in, for example, lists and calendars.’ – This assumes the robots respect rel="nofollow" links. There are a lot of bad robots on the web right now, and some don’t respect robots.txt either.

  • @todd you do have a good point, although to a certain degree we should help the “good” crawlers out there and ignore the “bad”.

    Having said that, it is probably better to prevent lists and calendars from scrolling infinitely by removing links and preventing URL hacking, since the “bad” crawlers could put unnecessary load on your servers.