Optimise Your URLs for Web Crawlers and Indexing

Many questions about website architecture, crawling and indexing, and even ranking issues can be boiled down to one central issue: How easy is it for search engines to crawl your site?

The Internet is not simply a big place; it is a huge place, and new content is being created all the time. Google, Yahoo and Microsoft each have finite resources, so when faced with the nearly infinite quantity of content that’s available online, their crawlers are only able to find and crawl a percentage of that content. Then, of all the content they’ve crawled, they’re only able to index a portion. Of course, as storage gets cheaper, the search engines are able to index more and more content each day, but not at the pace the Web is growing.

URLs are like the bridges between your website and a search engine’s crawler: crawlers need to be able to find and cross those bridges (i.e., find and crawl your URLs) in order to get to your site’s content. If your URLs are complicated or redundant, crawlers are going to spend time tracing and retracing their steps; if your URLs are organised and lead directly to distinct content, crawlers can spend their time accessing your content rather than crawling through empty pages, or crawling the same content over and over via different URLs.

So, what can you do as a website developer or owner to reduce that labyrinth of URLs and help crawlers find more of your content faster? Below are a few ideas:

  • Remove unnecessary query string details from the URL.
    Parameters in the URL that don’t change the content of the page–like session IDs or list sort orders–can be removed from the URL and put into a cookie. By putting this information in a cookie and 301 redirecting to a clean URL, you retain the information and reduce the number of URLs pointing to that same content.
  • Stop infinite pagination in, for example, lists and calendars.
    If you have a calendar with infinite past and future dates or a list with infinite pagination, you have what is described as an infinite crawl space, which is a huge burden on crawlers. To resolve the calendar issue, you can add the rel="nofollow" attribute to links to dynamically created future calendar pages. When creating pagination links, disable the previous and next links when the first and last pages are reached, and redirect users to an appropriate page if the query string in the URL is hacked (this may be a static ‘page not found’ page).
  • Utilise the robots.txt file to prevent actions the web crawlers can’t or shouldn’t perform.
    Using a robots.txt file, you can disallow crawling of login pages, contact forms, shopping carts, and other pages whose sole functionality is something that a crawler can’t and shouldn’t perform. This lets crawlers spend more of their time crawling content that they can actually do something with.
  • Prevent duplicate content.
    An ideal scenario for crawlers is a one-to-one link between content and a URL. Each URL leads to a unique piece of content and each piece of content can be accessed by a unique URL. The closer your site can get to this scenario, the more streamlined your site will be for crawling and indexing. If your CMS makes this difficult to achieve, you can use the canonical tag (a link element with rel="canonical") to indicate a preferred URL for duplicate content.
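
As a rough illustration of the first three ideas, here is a minimal Python sketch using only the standard library. It is not tied to any particular CMS or framework: the parameter names (sessionid, sort), the example.com URLs, the robots.txt rules and the helper function names are all assumptions made up for this example, and a real site would wire the results into its own redirect and templating logic.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser

# Assumed names of parameters that do not change the page content.
NON_CONTENT_PARAMS = {"sessionid", "sort"}


def clean_url(url):
    """Return the URL with non-content query parameters stripped.

    A site could 301 redirect to this URL after storing the removed
    values in a cookie.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def resolve_page(raw_page, last_page):
    """Return a valid page number, or None if the value was tampered with.

    None signals that the caller should serve a 'page not found' page.
    """
    try:
        page = int(raw_page)
    except (TypeError, ValueError):
        return None
    return page if 1 <= page <= last_page else None


def pagination_links(page, last_page):
    """Return previous/next page numbers, or None where the link is disabled."""
    return {
        "previous": page - 1 if page > 1 else None,
        "next": page + 1 if page < last_page else None,
    }


# Assumed robots.txt rules that keep crawlers away from action-only pages.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /login",
    "Disallow: /cart",
]

rules = RobotFileParser()
rules.parse(ROBOTS_LINES)


def is_crawlable(url):
    """Check whether the assumed rules above allow any crawler to fetch url."""
    return rules.can_fetch("*", url)
```

With these assumptions, clean_url("https://example.com/list?sessionid=abc&page=2&sort=asc") keeps only the page parameter, resolve_page rejects out-of-range or non-numeric values, and is_crawlable lets you sanity-check your robots.txt rules before deploying them.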

More information on this topic can be found on the Google Webmaster Central Blog.

View Comments

  1. Regarding ‘Stop infinite pagination in, for example, lists and calendars.’ – This is assuming the robots respect the rel=”nofollow” links – there’s a lot of bad robots on the web right now, some that don’t respect the robots.txt either.

  2. Simon says:

    @todd you do have a good point, albeit to a certain degree we should help the “good” crawlers out there and ignore the “bad”.

    Having said that, it is probably better to prevent lists and calendars from scrolling infinitely by removing links and preventing URL hacking, since the “bad” crawlers could put unnecessary load on your servers.
