The first step to increasing your site’s visibility on the top search engines such as Google, Yahoo! and MSN is to help their respective robots crawl and index your site.

To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file. Conversely and importantly, webmasters can also notify the search engines about the existence and importance of pages with a sitemap.xml file. (Both files are placed in the root directory of the domain.)

Fortunately for the webmaster, the major search engines provide various tools to help manage both Sitemap and Robot files.

To gain an understanding of both ‘protocols’, I’ll discuss them briefly below.

Sitemaps (Inclusion Protocol)

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, MSN, Yahoo! and Ask now use the same protocol, a single Sitemap keeps the biggest search engines informed about updated pages.

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. By submitting Sitemaps to a search engine, a webmaster is only helping that engine’s crawlers to do a better job of crawling their site(s). Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results.

The following is a cut-down version of the sitemap.xml for this website. WordPress, via a plugin, automatically updates this file each time a new post or page is written.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>http://www.simonwhatley.co.uk/</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://www.simonwhatley.co.uk/big-city-little-people</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.1</priority>
</url>
</urlset>
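
If you are not using a plugin, a file like the one above can be generated with a few lines of script. The following is only a minimal sketch in Python using the standard library; the build_sitemap function and the shape of its entries argument are my own illustrative assumptions, not part of the Sitemaps protocol or of any WordPress plugin.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return a sitemap document as a string.

    `entries` is a list of dicts (a hypothetical structure chosen for
    this sketch) with a required "loc" key and optional "lastmod",
    "changefreq" and "priority" keys.
    """
    # Serialise the Sitemaps namespace as the default (unprefixed) one.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        # Emit the child elements in the order the protocol lists them.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                child = ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}")
                child.text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {
            "loc": "http://www.simonwhatley.co.uk/",
            "lastmod": "2008-10-08T14:50:16+00:00",
            "changefreq": "daily",
            "priority": "1.0",
        },
    ]))
```

Only loc is required for each URL; the other three elements are optional hints, which is why the sketch simply skips any that are missing.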

More information about sitemaps can be found on the Sitemaps.org website.

Robots (Exclusion Protocol)

The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorise and archive web sites. The standard complements Sitemaps, a robot inclusion standard for websites.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorisation of the site as a whole.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

For example, the following tells all crawlers not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
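
To see how a cooperating robot would interpret these rules, you can feed them to the robots.txt parser in Python's standard library. This is just a sketch: the is_allowed helper, the ExampleBot name and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/report.html"):
        url = "http://www.example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Remember that this only tells you what a well-behaved robot will do; nothing stops a badly behaved one from ignoring the file.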

Exclusion can also be achieved on a page-level basis using a meta tag placed in the HTML head of a web page. The robots meta tag controls whether search engine spiders are allowed to index a page and whether they should follow links from it.

A common example could be as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-GB" xml:lang="en"> 
 <head profile="http://gmpg.org/xfn/11"> 
	<title>Simon Whatley</title>
	<meta name="robots" content="index,follow" />
</head>
<body>
</body>
</html>
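
If you want to check which directives a page actually declares, the tag can be read back with Python's built-in HTML parser. This is a rough sketch that assumes the tag is written in the name="robots" form; the RobotsMetaParser class and robots_directives function are names I have invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex, nofollow".
        self.directives.extend(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the list of robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="index,follow" /></head></html>'
    print(robots_directives(page))
```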

A word of caution, though: meta tags are not the best option for preventing search engines from indexing content on your website.

More information about Robots.txt files can be found on the Robotstxt.org website.

Webmaster Tools

The top 3 search providers all have their own webmaster tools admin interface. The Google offering is the most advanced, but it’s good practice to use and submit information to all three.

Links to their services are provided below:

Ask doesn’t have an interface. However, you can still ping their Submission Service using the URL http://submissions.ask.com/ping?sitemap= in conjunction with your sitemap URL.
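
Putting the ping URL together is a simple matter of URL-encoding your sitemap address and appending it to the endpoint. The sketch below only builds the URL and does not send anything; the build_ping_url helper is hypothetical, and the sitemap location is assumed to be sitemap.xml in the root of the domain.

```python
from urllib.parse import quote

# Endpoint quoted in the text above.
ASK_PING_ENDPOINT = "http://submissions.ask.com/ping?sitemap="


def build_ping_url(sitemap_url, endpoint=ASK_PING_ENDPOINT):
    """Return the ping URL for `sitemap_url` without requesting it."""
    # safe="" also encodes ":" and "/" so that the sitemap address
    # survives intact as a single query-string value.
    return endpoint + quote(sitemap_url, safe="")


if __name__ == "__main__":
    # Assumed sitemap location, for illustration only.
    print(build_ping_url("http://www.simonwhatley.co.uk/sitemap.xml"))
```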

Further Information

With the advent of Google Chrome, there has been a lot of media coverage regarding the browser’s uptake and how it will compete with Internet Explorer, Firefox and Safari. This is where the User Agent becomes most valuable. It can be used in analytics software to determine the browser share and consequently aid the development of the website.

But what is a User Agent? A User Agent is the client application used with a particular network protocol; the phrase is most commonly used in reference to those which access the Web. Web user agents range from web browsers and e-mail clients to search engine crawlers (spiders), as well as mobile phones, screen readers and braille browsers used by people with disabilities. When Internet users visit a web site, a text string is generally sent to identify the user agent to the server. This forms part of the HTTP request, prefixed with user-agent: and typically includes information such as the application name, version, host operating system, and language. Bots, such as web crawlers, often also include a URL and/or e-mail address so that the webmaster can contact the operator of the bot.
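
To make that concrete, here is roughly how a small bot might identify itself when building a request. It is a sketch using Python's standard library: the ExampleBot name, its version and the contact URL are all invented, and the request is only constructed, never sent.

```python
from urllib.request import Request


def build_request(url, bot_name, bot_version, contact_url):
    """Build (but do not send) a request carrying a bot-style User-Agent."""
    # Product token first, then a comment with a contact URL so that
    # the webmaster can reach the operator of the bot.
    user_agent = f"{bot_name}/{bot_version} (+{contact_url})"
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    request = build_request(
        "http://www.example.com/",
        "ExampleBot",          # hypothetical bot name
        "1.0",                 # hypothetical version
        "http://www.example.com/bot.html",  # hypothetical contact page
    )
    # urllib stores header names in capitalised form ("User-agent").
    print(request.get_header("User-agent"))
```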

By simply typing about:version into Chrome’s address bar you will be presented with the following information:

Google Chrome
0.2.149.29 (1798)
Official Build
Google Inc.
Copyright © 2006-2008 Google Inc. All Rights Reserved.
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13

As you can see, Chrome’s version information provides limited detail about the browser. The last line is the important one. It is the value of the HTTP User-Agent header:

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13.

If you know the RFC 2616 specification on the HyperText Transfer Protocol (which, incidentally, I gladly don’t), you would know that the product tokens which make up the User Agent string should be short and to the point:

Product tokens SHOULD be short and to the point. They MUST NOT be used for advertising or other non-essential information. Although any token character MAY appear in a product-version, this token SHOULD only be used for a version identifier (i.e., successive versions of the same product SHOULD only differ in the product-version portion of the product value).

Clearly this isn’t the case! One of Google’s reasons for creating the Chrome browser was to start afresh. It would therefore have been truly amazing if they had made the string simply Chrome/0.2.149.29.

Unfortunately, browser sniffing makes an ever-growing UA string the path of least resistance for browser vendors.

So, what does Chrome’s User Agent string actually mean?

  • Mozilla/ - This means that the browser has the kind of capabilities that Netscape 1.1 had compared to Mosaic and Lynx.
  • 5.0 - This means that the browser engine is from the post-Browser War Web Standards era as opposed to being from the Browser War era.
  • (Windows; - This means that the general windowing system flavor the browser runs on is Windows (as opposed to, for example, Macintosh and X11).
  • U; - This means that the browser has at least the level of cryptographic capability / encryption strength that U.S. versions of browsers had in the late 1990s.
  • Windows NT 6.0; - This indicates the operating system the browser is running on. In this instance, the browser is running on Vista.
  • en-US) - This indicates the user interface language of the browser (U.S. English in this case). This may be used to choose between different content languages even though HTTP has a different header for that purpose.
  • AppleWebKit/ - This indicates that the engine of the browser is WebKit as opposed to being Gecko. Developers should not do user agent sniffing as a rule, but if they still do, this is what they should be sniffing.
  • 525.13 - This is the WebKit version from which Chrome branched its copy. Site admins could use this to detect old versions with known bugs.
  • (KHTML, like Gecko) - This introduces the substring Gecko into the UA string while pointing out to human readers that WebKit was forked from KHTML. Without this substring, Chrome might be put in the same category as IE and Netscape 4.
  • Chrome/ - This string identifies the browser as actually Google Chrome.
  • 0.2.149.29 - This is the Chrome version. This could be used to detect old versions with known bugs.
  • Safari/ - This means that the browser is like Safari as opposed to being like Firefox.
  • 525.13 - This just repeats the WebKit version in order to have some version but not the irrelevant Safari.app version.
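
The breakdown above can be reproduced mechanically by splitting the string into product tokens and parenthesised comments. The following is a simplified sketch rather than a full RFC 2616 parser (it does not handle nested comments, for instance), and the parse_user_agent function is my own invention.

```python
import re

# Chrome's User Agent string as reported by about:version.
CHROME_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
    "AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13"
)

TOKEN_PATTERN = re.compile(
    r"\((?P<comment>[^)]*)\)"          # a parenthesised comment
    r"|(?P<product>[^\s/()]+)"         # or a product name ...
    r"(?:/(?P<version>[^\s()]+))?"     # ... with an optional /version
)


def parse_user_agent(user_agent):
    """Split a UA string into ({product: version}, [comment parts])."""
    products = {}
    comments = []
    for match in TOKEN_PATTERN.finditer(user_agent):
        if match.group("comment") is not None:
            # Comment fields are separated by semicolons.
            parts = [p.strip() for p in match.group("comment").split(";")]
            comments.append(parts)
        else:
            products[match.group("product")] = match.group("version")
    return products, comments


if __name__ == "__main__":
    products, comments = parse_user_agent(CHROME_UA)
    print(products)
    print(comments)
```

As noted in the list, sniffing is best avoided altogether; if you must do it, looking up the AppleWebKit entry in the products mapping is a safer bet than matching on the Safari or Mozilla tokens.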

On 2nd September 2008 Google launched a new open source browser project named Chrome.

UPDATE: You can download the beta project from the Google Chrome website.

Instead of me talking you through the project, Google and Scott McCloud have put together a cool little cartoon.

However, as a brief summary:

  • Google Chrome is Google’s open source browser project.
  • The browser will use the popular WebKit HTML rendering engine used in Safari and Adobe AIR.
  • The browser will include a brand new JavaScript Virtual Machine called V8.
  • The browser will include Gears to allow developers to enhance the user experience.
  • Google Chrome will use special tabs, like more traditional browsers, but set above the address and menu bar.
  • Each browser tab will run in its own process. If one tab fails for some reason, the whole browser will not need to be restarted, so valuable work in the other tabs is not lost. This is similar to functionality found in Internet Explorer 8.
  • The browser has an address bar which includes a more intuitive auto-completion feature called ‘omnibox’. It is said to be less ‘irritating’ than current auto-complete/suggest functionality common to Firefox 3 or Google Suggest.
  • As a default homepage Chrome presents you with a kind of speed dial feature, similar to the one found in Opera.
  • Chrome has a privacy mode, which allows you to create an incognito window and nothing that occurs in that window is ever logged on your computer. Again, this is similar to functionality found in Internet Explorer 8.
  • Web apps can be launched in their own browser window without address bar and toolbar, much like Mozilla’s Prism project.
  • To fight malware and phishing attempts, Chrome constantly downloads lists of harmful sites.

You can find out more information from Google’s blog post on the subject.