The first step to increasing your site’s visibility on the top search engines such as Google, Yahoo! and MSN is to help their respective robots crawl and index your site.

To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file. Conversely and importantly, webmasters can also notify the search engines about the existence and importance of pages with a sitemap.xml file. (Both files are placed in the root directory of the domain.)

Fortunately for the webmaster, the major search engines provide various tools to help manage both Sitemap and Robot files.

To gain an understanding of both ‘protocols’, I’ll discuss them briefly below.

Sitemaps (Inclusion Protocol)

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, MSN, Yahoo!, and Ask use the same protocol now, having a Sitemap would let the biggest search engines have the updated pages information.

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. By submitting Sitemaps to a search engine, a webmaster is only helping that engine’s crawlers to do a better job of crawling their site(s). Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results.

The following is a cut-down version of the sitemap.xml for this website. WordPress, via a plugin, automatically updates this file each time a new post or page is written.

<urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>http://www.simonwhatley.co.uk/</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
http://www.simonwhatley.co.uk/big-city-little-people
</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.1</priority>
</url>
</urlset>

More information about sitemaps can be found on the Sitemaps.org website.

Robots (Exclusion Protocol)

The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorise and archive web sites. The standard complements Sitemaps, a robot inclusion standard for websites.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorisation of the site as a whole.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

For example, the following tells all crawlers not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Exclusion can also be achieved on a page-level basis using a Meta-tag. This is a tag that would be placed in the HTML head of of a web page. The robots attribute controls whether search engine spiders are allowed to index a page, or not, and whether they should follow links from a page, or not.

A common example could be as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-GB" xml:lang="en"> 
 <head profile="http://gmpg.org/xfn/11"> 
	<title>Simon Whatley</title>
	<meta http-equiv="robots" content="index,follow" />
</head>
<body>
</body>
</html>

A word of caution though, Meta tags are not the best option to prevent search engines from indexing content of your website.

More information about Robots.txt files can be found on the Robotstxt.org website.

Webmaster Tools

The top 3 search providers all have their own webmaster tools admin interface. The Google offering is the most advanced, but it’s good practice to use and submit information to all three.

Links to their services are provided below:

Ask doesn’t have an interface. However, you can still ping their Submission Service using the URL http://submissions.ask.com/ping?sitemap= in conjunction with your sitemap URL.

Further Information

by Dan Wellman

Synopsis

Learning the Yahoo! User Interface Library book coverThe Yahoo! User Interface (YUI) Library is a set of utilities and controls, written in JavaScript, for building richly interactive web applications using techniques such as DOM scripting, DHTML, and AJAX. The YUI Library also includes several core CSS resources. All components in the YUI Library have been released as open source under a BSD License and are free for all uses.

This book covers all released components whether utility, control, core file, or CSS tool. Methods of the YAHOO Global Object are used and discussed throughout the book. The basics of each control are presented, along with a detailed example showing its use to create complex, fully featured, cross-browser, Web 2.0 user interfaces.

Besides giving you a deep understand of the YUI library, this book aims to expand your knowledge of object-oriented JavaScript programming, as well as strengthen your understanding of the DOM and CSS.

The core aim is to teach you how to create a number of powerful JavaScript controls that can be used straight away in your own applications.

Download

Download the latest YUI version, including full API documentation and more than 250 functional examples from Sourceforge.

The library’s developers blog frequently at the YUI Blog and the YUI Library community exchanges ideas at YDN-JavaScript on Yahoo! Groups.

Book Review

The Yahoo! User Interface Library sits comfortably amongst its peers, which, along with many others, include Prototype, jQuery and Mootools. Arguably it can be said that the YUI library is the king among the JavaScript and CSS-libraries. With a vast number of well documented examples and near 100% compatibility amongst modern browsers, it would be difficult to find a comparable library.

It is one thing to be a well documented library, but it is another to know how to use the libraries to construct a user interface. This is the niche Dan Wellman fills with his book. Although not necessarily for the beginner, since you need a knowledge of CSS, JavaScript and a little AJAX, Wellman does a good job of explaining the concepts, especially AJAX, from scratch.

Wellman provides an A-to-Z of the library and assumes, rightly, that the reader has little or no knowledge of the library. To that effect, he does a long introduction of the YUI, following an overall review of its components, listing them in the first chapter. He then picks up a selection of some of the most established utilities, for example navigation, animation and AJAX utilities and in the following chapters he covers one or two examples for each of them.

Importantly, the book teaches the reader how to not only use the DOM manipulation and event handling aspects of the library, but also the CSS tools of the library.

Wellman does a good job of introducing the technical aspects at the beginning of each chapter, but not dwelling too long before moving on to real usage and methods.

What I would have liked to have seen is more interaction between different components written about in the book. Clearly building a fully-featured application that incorporates most or all of the key components would be unweildy, but individual and isolated examples doesn’t equate real-world scenarios either. For example, it is quite conceivable that autocomplete and drag-and-drop components would be utilised on the same page; it would have been good if Wellman had explained the pains or pitfalls that may be encountered with such combinations. The negativity aside, the examples are of a good quality.

The book does contain a number of errors, but since this is the first edition you can probably forgive the editors from missing them.

A major gripe I have with this book, indeed all technical books is the lack of colour throughout. It is far easier to read and understand the example code when code colouring is employed, allowing for easier understanding of the key elements in the code. Surely modern publishing techniques can mitigate against the extra cost of colour. Indeed, I would pay more for a well-written coloured technical book.

A great summary chapter on graceful degredation versus progressive enhancement would also have been welcomed, since many developers may not consider the usability and accessibility issues of using JavaScript.

This book is certainly a good read for anyone who has basic knowledge of JavaScript, HTML and CSS and who wants to learn how to apply the YUI library in their projects, making them more interactive for the user.