The first step to increasing your site’s visibility on the top search engines such as Google, Yahoo! and MSN is to help their respective robots crawl and index your site.

To avoid undesirable content in the search indexes, webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file. Conversely and importantly, webmasters can also notify the search engines about the existence and importance of pages with a sitemap.xml file. (Both files are placed in the root directory of the domain.)

Fortunately for the webmaster, the major search engines provide various tools to help manage both Sitemap and Robot files.

To gain an understanding of both ‘protocols’, I’ll discuss them briefly below.

Sitemaps (Inclusion Protocol)

The Sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs in the site. This allows search engines to crawl the site more intelligently. Sitemaps are a URL inclusion protocol and complement robots.txt, a URL exclusion protocol.

The webmaster can generate a Sitemap containing all accessible URLs on the site and submit it to search engines. Since Google, MSN, Yahoo! and Ask now use the same protocol, a single Sitemap gives the biggest search engines up-to-date information about your pages.

Sitemaps supplement and do not replace the existing crawl-based mechanisms that search engines already use to discover URLs. By submitting Sitemaps to a search engine, a webmaster is only helping that engine’s crawlers to do a better job of crawling their site(s). Using this protocol does not guarantee that web pages will be included in search indexes, nor does it influence the way that pages are ranked in search results.

The following is a cut-down version of the sitemap.xml for this website. WordPress, via a plugin, automatically updates this file each time a new post or page is written.

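<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">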
<url>
<loc>http://www.simonwhatley.co.uk/</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>http://www.simonwhatley.co.uk/big-city-little-people</loc>
<lastmod>2008-10-08T14:50:16+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.1</priority>
</url>
</urlset>
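
If you don't have a plugin to do this for you, a file like the one above can be produced with a few lines of script. The following is only a minimal sketch in Python using the standard library; the function name build_sitemap and the shape of the entries list are my own assumptions for illustration, not part of the Sitemaps protocol or of any WordPress plugin.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return a <urlset> document as a string (hypothetical helper).

    Each entry is a dict with a required 'loc' key and optional
    'lastmod', 'changefreq' and 'priority' keys.
    """
    # Declare the Sitemaps namespace as the default namespace of the root.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the child elements in the same order as the example above.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

entries = [
    {
        "loc": "http://www.simonwhatley.co.uk/",
        "lastmod": "2008-10-08T14:50:16+00:00",
        "changefreq": "daily",
        "priority": "1.0",
    },
    {
        "loc": "http://www.simonwhatley.co.uk/big-city-little-people",
        "lastmod": "2008-10-08T14:50:16+00:00",
        "changefreq": "monthly",
        "priority": "0.1",
    },
]

sitemap_xml = build_sitemap(entries)
print(sitemap_xml)
```

The output would then be saved as sitemap.xml in the root directory of the domain.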

More information about sitemaps can be found on the Sitemaps.org website.

Robots (Exclusion Protocol)

The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorise and archive web sites. The standard complements Sitemaps, a robot inclusion standard for websites.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorisation of the site as a whole.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

For example, the following tells all crawlers not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
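
To see how a cooperating robot would interpret these rules, you can feed them to the robots.txt parser in Python's standard library. This is a rough sketch only: the is_allowed helper, the "ExampleBot" user agent and the www.example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules permit user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for path in ("/index.html", "/private/notes.html", "/images/logo.png"):
    url = "http://www.example.com" + path
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Remember that this only shows what a well-behaved robot would do; nothing stops a badly behaved one from ignoring the file altogether.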

Exclusion can also be achieved on a page-by-page basis using a meta tag placed in the HTML head of a web page. The robots meta tag controls whether search engine spiders are allowed to index a page, and whether they should follow links from that page.

A common example could be as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-GB" xml:lang="en"> 
 <head profile="http://gmpg.org/xfn/11"> 
	<title>Simon Whatley</title>
	<meta name="robots" content="index,follow" />
</head>
<body>
</body>
</html>
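
If you want to check which directives a page actually declares, the tag can be read back with an HTML parser. Below is a minimal sketch using Python's standard library, assuming the usual name="robots" form of the tag; the class RobotsMetaParser and the function robots_directives are names I have invented for this example.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())

def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

page = """<html><head>
<title>Simon Whatley</title>
<meta name="robots" content="noindex, follow" />
</head><body></body></html>"""

print(robots_directives(page))
```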

A word of caution, though: meta tags are not the best option for preventing search engines from indexing the content of your website.

More information about Robots.txt files can be found on the Robotstxt.org website.

Webmaster Tools

The top three search providers each have their own webmaster tools admin interface. The Google offering is the most advanced, but it’s good practice to use and submit information to all three.

Links to their services are provided below:

Ask doesn’t have an interface. However, you can still ping their Submission Service using the URL http://submissions.ask.com/ping?sitemap= in conjunction with your sitemap URL.
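
As the sitemap address is passed as a query-string value, it is safest to URL-encode it first. The sketch below only builds the ping URL and does not send any request; the function name ask_ping_url is hypothetical, and the sitemap address is assumed to be sitemap.xml in the root of this domain.

```python
from urllib.parse import quote

# Ping endpoint quoted in the text above.
ASK_PING_ENDPOINT = "http://submissions.ask.com/ping?sitemap="

def ask_ping_url(sitemap_url):
    """Return the ping URL for a given sitemap address (hypothetical helper)."""
    # safe="" makes quote() escape ':' and '/' as well, so the whole
    # sitemap address travels as a single query-string value.
    return ASK_PING_ENDPOINT + quote(sitemap_url, safe="")

# Assumed sitemap location, for illustration only.
print(ask_ping_url("http://www.simonwhatley.co.uk/sitemap.xml"))
```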

Further Information

The Web Standards Project (WaSP) is to expand its scope of collaboration with Adobe to advance web standards. Having successfully completed its initial goals for assisting Adobe’s Dreamweaver team in supporting Web standards, the Web Standards Project’s Dreamweaver Task Force will be renamed the Adobe Task Force to reflect its widened scope. The Adobe Task Force will collaborate with Adobe on all of the company’s products that output code or content to the Web, and will continue to advocate compliance with Web Standards and accessibility guidelines by those who use Adobe’s products to design and build Web sites and applications.

You can read the full press release on the Web Standards Project website.

Widening the collaboration between standards experts, who are also product experts, and Adobe is an exciting step forward in the maturation of the Web. This will hopefully lead to full standards support not only in Adobe products such as Dreamweaver and AIR, but also in those of leading browser and web editor suppliers such as Mozilla, Microsoft and Apple.

Companies need to make the most of Web 2.0, and web content management, collaboration and networking tools can help firms meet user demand for interactive websites. These tools are not restricted to the standard content management systems (CMS) used to publish text to a website; they also include file sharing, information sharing and instant messaging, among others.

Effective web content management requires the capability for business leaders to take full control of the web as an interactive platform, rather than just treating it as another publishing medium. Keeping website visitors satisfied is a tough job. Currently, few corporate websites succeed with static, lifeless pages that lack interactivity. In contrast, pioneering websites such as Amazon, Google and eBay set users’ expectations high with their compelling and dynamic content.

Because of these pioneering websites, the average visitor now expects targeted and personalised interactions with each and every company with which they come into contact on the web. In recent years the web content management franchise has expanded significantly beyond the 1990s paradigm of creation, management and publishing of content and other ‘resources’. As a result the tools are changing.

Ismael Chang Ghalimi has created an interesting list entitled Office 2.0 at IT|Redux. On this list, Ismael details a wide variety of web-based business tools, from bookmarking to business intelligence, calendars to contacts, databases to development tools, and beyond. What this list demonstrates is a shift towards new ways of managing data, of personalising and targeting content, and of handling each and every interaction.

A recent survey from the Economist Intelligence Unit found that, despite early scepticism, “serious businesses” are starting to see that social networking technologies are not just for consumer sites such as YouTube and Facebook, but may also provide a major way for other brands to attract new customers and boost revenue.

A compelling web experience is no longer based around simple web interactions, but around interactive tools. The uptake of these tools, however, has been limited, and we are only just seeing applications such as wikis and blogs join the corporate fold and become generally accepted business tools.

Thousands of businesses worldwide face the challenge of establishing their web presence, a goal that is difficult to achieve without efficient website development and testing tools. If someone were to ask you how good your website was, how would you answer? Could you answer? There are so many factors to take into consideration, such as code validation, speed of download, accessibility and usability, that there is no one correct answer and, consequently, no one website that can provide you with the definitive answer.

This article was inspired by a great blog post at Aviva Directory, entitled Grade Your Website: 31 Free Online Tests.

Below is a compendium of tools I use on a regular basis to test the websites I work on, based on Aviva Directory’s headings (incidentally, they list the same tools I use regularly):

Code Validation

The WDG HTML Validator is an excellent tool for identifying syntax errors on pages driven by markup languages. There is also an option to recursively check for errors on every page in the website directory, which is invaluable when checking large, dynamic websites.

The W3C Link Checker searches for and identifies broken links for a given URL. Specifically, the tool checks that all the links can be dereferenced and that no links or anchors are defined twice, and it warns about invalid HTTP and directory redirects.

Accessibility

Watchfire’s WebXACT is a must-use tool for all serious designers and developers. The tool lets you test single pages and generates a very detailed report on the quality, accessibility and privacy of a website.

Speed

Web Page Analyzer from Website Optimization is an excellent tool that calculates page size, composition and download time. The script calculates the size of individual elements and sums up each type of web page component (objects, CSS, images etc). Based on these page characteristics, the script then offers advice on how to improve page load time, drawing on best-practice website optimisation techniques for its recommendations.
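
The arithmetic behind this kind of report is simple enough to sketch. The Python below is my own rough illustration, not the Web Page Analyzer's actual code: it sums component sizes by type and estimates download time from a given connection speed, ignoring latency and the number of requests. The function name summarise_page and the sample figures are invented for the example.

```python
from collections import defaultdict

def summarise_page(components, bandwidth_kbps):
    """Summarise page weight by component type (hypothetical helper).

    components is a list of (component_type, size_in_bytes) tuples and
    bandwidth_kbps is the connection speed in kilobits per second.
    Returns (bytes per type, total bytes, estimated seconds to download).
    """
    totals = defaultdict(int)
    for kind, size in components:
        totals[kind] += size
    total_bytes = sum(totals.values())
    # 8 bits per byte, 1000 bits per kilobit; latency is deliberately ignored.
    seconds = total_bytes * 8 / (bandwidth_kbps * 1000)
    return dict(totals), total_bytes, seconds

# Invented figures, for illustration only.
components = [
    ("html", 20000),
    ("css", 10000),
    ("images", 50000),
    ("images", 20000),
]

by_type, total_bytes, seconds = summarise_page(components, bandwidth_kbps=1000)
print(by_type, total_bytes, round(seconds, 2))
```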

Browser Simulator

Browsershots is a tool, created by Johann C. Rocholl, that takes screenshots of your website in various browsers and platforms, including Firefox and Internet Explorer on Windows, Firefox and Safari on Mac OS X, and Iceweasel and Konqueror on Linux. When the user submits a URL, it is added to a job queue. Unfortunately, the queue can require you to wait up to three hours before retrieving your screenshots, but the results provide a clear indication of how the website will be received by different user setups.

Search Engine Optimisation (SEO)

SEO Workers SEO Analysis Tool is an extremely useful tool that analyses an assortment of page features including meta tags, keyword density and load time. A user simply submits a URL for testing and the report is returned.
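
Keyword density, one of the measures mentioned above, is simply the share of a page's words taken up by a given term. The following is a minimal sketch of that calculation in Python; it is my own simplification, not the method used by the SEO Workers tool, and the function name keyword_density is made up for the example.

```python
import re
from collections import Counter

def keyword_density(text, top=3):
    """Return the most frequent words with their share of all words (hypothetical helper)."""
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top)]

sample = "Sitemap robots sitemap crawl"
print(keyword_density(sample, top=1))
```

A real analysis would also strip markup and ignore common stop words before counting.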