An Introduction to the Semantic Web

The Semantic Web is a web of data. There is lots of data we all use every day, and most of it is not part of the web. I can see my bank statements on the web, and my photographs, and I can see my appointments in a calendar. But can I see my photos in a calendar to see what I was doing when I took them and on a map so I know where I took them? Can I see bank statement lines in a calendar? The answer, right now, is no.

But why not? Because we don’t have a web of data. Because data is controlled by applications, and each application keeps its data to itself; applications don’t like to share.

The original Web concentrated mainly on the interchange of documents. The Semantic Web, by contrast, is about two things: common formats for the integration and combination of data drawn from diverse sources, and a language for recording how that data relates to real-world objects. Together these allow a person, or a machine, to start off in one database and then move through an unending set of databases which are connected not by wires but by being about the same thing.

Tim Berners-Lee describes the Semantic Web vision as:

I have a dream for the Web [in which computers] become capable of analysing all the data on the Web, the content, links, and transactions between people and computers. A Semantic Web, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The intelligent agents people have touted for ages will finally materialise.

What are the ideas and technologies that facilitate this vision? Below I give an overview of a number of them, with links:

Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods. More specifically, Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.”

The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web.
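At its core, RDF reduces all information to three-part statements: a subject, a predicate, and an object, with subjects and predicates identified by URIs. The sketch below illustrates this shape in plain Python (the `example.org` resource URIs are invented for illustration; the FOAF vocabulary URI is real):

```python
# RDF reduces all knowledge to (subject, predicate, object) triples.
# Subjects and predicates are URIs; objects are URIs or literal values.
# The example.org resource URIs are made up; the FOAF namespace is real.

FOAF = "http://xmlns.com/foaf/0.1/"

triples = [
    ("http://example.org/people/alice", FOAF + "name", "Alice"),
    ("http://example.org/people/alice", FOAF + "knows",
     "http://example.org/people/bob"),
    ("http://example.org/people/bob", FOAF + "name", "Bob"),
]

# Because every statement has the same shape, triples from different
# sources can be merged by simple set union -- the heart of data
# integration on the Semantic Web.
merged = set(triples) | {("http://example.org/people/bob", FOAF + "age", "35")}
print(len(merged))  # 4 distinct statements
```

The uniformity is the point: two datasets that share URIs can be combined without any schema negotiation.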

The Resource Description Framework Schema (RDF-S) is a semantic extension of RDF that provides mechanisms for describing groups of related resources and the relationships between these resources.
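One thing RDF-S vocabulary such as `rdfs:subClassOf` buys you is inference: a processor can derive statements that were never written down. A toy illustration of that idea (the class hierarchy below is invented, and real RDF-S processors are far more general):

```python
# A toy RDF-S style inference: rdfs:subClassOf is transitive, so a
# resource typed as a subclass is also an instance of every superclass.
# The class hierarchy here is invented for illustration.

subclass_of = {
    "ex:Dachshund": "ex:Dog",
    "ex:Dog": "ex:Mammal",
    "ex:Mammal": "ex:Animal",
}

def superclasses(cls):
    """Walk the subClassOf chain, collecting every ancestor class."""
    found = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        found.append(cls)
    return found

# A resource typed ex:Dachshund is implicitly also a Dog, Mammal, Animal:
print(superclasses("ex:Dachshund"))  # ['ex:Dog', 'ex:Mammal', 'ex:Animal']
```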

The Resource Description Framework in Attributes (RDFa) allows authors to add meaning to web page elements. Using a few simple XHTML attributes, authors can mark up human-readable data with machine-readable indicators for browsers and other programs to interpret. A web page can include markup for items as simple as the title of an article, or as complex as a user’s complete social network.

The Friend of a Friend project is creating a Web of machine-readable pages describing people, the links between them and the things they create and do. FOAF is about your place in the Web, and the Web's place in our world. FOAF is a simple technology that makes it easier to share and use information about people and their activities (e.g. photos, calendars, weblogs), to transfer information between Web sites, and to automatically extend, merge and re-use it online.
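Because FOAF has no central database, descriptions of the same person published on different sites are merged using an identifying property such as `foaf:mbox` (email address). A minimal sketch of that merge, with invented data:

```python
# Two sites each publish a partial FOAF-style description of the same
# person. An identifying property (here the mbox email) lets a consumer
# "smush" them into one record. All data below is invented.

site_a = {"mbox": "mailto:alice@example.org",
          "name": "Alice",
          "homepage": "http://alice.example.org/"}
site_b = {"mbox": "mailto:alice@example.org",
          "weblog": "http://blog.example.org/alice"}

def smush(*docs):
    """Merge descriptions keyed by their identifying mbox property."""
    people = {}
    for doc in docs:
        person = people.setdefault(doc["mbox"], {})
        person.update(doc)
    return people

merged = smush(site_a, site_b)
print(sorted(merged["mailto:alice@example.org"]))
# ['homepage', 'mbox', 'name', 'weblog']
```

This is the "automatically extend, merge and re-use" property in miniature: neither site needed to know the other existed.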

The OWL Web Ontology Language is designed for use by applications that need to process the content of information instead of just presenting information to humans. OWL facilitates greater machine interpretability of Web content than that supported by XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics.

The Dublin Core set of metadata elements provides a small and fundamental group of text elements through which most resources can be described and catalogued. Using only 15 base text fields, a Dublin Core metadata record can describe physical resources such as books, digital materials such as video, sound, image, or text files, and composite media like web pages. Metadata records based on Dublin Core are intended to be used for cross-domain information resource description and have become standard in the fields of library science and computer science. Implementations of Dublin Core typically make use of XML and are Resource Description Framework (RDF) based.
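The appeal of Dublin Core is that the same 15-element record shape describes a book, a video, or a web page. A sketch of what such a record looks like (the resource described below is invented; the element names are the real Dublin Core set):

```python
# The 15 Dublin Core element names are real; the record describing an
# (invented) resource uses a subset of them, as most records do.

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

record = {
    "title": "An Introduction to the Semantic Web",
    "creator": "Example Author",     # invented
    "date": "2009-01-01",            # invented
    "type": "Text",
    "format": "text/html",
    "language": "en",
}

# Cross-domain use falls out of the design: a validator checks only the
# element names, never the kind of resource being described.
assert set(record) <= DC_ELEMENTS
print(f"{len(DC_ELEMENTS)} elements, {len(record)} used")
```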

A triplestore is a purpose-built database for the storage and retrieval of Resource Description Framework (RDF) metadata.

Much like a relational database, a triplestore stores information and retrieves it via a query language, in this case SPARQL. Unlike a relational database, a triplestore is optimised for the storage and retrieval of many short statements called triples, in the form of subject-predicate-object, like “Bob is 35” or “Bob knows Fred”.

SPARQL is an RDF query language, which can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be results sets or RDF graphs.
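The essence of both ideas, a store of short statements plus pattern queries over them, can be sketched in a few lines of Python. This is only a toy stand-in for a real triplestore and for SPARQL's graph-pattern matching, using the Bob and Fred statements from above:

```python
# A toy triplestore: a list of (subject, predicate, object) statements,
# queried by pattern matching. None plays the role of a SPARQL
# ?variable and matches anything. Real stores index heavily; this
# linear scan is only for illustration.

store = [
    ("Bob", "age", "35"),
    ("Bob", "knows", "Fred"),
    ("Fred", "age", "40"),
]

def match(store, s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly SELECT ?who WHERE { ?who age ?age } in SPARQL terms:
print([t[0] for t in match(store, p="age")])  # ['Bob', 'Fred']
```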

SKOS (the Simple Knowledge Organization System) is a family of formal languages designed for representing thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary. SKOS is built upon RDF and RDF-S, and its main objective is to enable easy publication of controlled structured vocabularies for the Semantic Web.

A PURL is a type of Uniform Resource Locator (URL) that does not directly describe the location of the resource to be retrieved but instead describes an intermediate, more persistent location which, when retrieved, results in redirection (e.g. via a 302 HTTP status code) to the current location of the final resource.

PURLs are an interim measure, while Uniform Resource Names (URNs) are being mainstreamed, to solve the problem of transitory URIs in location-based URI schemes like HTTP.
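The mechanism behind a PURL is simple: the server keeps a table from persistent paths to current locations and answers with a redirect. A toy resolver illustrating the idea (paths and target URLs below are invented):

```python
# A toy PURL resolver: the persistent path never changes, while the
# table entry behind it can be updated whenever the resource moves.
# All paths and target URLs here are invented for illustration.

purl_table = {
    "/net/example/intro": "http://host1.example.org/docs/intro.html",
}

def resolve(path):
    """Return (status, location) as a PURL server would."""
    if path in purl_table:
        return (302, purl_table[path])  # redirect to the current home
    return (404, None)

# When the document moves, only the table changes; links to the PURL
# keep working unchanged:
purl_table["/net/example/intro"] = "http://host2.example.org/intro/"
print(resolve("/net/example/intro"))
# (302, 'http://host2.example.org/intro/')
```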

OpenCalais is a rapidly growing toolkit of capabilities that allow you to readily incorporate state-of-the-art semantic functionality within your blog, content management system, website or application.

The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit. Using Natural Language Processing (NLP), machine learning and other methods, Calais analyses your document and finds the entities within it. Calais goes beyond classic entity identification, returning the facts and events hidden within your text as well.

If you have any more suggestions that should be included above, I’ll be happy to hear them.

Are you building something interesting?

Get in touch