A Guide to the Future of XML, Web Services, and Knowledge Management; by Michael C Daconta, Leo J Obrst and Kevin T Smith, 2003. Available from Amazon.

Very propagandistic for the semantic web. Ah well. I have mainly noted the digital preservation aspects, ignoring topics like thesauri and ontologies.

Software companies like to keep data internal to their applications for competitive reasons. Binary formats lock you into applications for the life of the data. But the balance of power is shifting from apps towards data, driven by the interoperability demands of the Internet.

XML: the current primary driver behind XML’s rise and rise is data exchange between and within organisations. The reasons for XML’s success include:

  • it creates application-independent documents and data – it is plain text in human readable form
  • it supplies a standard syntax for metadata
  • it supplies a standard structure for documents and data
  • it is already well accepted and proven
  • computers are now powerful enough to cope with verbose XML statements
  • XML can be easily searched, unlike binary files.
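
To illustrate the last point, here is a minimal sketch using only Python's standard library. The catalogue document, its element names and the `find_titles` helper are all invented for this example; they are not taken from the book.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; element and attribute names are invented for illustration.
CATALOGUE = """<catalogue>
  <record id="r1"><title>Annual report 2002</title><format>PDF</format></record>
  <record id="r2"><title>Board minutes</title><format>XML</format></record>
  <record id="r3"><title>Annual report 2003</title><format>XML</format></record>
</catalogue>"""

def find_titles(xml_text, fmt):
    """Return the titles of all records whose <format> element equals fmt."""
    root = ET.fromstring(xml_text)
    return [rec.findtext("title")
            for rec in root.findall("record")
            if rec.findtext("format") == fmt]

print(find_titles(CATALOGUE, "XML"))
```

Because the markup is plain text with named elements, the search needs no knowledge of the application that produced the file, only of the element names.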

But XML isn’t enough on its own. XML only enables interoperability. People still need to know and understand the element names (which is why semantic web heads spend so long on ontologies). People have to put in extra effort. We as people need to be able to tag our information with machine-understandable markup, to know what information is authentic, and to correlate it with information we already have. Machines can only work with well-defined problems on well-defined data.

How XML works: XML is not a language, but a set of syntax rules for creating semantically rich markup languages in other domains.

  • Data: the raw, context-specific values.
  • Metadata: the semantics (meaning or purpose) of those values. XML provides a simple way to encode metadata.
  • Document: a combination of content and presentation. XML enables metadata markup on both, thereby bridging the gap. The data/presentation split is crucial: the book calls this the “Model-View-Controller paradigm” (MVC) but perhaps this is just “separation of concerns.”
  • Element: a container comprising start tag, content, and end tag. The content can include sub-elements.
  • Attribute: a name plus a quoted value, eg src="image.gif". An element can have more than one attribute.
  • Tagged content divides the document into semantic parts.
  • Well-formed: the XML document obeys all the XML syntax rules. Well-formedness is mandatory.
  • Valid: the XML document references and satisfies a schema. It is optional, but really all XML documents should be checked for validity before transfer elsewhere. Validation allows true interoperability, and allows XML documents to be broken up into XML fragments for independent use.
  • Schema: a separate document which defines the legal elements, attributes and structure of an XML instance document. As such it acts as (a) a template to generate instances, and (b) a validator to ensure document accuracy. DTD (part of the XML 1.0 recommendation) has a non-XML syntax, lacks data types (eg string, boolean, date etc), and lacks support for namespaces. XML Schema fixes all of these. An instance document that is to be validated declares which DTD or schema it adheres to: a DTD through a DOCTYPE declaration, an XML Schema through a special attribute in the root element, “xsi: …” Schemas are tricky things to write, but by agreeing on a standard schema organisations can produce documents which can be validated, transmitted and parsed by any application regardless of hardware or operating system. [AA later note: METS is an XML Schema.]
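  • Namespace: a globally unique name (in practice a URI) that qualifies our markup language’s elements and attributes, so that they cannot clash with same-named ones from another vocabulary.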
  • Stylesheet: allows us to specify how an XML document can be presented in different media formats. A stylesheet engine takes an XML document, loads it into its Document Object Model (DOM) alongside the stylesheet, and spits out the resulting document. XSL provides a mechanism for transforming XML documents into other XML documents (XSLT) and a vocabulary for formatting objects (XSL-FO). A stylesheet can even transform XML into executable code. There’s a top diagram on p. 125.
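
Several of the terms above (element, attribute, namespace, well-formed) can be seen in a few lines of Python. This is only a sketch using the standard library's `xml.etree.ElementTree`; the gallery vocabulary, the namespace URI and the `is_well_formed` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and vocabulary, invented for this example.
NS = "http://example.org/ns/gallery"

DOC = f"""<gallery xmlns="{NS}">
  <image src="image.gif" width="120">
    <caption>Front cover</caption>
  </image>
</gallery>"""

def is_well_formed(xml_text):
    """True if the text parses as XML, ie it obeys the syntax rules."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(DOC)
image = root.find(f"{{{NS}}}image")          # element, looked up by namespace-qualified name
print(image.tag)                             # the tag carries the namespace URI
print(image.attrib)                          # attributes: name/value pairs
print(image.find(f"{{{NS}}}caption").text)   # content of a sub-element

print(is_well_formed(DOC))
print(is_well_formed("<image><caption>oops</image>"))  # mismatched end tag
```

Note that this only checks well-formedness. Checking validity against a DTD or XML Schema needs a validating parser, which the standard library module used here is not.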

PDF: this book says Adobe is (2003) restructuring PDF around RDF, so that the information within a PDF file can be understood even if the software doesn’t know what a PDF is or how to display it.

RDF: Resource Description Framework. An XML-based language for describing resources, usually files on the web. RDF creates metadata about the document as a standalone entity. It is especially good for ‘opaque’ resources like images or audio. Dublin Core was originally an RDF application.
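
As a rough sketch of what such a description looks like, the following builds a small RDF/XML record for an image using two Dublin Core elements. The two namespace URIs are the standard RDF and Dublin Core ones; the `describe` helper, the resource URL and the values are hypothetical, and this is not an example from the book.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

def describe(resource_url, title, creator):
    """Build an RDF/XML description of one resource using Dublin Core elements."""
    # Ask the serialiser to use the conventional prefixes instead of ns0, ns1.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:about names the resource that the metadata is about.
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": resource_url})
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}creator").text = creator
    return ET.tostring(rdf, encoding="unicode")

# Hypothetical resource and values.
xml_text = describe("http://example.org/images/cover.gif",
                    "Front cover", "A. Photographer")
print(xml_text)
```

The metadata lives outside the image file itself, which is why this approach suits resources whose content a machine cannot read directly.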