Library of Congress site on the Sustainability of Digital Formats
May 15, 2008 in Projects | Tags: audio, compression, file formats, metadata, pdf
This site contains some detailed technical analyses of file formats suitable for long-term preservation, as viewed through the lens of the LOC’s collecting policy. The site was compiled by Caroline R. Arms and Carl Fleischhauer, and is clearly the result of a great deal of thought and research. Site accessed 15 May 2008.
Overview
The purpose of the LOC’s Digital Formats website is to support the long-term preservation of digital objects by (a) identifying formats that are promising for sustainability, (b) identifying other formats which are not promising and which therefore need alternative strategies for content preservation, and thereby (c) recommending which formats to use when building up a collection. The site concentrates on technical aspects of file formats. It is concerned with the formats associated with media-independent (intangible) digital content, in other words content that is managed as files and which is independent of a particular physical medium; this rules out formats associated with media-dependent (tangible) digital content, such as DVDs, audio CDs and videotape formats.
“The Semantic Web”
December 21, 2007 in Books, Reviews | Tags: metadata, pdf, XML
A Guide to the Future of XML, Web Services, and Knowledge Management; by Michael C Daconta, Leo J Obrst and Kevin T Smith, 2003. Available from Amazon.
Very propagandistic for the semantic web. Ah well. I have mainly noted the digital preservation aspects, ignoring stuff like thesauri and ontologies.
Software companies like to keep data internal to their applications for competitive reasons. Binary formats lock you into applications for the life of the data. But the balance of power is shifting from apps towards data, driven by the interoperability demands of the Internet.
XML: the current primary driver behind XML’s rise and rise is data exchange between and within organisations. The reasons for XML’s success include:
- it creates application-independent documents and data – it is plain text in human readable form
- it supplies a standard syntax for metadata
- it supplies a standard structure for documents and data
- it is already well accepted and proven
- computers are now powerful enough to cope with verbose XML statements
- XML can be easily searched, unlike binary files.
But XML isn’t enough on its own. XML only enables interoperability. People still need to know and understand the element names (which is why semantic web heads spend so long on ontologies). People have to put in extra effort. We as people need to be able to tag our information with machine-understandable markup, to know what information is authentic, and to correlate it with information we already have. Machines can only work with well-defined problems on well-defined data.
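The point about element names can be illustrated with a minimal Python sketch using only the standard library. The two records and their element names (`creator`, `author`) are invented for illustration and do not come from the book. Both documents parse equally well, but code written for one vocabulary finds nothing in the other unless a person supplies the mapping.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records describing the same book with different vocabularies.
RECORD_A = "<record><creator>Michael C Daconta</creator></record>"
RECORD_B = "<record><author>Michael C Daconta</author></record>"

# A human-supplied mapping: this is the knowledge that XML alone does not carry.
AUTHOR_ELEMENT_NAMES = ("creator", "author")


def find_author(xml_text, names=AUTHOR_ELEMENT_NAMES):
    """Return the text of the first child element whose name is in `names`, else None."""
    root = ET.fromstring(xml_text)
    for name in names:
        element = root.find(name)
        if element is not None:
            return element.text
    return None
```

Calling `find_author(RECORD_B, names=("creator",))` returns `None` even though the author is plainly there for a human reader, which is roughly the gap that agreed vocabularies and ontologies are meant to close.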
How XML works: XML is not a language in itself, but a set of syntax rules for creating semantically rich markup languages for particular domains.
- Data: the raw, context-specific values.
- Metadata: the semantics (meaning or purpose) of those values. XML provides a simple way to encode metadata.
- Document: a combination of content and presentation. XML enables metadata markup on both, thereby bridging the gap. The data/presentation split is crucial: the book calls this the “Model-View-Controller paradigm” (MVC) but perhaps this is just “separation of concerns.”
- Element: a container comprising start tag, content, and end tag. The content can include sub-elements.
- Attribute: a name plus a quoted value, e.g. src="image.gif". An element can have more than one attribute.
- Tagged content divides the document into semantic parts.
- Well-formed: the XML document obeys all of XML’s syntax rules. Well-formedness is mandatory.
- Valid: the XML document references and satisfies a schema. It is optional, but really all XML documents should be checked for validity before transfer elsewhere. Validation allows true interoperability, and allows XML documents to be broken up into XML fragments for independent use.
- Schema: a separate document which defines the legal elements, attributes and structure of an XML instance document. As such it acts as (a) a template to generate instances, and (b) a validator to ensure document accuracy. DTD (part of the XML 1.0 recommendation) itself has non-XML syntax, lacks data types (e.g. string, boolean, date etc), and lacks support for namespaces. XML Schema fixes all these. An instance document declares which DTD or schema it adheres to: a DTD is referenced from a DOCTYPE declaration, and an XML Schema by a special attribute in the root element, “xsi: …” Schemas are tricky things to write, but by agreeing on a standard schema organisations can produce documents which can be validated, transmitted and parsed by any application regardless of hardware or operating system. [AA later note: METS is an XML Schema.]
- Namespace: a globally unique identifier (typically a URI) that qualifies the names of our markup language’s elements and attributes, so that they cannot clash with the same names from another vocabulary.
- Stylesheet: allows us to specify how an XML document can be presented in different media formats. A stylesheet engine takes an XML document, loads it into its Document Object Model (DOM) alongside the stylesheet, and spits out the resulting document. XSL provides a mechanism for transforming XML documents into other XML documents (XSLT) and a vocabulary for formatting objects (XSL-FO). A stylesheet can even transform XML into executable code. There’s a top diagram on p. 125.
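As a rough illustration of the element, attribute and well-formedness entries above, here is a small Python sketch using the standard library's ElementTree parser. The sample documents and the helper names `is_well_formed` and `describe` are my own, not the book's. Note that this parser only checks well-formedness; checking validity against a DTD or XML Schema would need a separate validating tool.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: <image> is an element, src is one of its attributes.
WELL_FORMED = '<page><title>Notes</title><image src="image.gif"/></page>'
# Not well-formed: the <title> start tag has no matching end tag.
BROKEN = '<page><title>Notes<image src="image.gif"/></page>'


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def describe(xml_text):
    """Return (tag, attributes, text) for every element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib), el.text) for el in root.iter()]
```

Here `is_well_formed(BROKEN)` is `False` because a parser must refuse the document outright, whereas `describe(WELL_FORMED)` lists the `page`, `title` and `image` elements along with the `src` attribute.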
PDF: the book says that Adobe was (as of 2003) restructuring PDF around RDF, so that the information within a PDF file can be understood even if the software doesn’t know what a PDF is or how to display it.
RDF: Resource Description Framework. An XML-based language for describing resources, usually files on the web. RDF creates metadata about the document as a standalone entity. It is especially good for ‘opaque’ resources like images or audio. Dublin Core was originally an RDF application.
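To make the idea of describing an ‘opaque’ resource more concrete, here is a minimal sketch in Python that writes an RDF/XML description using Dublin Core element names. The function name `describe_resource`, the example URL and the property values are assumptions made up for illustration; only the two namespace URIs are the published RDF and Dublin Core ones. It is a sketch of the general shape, not a complete or authoritative RDF serialiser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def describe_resource(url, properties):
    """Build an RDF/XML string describing `url` with simple Dublin Core properties.

    `properties` maps Dublin Core element names (e.g. "title") to string values.
    """
    # Ask ElementTree to use readable prefixes instead of generated ones.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # rdf:about names the resource that the metadata is about.
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": url}
    )
    for name, value in properties.items():
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `describe_resource("http://example.org/photo.jpg", {"title": "Harbour at dusk", "format": "image/jpeg"})` yields a small standalone XML document about the image, without the image file itself needing to carry any readable text.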
