Tag Archive
PREMIS Data Dictionary 2.0
June 13, 2008 in Uncategorized | Tags: metadata | 1 comment
The PREMIS (PREservation Metadata: Implementation Strategies) Editorial Committee has published version 2.0 of the PREMIS Data Dictionary for preservation metadata. PREMIS is an international initiative which has created a common metadata standard for long-term digital preservation with a broad area of application. The PREMIS Data Dictionary for Preservation Metadata Version 2.0 is available for free download here.
The first Data Dictionary, containing helpful recommendations and hints, was published in May 2005, and comprises specific elements considered necessary for preserving digital content. The dictionary is independent of any specific architecture, and it omits any detailed metadata about areas already covered in existing metadata standards (such as description, rights and agents). The PREMIS Data Dictionary defines “preservation metadata” as the information a repository uses to support the digital preservation process.
Library of Congress site on the Sustainability of Digital Formats
May 15, 2008 in Projects | Tags: audio, compression, file formats, metadata, pdf | Leave a comment
This site contains some detailed technical analyses of file formats suitable for long-term preservation, as viewed through the lens of the LOC’s collecting policy. The site was compiled by Caroline R. Arms and Carl Fleischhauer, and is clearly the result of a great deal of thought and research. Site accessed 15 May 2008.
Overview
The purpose of the LOC’s Digital Formats website is to support the long term preservation of digital objects by (a) identifying formats promising for sustainability, (b) identifying other formats which are not promising and which therefore need alternative strategies for content preservation, and thereby (c) recommending which formats to use when building up a collection. The site concentrates on technical aspects of file formats. The site is concerned with the formats associated with media-independent (intangible) digital content, in other words content that is managed as files and which is independent of a particular physical medium; this rules out formats associated with media-dependent (tangible) digital content, such as DVDs, audio CDs, videotape formats.
Notes on METS
February 27, 2008 in standards | Tags: metadata, XML | Leave a comment
METS (the Metadata Encoding and Transmission Standard) is an XML Schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. The METS format was designed to allow the sharing of information management tools and services, and to facilitate the interoperable exchange of digital materials among institutions. The Schema was first brought out in 2001.
e-GMS: the UK eGovernment Metadata Standard
February 11, 2008 in standards | Tags: metadata, XML | Leave a comment
The e-GMS, part of the e-GIF, is an application profile of the Dublin Core Metadata Element Set. It replaced the older e-GMF. Version 2 came out in December 2003, and is available here.
The standard consists of mandatory, recommended and optional metadata elements.
Mandatory and Recommended Metadata Elements
Creator
An entity responsible for making the content of the resource. Not to be confused with Publisher: the Creator is responsible for the intellectual or creative content of the resource, whereas the publisher simply makes the resource available. Include the full hierarchy, e.g. department, division, section, team. Give full contact details if possible, especially when they are not to be given elsewhere, i.e. where the creator is different from the publisher/distributor.
Borghoff et al: content migration
January 14, 2008 in Books | Tags: metadata, migration | Leave a comment
Notes from Chapter 3 of Borghoff et al.
Migration never ends. This means that organisations have to establish a proper strategy, and try to routinize it as much as possible. It helps too to migrate documents into a standard format, or as few standard formats as possible. There are fewer standard formats than non-standard formats, so the difficulty of the task is reduced. But information losses are often unavoidable. You could migrate within a product family (eg from Word 6.0 to Word 2003, say) but that depends on SW developers being committed to supporting legacy versions, which isn’t mandatory [AA: see my notes on backwards compatibility, for example.]
Advantages of migration are:
- end users can view all documents on current rendition systems; they do not have to learn ancient apps and interfaces
- the data in migrated documents can be copied into and manipulated by the end user’s own SW
- less staff training needed
- we’ll never ‘forget’ about the context or meaning of some documents, as the migration process forces us to consider the documents every few years
- presentation of document data may even improve.
Disadvantages:
- difficult to automate, especially for compound documents
- errors accumulate
- potential loss of authenticity, and ‘look and feel’
- expensive, as all documents in a given class have to be migrated at each step.
Best to continue to store the original bitstream alongside the migrated version. (AA: this is also what other people have suggested.) Anyone who’s really desperate for authenticity can then pay to have the original app emulated. The MD needs to contain detailed migration history data.
Borghoff et al: Long-Term Preservation of Digital Documents: Chapter 1
January 13, 2008 in Books | Tags: authenticity, costs, DIT, emulation, metadata, migration | Leave a comment
Notes from Borghoff et al. General intro to the field. Defines long term as “over 50 years.” In principle, bitstreams can be kept indefinitely. But the bitstream is inaccessible without an adequate rendition system, and this is what causes the problems. Often only the creating app can render the stream, but the app is itself dependent on a specific hardware/OS combination. A rendition system is a complex organism comprising the hardware layer, the OS and driver programs, and the presentation layer. Changes in one component prompt changes in the others.
Character sets like ASCII seem to be pretty long-lived.
Advantages of migration include:
- ICT people know it and use it themselves, so there’s lots of experience
- migrated documents are available on the current system, by definition
- it satisfies the contemporary user’s needs and expectations.
Disadvantages of migration:
- it often results in minor adjustments to the document, reducing authenticity
- it cannot usually be automated, so it needs checking. [AA: is this true? Xena software?]
Advantages of emulation:
- high authenticity
- should result in relatively small costs per document
Disadvantages of emulation:
- you have to have the HW spec, this is crucial. And the complete SW bitstream. And all the manuals, too. [AA: hope they are clearly written!]
- emulators are complicated things to write
- there may be SW licensing and copyright costs.
Other bits in this chapter which interested me:
- Changing documents to a restricted set of standard formats is a possibility, but you will lose some information, just as you would when converting from one WP format to another. On the other hand, the costs of digital preservation are proportionate to the number of formats involved, so the fewer the formats the better.
- Because metadata is useful for document retrieval, it should ideally be stored separately from the document itself. The MD should be stored in a simple character set like ASCII. [AA: or even on paper?] The very minimum MD set should be (a) author, title, date etc, (b) subject area and keywords for retrieval, (c) location of the document, (d) encoding and data format, (e) migration history and description of the original environment, (f) legal and access issues.
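A rough sketch of what that minimum MD set might look like when kept apart from the document as plain ASCII text. The field names, the JSON layout and the sample values are all my own assumptions for illustration; Borghoff et al. do not prescribe any particular encoding.

```python
import json


def build_metadata_record(author, title, date, subject, keywords, location,
                          encoding, data_format, migration_history,
                          original_environment, access_notes):
    """Group the minimum metadata set (a)-(f) into one plain dictionary.

    The key names are hypothetical; only the six groups come from the notes.
    """
    return {
        "identification": {"author": author, "title": title, "date": date},  # (a)
        "retrieval": {"subject": subject, "keywords": list(keywords)},       # (b)
        "location": location,                                                # (c)
        "technical": {"encoding": encoding, "data_format": data_format},     # (d)
        "provenance": {                                                      # (e)
            "migration_history": list(migration_history),
            "original_environment": original_environment,
        },
        "legal_and_access": access_notes,                                    # (f)
    }


def to_ascii_text(record):
    """Serialise the record; ensure_ascii escapes any non-ASCII character."""
    return json.dumps(record, indent=2, sort_keys=True, ensure_ascii=True)


# Invented sample values, purely for illustration.
record = build_metadata_record(
    author="A. N. Author",
    title="Annual report",
    date="1998-03-31",
    subject="finance",
    keywords=["report", "accounts"],
    location="store/reports/1998/annual.doc",
    encoding="n/a (binary format)",
    data_format="Word 2003 (.doc)",
    migration_history=[
        {"date": "2008-01-13", "from": "Word 6.0", "to": "Word 2003"},
    ],
    original_environment="Word 6.0 on Windows 3.1",
    access_notes="Open access; copyright held by the creating department",
)
text = to_ascii_text(record)
print(text)
```

The text could be written to a file that sits beside the document rather than inside it, so the metadata stays readable even if the document format does not.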
Spreadsheet authenticity
January 10, 2008 in Alans thoughts | Tags: file formats, metadata, migration | 2 comments
Preserving a spreadsheet on its own isn’t enough. As with word processed documents, you need to keep metadata about the spreadsheet, capturing information about:
· the department or organisation which created it
· the date it was created
· the business function which the spreadsheet was supposed to carry out
· other spreadsheets or files which the current spreadsheet linked to
· the file format it is in, and the application used to create it
· the preservation log details, such as when it was accepted into the archive, and by whom.
AHDS think it is mandatory to keep information about the purpose and content of the spreadsheet as a whole, each worksheet, each column (plus data type), each row, and ‘coding scheme’ (?what that?).
To be honest it depends on the sort of spreadsheet you want to preserve. About 90% of all Excel spreadsheets are just used by people as a kind of list manager. So if your spreadsheet is a list of all your favourite Bob Dylan songs, then your spreadsheet simply isn’t going to include any functions or charts or macros, and preservation becomes a bit simpler. But if your spreadsheet contains VAT calculation data for your business for the past ten years then you need to adopt a more rigorous approach.
Tips
· Never set a password to open the spreadsheet – the password will be lost.
· Use headers and footers to record metadata too.
· Use standard fonts, not strange ones.
· Be consistent over time and date formats.
· State the currency in the cell name (it may get lost or replaced in migration).
· Do not convey information through formatting, such as cell or font colour, or font style.
· DPT recommend setting up Excel templates for spreadsheets-identified-for-preservation, to include such things as meaningful human-readable names for rows and columns, fields for consistent metadata entry etc.
· Once you’ve migrated, you need to check it – count the number of rows and columns, check the dates are ok, check that variables look like what they are supposed to and so on.
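The row-and-column count in the last tip is easy to sketch in code. This assumes both the original and the migrated sheet have been exported to CSV text; the function names and the sample data are mine, not from DPT or AHDS, and a real check would also look at dates and data types.

```python
import csv
import io


def table_shape(csv_text):
    """Return (row_count, widest_row_length) for a block of CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    width = max((len(row) for row in rows), default=0)
    return len(rows), width


def compare_shapes(original_csv, migrated_csv):
    """Return a list of differences in row/column counts (empty if none)."""
    problems = []
    o_rows, o_cols = table_shape(original_csv)
    m_rows, m_cols = table_shape(migrated_csv)
    if o_rows != m_rows:
        problems.append(f"row count changed: {o_rows} -> {m_rows}")
    if o_cols != m_cols:
        problems.append(f"column count changed: {o_cols} -> {m_cols}")
    return problems


# Invented sample: the migrated copy has silently lost its last row.
original = "song,album,year\nHurricane,Desire,1976\nJokerman,Infidels,1983\n"
migrated = "song,album,year\nHurricane,Desire,1976\n"

print(compare_shapes(original, migrated))  # ['row count changed: 3 -> 2']
print(compare_shapes(original, original))  # []
```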
Questions – do XML and tab-delimited formats keep the Comments?
JSA paper re Paradigm (Personal Archives Accessible in DIGital Media)
December 29, 2007 in Articles: JSA, Projects | Tags: metadata, oais | Comments closed
Susan Thomas and Janette Martin, ‘Using the Papers of Contemporary British Politicians as a Testbed for the Preservation of Digital Personal Archives,’ in JSA Vol 27 No 1, 2006
A Bodleian Library-John Rylands UL project, using staff at the Oxford Digital Library (ODL). Project began in Jan 2005 and was due to finish Feb 2007. Article mentions various initiatives and organisations which probably need researching at some point. “Digital preservation is far too complex, and urgent, an issue for any one organisation to tackle alone; therefore, co-operation and standardisation have become watchwords for a digital preservation community.”
Talks about OAIS. OAIS deliberately eschewed IT or archives jargon, to force both groups to speak the same language, although the language is therefore effectively new to both. The model is designed to be as context-specific as possible. OAIS also acts as a framework for developers of digital repository software: Paradigm itself uses Fedora. Makes the point that the metadata used for an IP varies according to its place in the model. At SIP stage, the metadata is likely to have been supplied by the producer, and will probably be unstructured and non-comprehensive. In the AIP the metadata (called the Preservation Description Information or PDI) is full and structured. At DIP stage the metadata will depend on the designated community, but is likely to be descriptive rather than technical.
Anyway the OAIS concept of ‘designated community’ means that you tend to see solutions for specific contexts, giving rise to digital repository types, eg systems to preserve e-journals in libraries (LOCKSS), web-archiving repositories, electronic thesis repositories etc – again they give some examples of real-world projects here. Sourceforge gets mentioned as an open source software repository (Marathon! Hooray!).
Paradigm is interesting as it deals with the personal records of politicians, and so no standards can be imposed on the creation, management, or disposition of records or submission metadata.
more to do
Review: “A Framework of Guidance for Building Good Digital Collections”
December 22, 2007 in Articles: JSA, Reviews | Tags: metadata | Comments closed
By the Institute of Museum and Library Services (IMLS), 2001. Reviewed by Elizabeth Yakel in JSA vol 23 no 2, 2002.
IMLS is a federal US agency created in 1996 to foster leadership and innovation in US libraries and museums, and in practice funds preservation and access projects. The framework is indeed aimed at those seeking funding. IMLS defines as indicators of ‘goodness’ interoperability, reusability, persistence, verification, and documentation. The principles apply at three levels:
- collection level – need good selection policies
- object level – a good object can be authenticated
- metadata level – “good metadata should be appropriate to the materials of the collection, users of the collection, and intended, current and likely use of the collection.”
Yakel includes in her review some endorsements of the framework by the Digital Library Federation (DLF), a US private sector body. Also mentions the remark, made by Mark Ackerman and Roy Fielding in 1995, that digital preservation is “an oxymoron”.
“The Semantic Web”
December 21, 2007 in Books, Reviews | Tags: metadata, pdf, XML | Leave a comment
A Guide to the Future of XML, Web Services, and Knowledge Management; by Michael C Daconta, Leo J Obrst and Kevin T Smith, 2003. Available from Amazon.
Very propagandistic for the semantic web. Ah well. Have mainly noted the digi pres aspects, ignoring stuff like thesauri and ontologies.
Software companies like to keep data internal to their applications for competitive reasons. Binary formats lock you into applications for the life of the data. But the balance of power is shifting from apps towards data, driven by the interoperability demands of the Internet.
XML: the current primary driver behind XML’s rise and rise is data exchange between and within organisations. The reasons for XML’s success include:
- it creates application-independent documents and data – it is plain text in human readable form
- it supplies a standard syntax for metadata
- it supplies a standard structure for documents and data
- it is already well accepted and proven
- computers are now powerful enough to cope with verbose XML statements
- XML can be easily searched, unlike binary files.
But XML isn’t enough on its own. XML only enables interoperability. People still need to know and understand the element names (which is why semantic web heads spend so long on ontologies). People have to put in extra effort. We as people need to be able to tag our information with machine-understandable markup, to know what information is authentic, and to correlate it with information we already have. Machines can only work with well-defined problems on well-defined data.
How XML works: XML is not a language, but a set of syntax rules for creating semantically rich markup languages in other domains.
- Data: the raw, context-specific values.
- Metadata: the semantics (meaning or purpose) of those values. XML provides a simple way to encode metadata.
- Document: a combination of content and presentation. XML enables metadata markup on both, thereby bridging the gap. The data/presentation split is crucial: the book calls this the “Model-View-Controller paradigm” (MVC) but perhaps this is just “separation of concerns.”
- Element: a container comprising start tag, content, and end tag. The content can include sub-elements.
- Attribute: a name plus a quoted value, eg. src="image.gif". An element can have more than one attribute.
- Tagged content divides the document into semantic parts.
- Well-formed: the XML document obeys all the XML syntax rules. Well-formedness is mandatory.
- Valid: the XML document references and satisfies a schema. Validity is optional, but really all XML documents should be checked for validity before transfer elsewhere. Validation allows true interoperability, and allows XML documents to be broken up into XML fragments for independent use.
- Schema: a separate document which defines the legal elements, attributes and structure of an XML instance document. As such it acts as (a) a template to generate instances, and (b) a validator to ensure document accuracy. DTD (part of the XML 1.0 recommendation) itself has non-XML syntax, lacks data types (eg string, boolean, date etc), and lacks support for namespaces. XML Schema fixes all these. Every instance document has to declare which DTD or schema it adheres to, which is done by a special attribute in the root element, “xsi: …” Schemas are tricky things to write, but by agreeing on a standard schema organisations can produce documents which can be validated, transmitted and parsed by any application regardless of hardware or operating system. [AA later note: METS is an XML Schema.]
- Namespace: a globally unique name for our markup language’s elements and attributes.
- Stylesheet: allows us to specify how an XML document can be presented in different media formats. A stylesheet engine takes an XML document, loads it into its Document Object Model (DOM) alongside the stylesheet, and spits out the resulting document. XSL provides a mechanism for transforming XML documents into other XML documents (XSLT) and a vocabulary for formatting objects (XSL-FO). A stylesheet can even transform XML into executable code. There’s a top diagram on p. 125.
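To make the element, attribute and well-formedness terms above a bit more concrete, here is a small sketch using Python's standard xml.etree.ElementTree parser. The sample markup and the helper name is_well_formed are made up for illustration. The standard library parser only checks well-formedness; validating against a schema would need a third-party library such as lxml, which is not shown here.

```python
import xml.etree.ElementTree as ET

# Invented sample markup: one root element, two sub-elements, two attributes.
GOOD = ('<record><title lang="en">Notes on METS</title>'
        '<image src="image.gif"/></record>')
BAD = '<record><title>Notes on METS</record>'  # <title> is never closed


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


root = ET.fromstring(GOOD)
title = root.find("title")   # element: start tag, content, end tag
image = root.find("image")   # empty element carrying one attribute

print(root.tag, [child.tag for child in root])  # record ['title', 'image']
print(title.text, title.attrib)                 # Notes on METS {'lang': 'en'}
print(image.get("src"))                         # image.gif
print(is_well_formed(GOOD), is_well_formed(BAD))  # True False
```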
PDF: this book says Adobe is (2003) restructuring PDF around RDF, so that the information within a PDF file can be understood even if the software doesn’t know what a PDF is or how to display it.
RDF: resource description framework. An XML-based language to describe resources, usually a file on the web. RDF creates metadata about the document as a standalone entity. It is especially good for ‘opaque’ resources like images or audio. Dublin Core was originally an RDF application.
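A minimal sketch of RDF metadata for an 'opaque' resource, written as RDF/XML with three Dublin Core elements. The two namespace URIs are the standard RDF and Dublin Core ones; the image URL, the property values and the function name describe_resource are invented for illustration and do not come from the book.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def describe_resource(uri, title, creator, fmt):
    """Build an RDF/XML description of one resource and return it as text."""
    # Use the conventional prefixes in the serialised output.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
    )
    for name, value in (("title", title), ("creator", creator), ("format", fmt)):
        ET.SubElement(description, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(rdf, encoding="unicode")


# Hypothetical image resource; the metadata lives outside the image file.
rdf_xml = describe_resource(
    "http://example.org/photos/harbour.jpg",
    "Harbour at dusk",
    "A. Photographer",
    "image/jpeg",
)
print(rdf_xml)
```

The description stands alone: nothing has to open or understand the JPEG itself to learn its title, creator and format.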
