APEX@IGP

Infogrid Pacific-The Science of Information

The Open Packaging Format file

EPUB3 Packaging-2

ePub3 Packaging 2 discusses and illustrates the OPF, or Open Packaging Format, in detail, covering the big three structures (metadata, the manifest and the spine) and the essential Unique Identifier.

Updated: 2012-07-28

The heart of the ePub format is the Open Packaging Format (OPF) file. It contains the ePub declarations, metadata, the manifest and the spine. It is what makes a collection of files an ePub.

Here it is in outline with minimum clutter. If you are not familiar with XML this may look a little confusing, but it is very straightforward.

Wrapped around everything is the root element <package> ... </package>. It opens on the second line and closes on the very last line.

Inside that the three main blocks of data <metadata>...</metadata>, <manifest>...</manifest> and <spine>...</spine> can be seen.

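This example shows the three mandatory metadata items: title, identifier and language. It also carries the required dcterms:modified timestamp, which is covered under Last Modified Timestamp.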

<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en"
    unique-identifier="pub-id">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title id="title">Title</dc:title>
    <dc:identifier id="pub-id">1234567890123</dc:identifier>
    <dc:language>en-GB</dc:language>
    <meta property="dcterms:modified">2011-09-01T00:00:00Z</meta>
</metadata>
<manifest>
    <item id="toc" properties="nav" href="toc.xhtml" media-type="application/xhtml+xml" />
    <item id="cover-image" properties="cover-image" href="cover.jpg" media-type="image/jpeg" />
    <!-- one item per remaining file: <item id="..." properties="..." href="..." media-type="..." /> -->
</manifest>
<spine>
    <!-- linear is "yes" or "no"; properties may carry page-spread-left or page-spread-right -->
    <itemref idref="toc" linear="yes" />
</spine>
</package>

Unique Identifier

The unique identifier is mandatory. In an attempt to get this under more control, the unique-identifier attribute on the package element now points to the id of a dc:identifier element (using the scheme of your choice). That element contains whatever identifier you are using for the ePub, generally an ISBN-13 for book publishers.

If a book does not have an identifier available in the production system metadata when an ePub generation request is made, our packager puts in a default "NOID".

We have been creating ePubs in volume since the ePub specification was released in September 2007. In those learning days we generated UIDs by referencing the DateTime stamp and the unique-identifier to maintain file control based on dates. The ePub3 dcterms:modified property uses a similar generation method and replaces our generated unique-identifier.

In ePub3 the unique-identifier attribute references the id of the dc:identifier element, and together with the modified property it gives the same set of values. This is a small, but nice change.
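The lookup from unique-identifier to dc:identifier can be sketched in a few lines of Python. This is a minimal illustration, not production packager code: the function name, the sample OPF and the choice to apply the "NOID" fallback at read time are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample package, trimmed to the metadata block.
SAMPLE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Title</dc:title>
    <dc:identifier id="pub-id">1234567890123</dc:identifier>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2011-09-01T00:00:00Z</meta>
  </metadata>
</package>"""


def resolve_unique_identifier(opf_text, default="NOID"):
    """Return the text of the dc:identifier that unique-identifier points to.

    Falls back to `default` when the attribute, the matching element,
    or its text is missing.
    """
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(opf_text.encode("utf-8"))
    target_id = root.get("unique-identifier")
    if target_id is None:
        return default
    for identifier in root.iter("{%s}identifier" % DC_NS):
        text = (identifier.text or "").strip()
        if identifier.get("id") == target_id and text:
            return text
    return default


if __name__ == "__main__":
    print(resolve_unique_identifier(SAMPLE_OPF))
```

A package may carry several dc:identifier elements; only the one whose id matches the attribute on the package element counts as the unique identifier.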

Metadata

We process the minimum possible metadata into an ePub3 at present. The items the specification mandates are title, identifier and language. When ePubs are distributed they usually go with a large amount of ONIX or spreadsheet metadata. There seems to be little point in putting a lot of metadata into the file itself at present.

The metadata that should go into the ePub is the information that will make it easier to locate the book in a Reader. Our recommendation is: 

  • Title
  • Identifier
  • Language
  • Author
  • Publisher
  • Description
  • Subject (In text - no BISAC or BIC. That should be in external metadata)
  • Date Published

There may, however, be library and archive situations where other metadata vocabularies and more extensive terms could be valuable. But that is tomorrow and can be addressed if and when the demand arises.

The ePub3 specification has gone to a lot of trouble to make metadata packaging as complex as possible to address all possible user metadata systems. There were obviously forces at work. Bless them.

We package the full DC-15 metadata fields if they are available from a publisher, but they seldom are. It seems very few publishers have strong metadata strategies that assist the description and classification of a document for an end-user.

As reader implementations and the use of ePub3 evolve we will see what metadata is of value in the ePub3 internal context. Our general view is that the more metadata that is encapsulated in a digital format, the harder it is to maintain over time with production volume and velocity. Dynamic, born-digital and updated products are all going to introduce their own issues here.

In our packaging, mandatory descriptive metadata uses the Dublin Core scheme and is: 

  • dc:title
  • dc:identifier
  • dc:language
  • dc:creator
  • dc:date
  • dcterms:modified (using meta property)
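A list like this is easy to check mechanically before a package is zipped. The sketch below is one possible approach in Python; the function name, the REQUIRED_DC tuple and the sample OPF are hypothetical and simply mirror the list above.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Dublin Core elements from the list above (names are local, without the dc: prefix).
REQUIRED_DC = ("title", "identifier", "language", "creator", "date")

# Hypothetical sample that deliberately leaves out dc:date.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Title</dc:title>
    <dc:identifier id="pub-id">1234567890123</dc:identifier>
    <dc:language>en</dc:language>
    <dc:creator>Author Name</dc:creator>
    <meta property="dcterms:modified">2011-09-01T00:00:00Z</meta>
  </metadata>
</package>"""


def missing_house_metadata(opf_text):
    """Return the names of listed metadata items that are absent or empty."""
    root = ET.fromstring(opf_text)
    metadata = root.find("{%s}metadata" % OPF_NS)
    if metadata is None:
        return ["metadata"]
    missing = []
    for name in REQUIRED_DC:
        element = metadata.find("{%s}%s" % (DC_NS, name))
        if element is None or not (element.text or "").strip():
            missing.append("dc:" + name)
    # dcterms:modified lives in a <meta property="..."> element, not a dc: element.
    modified = [
        m for m in metadata.findall("{%s}meta" % OPF_NS)
        if m.get("property") == "dcterms:modified" and (m.text or "").strip()
    ]
    if not modified:
        missing.append("dcterms:modified")
    return missing


if __name__ == "__main__":
    print(missing_house_metadata(SAMPLE_OPF))
```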

Last Modified Timestamp

This is another slightly quirky property defined as mandatory in ePub3. Or more correctly, it is a required property that has been quirkily defined!

It is a production generation date-time stamp. There are a number of non-specific examples in the specification as to when the date should be changed.

The rules appear a little optimistic in their complexity. Our system generates a new date-time stamp every time an ePub is generated. That is within the rules, as the date-time stamp should be stable as soon as it goes into distribution. If for any reason we are requested to make changes to an ePub that has already gone into retail distribution, the new ePub will always have a new date-time stamp generated irrespective of the complexity of the changes. I think that covers this requirement.
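Generating the stamp itself is simple. The ePub3 specification expects the value as UTC in the form CCYY-MM-DDThh:mm:ssZ. The Python sketch below shows one way to produce it at generation time; the function names are made up for the illustration.

```python
from datetime import datetime, timezone


def modified_timestamp(now=None):
    """Return a UTC stamp in the CCYY-MM-DDThh:mm:ssZ form.

    `now` is optional so a fixed time can be passed in for testing.
    A naive datetime is treated as local time by astimezone().
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def modified_meta(now=None):
    """Return the complete meta element for the OPF metadata block."""
    return '<meta property="dcterms:modified">%s</meta>' % modified_timestamp(now)


if __name__ == "__main__":
    print(modified_meta())
```

Calling modified_meta() once per generation run matches the policy described above: every generated ePub gets a fresh stamp, and the stamp only changes again when the package is regenerated.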

Additional Metadata Properties

The specification has a number of other "forward-looking" metadata properties.

We are not supporting packaging of any of the specified extended metadata vocabularies at present. AZARDI is also not supporting them at read time.

These may be added to the packaging options with metadata crosswalks if required in the future. We can easily add standard vocabularies such as ONIX, but it costs a lot more in implementation and assembly complexity. These will be evaluated for packaging strategies as, if, and when they become relevant.

It's one thing to type up an example of variable metadata strategies. It is a different thing to implement them in a production facility producing dozens of ePub books (and other formats) for dozens of publishers every day.

Manifest

The manifest is a list of all files in the package. It must be complete and all files must be referenced.

We have significantly updated our manifest presentation rules to make it easier for a production engineer to interpret and check the content if there is a requirement to un-zip the package (and there sometimes is).

The manifest is very important for a reading device and the new properties attributes are very valuable. The only vague properties value is remote-resources, which is particularly frustrating to decode from the specification.

AZARDI specifically doesn't support it and we have to wait and see if any other reader does, and whether it is important to publishers to be able to deliver online resources to ePub Readers. Magazine publishers may have such a requirement.

AZARDI has quite elaborate on-load methods to make sure all pages that can be displayed will display. If it finds missing files it will issue an appropriate warning but continue to try to display the content. It does this primarily by doing a file inventory check against the manifest as a package is loaded.
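The general idea of an inventory check can be sketched in Python. This is not AZARDI's code, only an assumed approach: compare the hrefs in the manifest with the file names in the zip, and report both directions. The function names and the demo package are hypothetical, and the OPF path is passed in directly rather than read from META-INF/container.xml to keep the example short.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

OPF_NS = "http://www.idpf.org/2007/opf"


def inventory_check(epub_file, opf_path):
    """Return (missing, unlisted) for a zipped ePub.

    missing  : manifest hrefs with no matching file in the zip
    unlisted : files in the zip that the manifest does not mention
    """
    with zipfile.ZipFile(epub_file) as zf:
        names = {n for n in zf.namelist() if not n.endswith("/")}
        root = ET.fromstring(zf.read(opf_path))
    base = posixpath.dirname(opf_path)
    listed = set()
    for item in root.iter("{%s}item" % OPF_NS):
        href = unquote(item.get("href", ""))
        if not href or "://" in href:
            continue  # skip empty hrefs and remote resources
        # hrefs are relative to the folder that holds the OPF file
        listed.add(posixpath.normpath(posixpath.join(base, href)))
    ignored = {"mimetype", opf_path}
    missing = sorted(listed - names)
    unlisted = sorted(
        n for n in names - listed - ignored if not n.startswith("META-INF/")
    )
    return missing, unlisted


def build_demo_epub():
    """Build a tiny in-memory package with one missing and one unlisted file."""
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"><manifest>'
        '<item id="toc" href="toc.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="cover-image" href="images/cover.jpg" media-type="image/jpeg"/>'
        "</manifest></package>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", "<container/>")
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/toc.xhtml", "<html/>")
        zf.writestr("OEBPS/extra.css", "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(inventory_check(build_demo_epub(), "OEBPS/content.opf"))
```

A reader can warn on the missing list and carry on, while a packager would more likely treat either list as a reason to stop.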

The Manifest is so important it is dealt with further in its own article.

Spine

The spine creates the reading order of a document. It must contain at least one item. Our new "art-IDs" mean the spine is simple and easy to read.

The big deal with the spine for most publishers is the sequence: what gets put in and left out, and what is marked linear="yes" or linear="no" in that reading order. It is important for an assembly system to be able to address reordering of the spine and implement the various options thought up by marketing and digital content experts justifying their existence.

<spine page-progression-direction="ltr">
    <itemref idref="landmarks" linear="yes"/>
    <itemref idref="toc" linear="yes"/>
    <itemref idref="s001" linear="yes"/>
    <itemref idref="s002" linear="yes"/>
    <itemref idref="s003" linear="yes"/>
    <itemref idref="s004" linear="yes"/>
    <itemref idref="s005" linear="yes"/>
    <itemref idref="s006" linear="yes"/>
    <itemref idref="s007" linear="yes"/>
    <itemref idref="s008" linear="yes"/>
    <itemref idref="s009" linear="yes"/>
    <itemref idref="s010" linear="yes"/>
    <itemref idref="s011" linear="yes"/>
    <itemref idref="s012" linear="yes"/>
    <itemref idref="s013" linear="yes"/>
    <itemref idref="s014" linear="yes"/>
</spine>
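To make the relationship concrete, here is a small Python sketch that resolves each itemref through the manifest and splits the result into the linear reading order and the non-linear items. The sample package and the function name are assumptions for the illustration; a missing linear attribute is treated as "yes", which is the default.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical sample with one non-linear item.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="toc" href="toc.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="s001" href="s001.xhtml" media-type="application/xhtml+xml"/>
    <item id="s002" href="s002.xhtml" media-type="application/xhtml+xml"/>
    <item id="notes" href="notes.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine page-progression-direction="ltr">
    <itemref idref="toc" linear="yes"/>
    <itemref idref="s001"/>
    <itemref idref="notes" linear="no"/>
    <itemref idref="s002" linear="yes"/>
  </spine>
</package>"""


def reading_order(opf_text):
    """Return (linear, auxiliary) lists of hrefs in spine order."""
    root = ET.fromstring(opf_text)
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    linear, auxiliary = [], []
    for ref in root.findall("opf:spine/opf:itemref", NS):
        idref = ref.get("idref")
        if idref not in hrefs:
            raise ValueError("spine references unknown manifest id: %r" % idref)
        # linear defaults to "yes" when the attribute is absent
        target = linear if ref.get("linear", "yes") == "yes" else auxiliary
        target.append(hrefs[idref])
    return linear, auxiliary


if __name__ == "__main__":
    print(reading_order(SAMPLE_OPF))
```

Reordering the spine is then a matter of reordering the itemref elements; the manifest does not need to change.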

There are many arguments and issues that come up here and will be discussed at length. ePub3 has plugged some of the ePub2 gaps, such as the cover not being explicitly handled.

So should the cover be in the spine? The cover image is now explicitly available to the Reader. There doesn't even have to be a cover XHTML page. We say let the reader handle cover presentation and don't clutter the spine.

Should the TOC be in the spine? With ePub 2, publishers of more complex books liked to put a replica of the Book TOC into the spine order and have it linked from Chapter titles. This was to compensate for the weakness of TOC strategies in every reader on the market.

Having to jump-link to a distant TOC page and then page-link back to a selected section is nothing short of painful. This was such a big anti-user issue we designed the unique side-float TOC into AZARDI. It's available all the time and allows easy traversal of even the largest books.


AZARDI

You can see ePub3 files in operation in AZARDI.

The AZARDI desktop reader has very high conformance support for ePub2 and ePub3.

AZARDI supports MathML, SVG, external references and JavaScript.

It supports non-standard audio and video - WebM and OGG. This is because it is based on Firefox and open-source standards.

