Infogrid Pacific-The Science of Information


ePub 3 SMIL Packaging

ePub 3 Packaging-7 It's easy-ish! Updated: 2012-07-28


Like all the ePub3 packaging articles, this one is about how we approached SMIL packaging for ePub3 and the decisions we had to make.

To package SMIL files into an ePub correctly, the SMIL files and their audio files must of course exist. The audio is easy: talent performs into a microphone and the file is committed to digital eternity.

Synchronizing this accurately with text for interaction or highlighting is not as easy, especially with "fine granularity".

Reading devices

At the time of writing, the only available SMIL reading devices are AZARDI and the iPad, the latter only for fixed-layout kids' books.

We have to address ePub3 delivery in the AZARDI Cloud Reader and the AZARDI Desktop Reader, as well as in any other standard ePub3 reader that supports SMIL, if and when one emerges.

The AZARDI Cloud Reader in particular has to get valuable content securely to every user in every browser, everywhere.

Therefore we provide MP3 and OGG or WEBM audio in the package by default. It increases the size of the package, but makes sure that everyone everywhere, in every reader environment can listen to the audio.

At the time of writing, the iPad has simple SMIL capabilities available in fixed-layout ePubs. We have made a few poetry books targeted at that reading device, but the interaction is rather mediocre. All AZARDI avatars from release 11 now support SMIL.

Preparing the package

IGP:Digital Publisher has a special components directory that allows virtually any sort of file to be packaged into an ePub, provided it is referred to from an HTML, CSS or JavaScript file. So after the attention-focused task of creating millisecond-synchronized SMIL files, the files are loaded into the IGP:Digital Publisher Project Components Directory.

There was considerable debate on the file names we should use for the SMIL and audio files. The final approach we chose was XHTML production ID correlation.

This meant that SMIL file generation had to follow HTML production, but that seemed to make more sense than a separate audio timing generation activity that is then post synced to the HTML in a "horrorville" process.

It also meant we could have multiple editors working on the audio for a single book at the same time to shorten total production time.

So the production steps are: 

  1. Correctly tag and QC all your XHTML files that are going into the audio book.
  2. Generate nice and brief IDs at the paragraph level to keep file sizes under control.
  3. Submit the candidate XHTML files to the SMIL Toolkit selecting the grammatical highlighting resolution option required to generate a near-sync ID.txt Audacity label track and candidate ID.SMIL file.
  4. Fine tune the audio/text sync points in Audacity using the generated Audacity label track file.
  5. Submit the finished label track to the SMIL Toolkit to adjust the timing in the *.SMIL XML files.
  6. Upload ID.SMIL files and ID.mp3/ogg audio files to the IGP:Digital Publisher Components directory.
  7. Click the ePub3 button to process the package using IGP:Formats on Demand to ePub 3.

It's pretty linear, but there is a lot of audio listening going on in there! Creating granular audio/text synchronization for a full novel or language training program is not the same as for a 20-words-per-page kids' reader. If you are using phrase-based synchronization, think 3,000 events per chapter.
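As a rough illustration of steps 4 and 5, the sketch below reads an Audacity label track (tab-separated start time, end time and label, in seconds) and writes the timings into SMIL `par` elements. It is not the SMIL Toolkit's actual code: the function names are hypothetical, and it assumes each label text is the ID of the XHTML fragment being highlighted.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"

def parse_label_track(text):
    """Parse an Audacity label track: one 'start<TAB>end<TAB>label' per line."""
    labels = []
    for line in text.splitlines():
        if not line.strip():
            continue
        start, end, label = line.split("\t", 2)
        labels.append((float(start), float(end), label.strip()))
    return labels

def clock(seconds):
    """Format seconds as a full SMIL clock value, h:mm:ss.mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h}:{m:02d}:{s:02d}.{ms:03d}"

def build_smil(labels, xhtml_href, audio_href):
    """Build a SMIL document with one <par> per label (assumed: label == fragment ID)."""
    ET.register_namespace("", SMIL_NS)
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for start, end, frag_id in labels:
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par")
        ET.SubElement(par, f"{{{SMIL_NS}}}text", {"src": f"{xhtml_href}#{frag_id}"})
        ET.SubElement(par, f"{{{SMIL_NS}}}audio", {
            "src": audio_href,
            "clipBegin": clock(start),
            "clipEnd": clock(end),
        })
    return ET.tostring(smil, encoding="unicode")
```

Because the label track carries nothing but IDs and times, several editors can fine-tune different chapters in parallel and the SMIL files can be regenerated at any point.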

The Packaging

So IGP:Digital Publisher has a set of inputs that consists of: 

  1. Beautifully tagged and marked up XHTML with section IDs.
  2. SMIL files with IDs that match the sections.
  3. MP3 and OGG audio files with file names that match the SMIL files.
  4. All the other CSS, and potentially rich media and interactive goodness of an ePub3 file.

Step 1. Including all files

The most interesting packaging challenge in our approach comes from one of the early package optimization steps. The packaging process reads the XHTML, and the CSS referenced from the XHTML, and compiles a list of files for the package so that only files that are actually required end up in the ePub3 manifest. SMIL files and their audio are not referenced from the XHTML or CSS, so this scan alone would not pick them up.

We had to extend our evaluation with an ad hoc SMIL rule:

  1. Check whether any *.SMIL files exist.
  2. Check that they match the source XHTML IDs.
  3. Read the SMIL files to get the audio file references.
  4. Put the *.SMIL files and audio files that pass the evaluation into the final package file list.
  5. If the audio files referenced from the *.SMIL files have format duplicates, register those as well.
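The rule above could be sketched roughly as follows. This is an illustration only, not the IGP:Digital Publisher implementation: the function name, the input shapes and the `ALT_EXTENSIONS` tuple are assumptions, and SMIL `src` values are assumed to be relative to the same directory as the XHTML files.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"
# Assumed set of audio formats that may sit side by side in the components directory.
ALT_EXTENSIONS = (".mp3", ".ogg")

def collect_smil_files(smil_docs, xhtml_ids, available_files):
    """smil_docs: {smil_name: xml_text}; xhtml_ids: {xhtml_name: set of IDs};
    available_files: set of paths present in the components directory.
    Returns the sorted list of SMIL and audio files that pass the checks."""
    package = set()
    for smil_name, xml_text in smil_docs.items():
        root = ET.fromstring(xml_text)
        audio_refs, valid = set(), True
        for par in root.iter(f"{SMIL_NS}par"):
            text = par.find(f"{SMIL_NS}text")
            audio = par.find(f"{SMIL_NS}audio")
            if text is None or audio is None or not audio.get("src"):
                valid = False
                break
            # Rule 2: every text reference must point at an ID in the source XHTML.
            href, _, frag = text.get("src", "").partition("#")
            if frag not in xhtml_ids.get(href, set()):
                valid = False
                break
            # Rule 3: remember which audio file this <par> plays.
            audio_refs.add(audio.get("src"))
        if not valid:
            continue
        # Rule 4: the SMIL file itself goes into the package.
        package.add(smil_name)
        # Rules 4 and 5: the referenced audio plus any format duplicates.
        for ref in audio_refs:
            stem = ref.rsplit(".", 1)[0]
            for ext in ALT_EXTENSIONS:
                if stem + ext in available_files:
                    package.add(stem + ext)
    return sorted(package)
```

A SMIL file that refers to an ID missing from the XHTML is left out entirely in this sketch; a production packager would more likely report it to the editor.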

Step 2. Sorting out the media metadata

As a part of the packaging, the system needs to be able to produce media timing metadata and various other metadata. If the packager knows what files are included, this is relatively trivial to extract and present in the required metadata patterns. Of course the system must have access to the audio talent metadata to include this as a reference.

<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0"
    xml:lang="en" unique-identifier="pub-id">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title id="title">A Christmas Carol Audio Book</dc:title>
  <dc:creator id="creator">Charles Dickens</dc:creator>
  <dc:identifier id="pub-id">a-christmas-carol-audio-book</dc:identifier>
  <meta property="dcterms:modified">2012-03-08T05:57:24Z</meta>
  <dc:publisher>Infogrid Pacific</dc:publisher>
  <meta property="media:duration" refines="#btitle">0:00:03</meta>
  <meta property="media:duration" refines="#intro">0:03:25</meta>
  <meta property="media:duration" refines="#preface">0:01:52</meta>
  <meta property="media:duration" refines="#author">0:00:28</meta>
  <meta property="media:duration" refines="#ch01">0:42:07</meta>
  <meta property="media:duration" refines="#ch02">0:38:38</meta>
  <meta property="media:duration" refines="#ch03">0:49:53</meta>
  <meta property="media:duration" refines="#ch04">0:32:45</meta>
  <meta property="media:duration" refines="#ch05">0:14:21</meta>
  <meta property="media:duration">3:03:32</meta>
  <meta property="media:narrator">LibriVox Team</meta>
  <meta property="media:active-class">correct</meta>

The last nice little step is identifying the media:active-class. This is the class that the engine injects into an XHTML fragment when it is highlighted. We are currently using "correct" as it inherits from the AZARDI Interactive Engine instructional modules. We will probably make that a little more audio-listening topical in a future update.
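The duration entries in the metadata above are simple to generate once the per-overlay durations are known. The sketch below uses the chapter durations from the sample and checks that they add up to the package total. The function names are hypothetical; a real packager would presumably read each duration from the audio file or from the last `clipEnd` in the SMIL file rather than from a hand-written table.

```python
def parse_clock(value):
    """Parse an 'h:m:s' clock value (seconds may be fractional) into seconds."""
    h, m, s = value.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def format_clock(seconds):
    """Format seconds as h:mm:ss with zero-padded minutes and seconds."""
    total = round(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

def duration_metadata(durations):
    """durations: {overlay_item_id: clock_value}. Returns the <meta> lines,
    one per overlay plus a final line with the total for the whole package."""
    lines = [
        f'<meta property="media:duration" refines="#{item_id}">'
        f'{format_clock(parse_clock(value))}</meta>'
        for item_id, value in durations.items()
    ]
    total = sum(parse_clock(value) for value in durations.values())
    lines.append(f'<meta property="media:duration">{format_clock(total)}</meta>')
    return lines

# Per-overlay durations from the A Christmas Carol sample metadata.
durations = {
    "btitle": "0:00:03", "intro": "0:03:25", "preface": "0:01:52",
    "author": "0:00:28", "ch01": "0:42:07", "ch02": "0:38:38",
    "ch03": "0:49:53", "ch04": "0:32:45", "ch05": "0:14:21",
}

for line in duration_metadata(durations):
    print(line)
```

The last line printed is the package total, which is the sum of the nine overlay durations.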

Step 3. Create the manifest

We had to make packaging changes for the manifest creation as well. This is where ePub3 manifests start to look very different from ePub2.

Fortunately this is a relatively mechanical process, except that we include both MP3 and OGG audio in our ePub3 package, because the content must be available online for all browsers as well as on dedicated reading devices. We used OGG as the first audio in the fallback chain, with MP3 at the bottom.

Here is the manifest for A Christmas Carol. The interesting and new (for us) points are:

  1. The inclusion of the media-overlay reference in the primary XHTML items.
  2. The inclusion of the SMIL files themselves.
  3. The use of the fallback property from the OGG audio file to the ePub3 core media type MP3 audio file.

  <item id="toc" properties="nav" href="TOC.xhtml" media-type="application/xhtml+xml"/>
  <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
  <item id="cover-image" properties="cover-image" href="a-christmas-carol.jpg" media-type="image/jpeg"/>
  <item id="s001" href="s001-BookTitlePage-01.xhtml" media-type="application/xhtml+xml" media-overlay="btitle"/>
  <item id="s002" href="s002-Copyright-01.xhtml" media-type="application/xhtml+xml"/>
  <item id="s003" href="s003-Introduction-01.xhtml" media-type="application/xhtml+xml" media-overlay="intro"/>
  <item id="s004" href="s004-Preface-01.xhtml" media-type="application/xhtml+xml" media-overlay="preface"/>
  <item id="s005" href="s005-AboutTheAuthor-01.xhtml" media-type="application/xhtml+xml" media-overlay="author"/>
  <item id="s006" href="s006-Chapter-001.xhtml" media-type="application/xhtml+xml" media-overlay="ch01"/>
  <item id="s007" href="s007-Chapter-002.xhtml" media-type="application/xhtml+xml" media-overlay="ch02"/>
  <item id="s008" href="s008-Chapter-003.xhtml" media-type="application/xhtml+xml" media-overlay="ch03"/>
  <item id="s009" href="s009-Chapter-004.xhtml" media-type="application/xhtml+xml" media-overlay="ch04"/>
  <item id="s010" href="s010-Chapter-005.xhtml" media-type="application/xhtml+xml" media-overlay="ch05"/>
  <item id="btitle" href="s001-BookTitlePage-01.smil" media-type="application/smil+xml"/>
  <item id="intro" href="s003-Introduction-01.smil" media-type="application/smil+xml"/>
  <item id="preface" href="s004-Preface-01.smil" media-type="application/smil+xml"/>
  <item id="author" href="s005-AboutTheAuthor-01.smil" media-type="application/smil+xml"/>
  <item id="ch01" href="s006-Chapter-001.smil" media-type="application/smil+xml"/>
  <item id="ch02" href="s007-Chapter-002.smil" media-type="application/smil+xml"/>
  <item id="ch03" href="s008-Chapter-003.smil" media-type="application/smil+xml"/>
  <item id="ch04" href="s009-Chapter-004.smil" media-type="application/smil+xml"/>
  <item id="ch05" href="s010-Chapter-005.smil" media-type="application/smil+xml"/>
  <item id="audio01" href="audio/s001-BookTitlePage-01.ogg" fallback="audio02" media-type="audio/ogg"/>
  <item id="audio02" href="audio/s001-BookTitlePage-01.mp3" media-type="audio/mpeg"/>
  <item id="audio03" href="audio/s003-Introduction-01.ogg" fallback="audio04" media-type="audio/ogg"/>
  <item id="audio04" href="audio/s003-Introduction-01.mp3" media-type="audio/mpeg"/>
  <item id="audio05" href="audio/s004-Preface-01.ogg" fallback="audio06" media-type="audio/ogg"/>
  <item id="audio06" href="audio/s004-Preface-01.mp3" media-type="audio/mpeg"/>
  <item id="audio07" href="audio/s005-AboutTheAuthor-01.ogg" fallback="audio08" media-type="audio/ogg"/>
  <item id="audio08" href="audio/s005-AboutTheAuthor-01.mp3" media-type="audio/mpeg"/>
  <item id="audio09" href="audio/s006-Chapter-001.ogg" fallback="audio10" media-type="audio/ogg"/>
  <item id="audio10" href="audio/s006-Chapter-001.mp3" media-type="audio/mpeg"/>
  <item id="audio11" href="audio/s007-Chapter-002.ogg" fallback="audio12" media-type="audio/ogg"/>
  <item id="audio12" href="audio/s007-Chapter-002.mp3" media-type="audio/mpeg"/>
  <item id="audio13" href="audio/s008-Chapter-003.ogg" fallback="audio14" media-type="audio/ogg"/>
  <item id="audio14" href="audio/s008-Chapter-003.mp3" media-type="audio/mpeg"/>
  <item id="audio15" href="audio/s009-Chapter-004.ogg" fallback="audio16" media-type="audio/ogg"/>
  <item id="audio16" href="audio/s009-Chapter-004.mp3" media-type="audio/mpeg"/>
  <item id="audio17" href="audio/s010-Chapter-005.ogg" fallback="audio18" media-type="audio/ogg"/>
  <item id="audio18" href="audio/s010-Chapter-005.mp3" media-type="audio/mpeg"/>
  <item id="css-001" href="css/a-christmas-carol.css" media-type="text/css"/>

We let the code flow outside the box. Why not? There are no paper or device screen limitations here.
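Since the manifest follows a fixed pattern (XHTML items, then SMIL items, then OGG/MP3 pairs), it can be generated mechanically. The sketch below reproduces that pattern under some assumptions: the helper names are hypothetical, SMIL and audio files share the XHTML file's base name as in the listing above, and audio lives in an `audio/` directory.

```python
from xml.sax.saxutils import quoteattr

def item(attrs):
    """Serialize one manifest <item> from an ordered dict of attributes."""
    return "<item " + " ".join(f"{k}={quoteattr(v)}" for k, v in attrs.items()) + "/>"

def manifest_items(sections):
    """sections: list of (xhtml_href, overlay_id or None), in reading order.
    Returns <item> lines: XHTML items, then SMIL items, then audio pairs."""
    xhtml, smil, audio = [], [], []
    audio_count = 0
    for index, (href, overlay) in enumerate(sections, start=1):
        attrs = {"id": f"s{index:03d}", "href": href,
                 "media-type": "application/xhtml+xml"}
        if overlay:
            attrs["media-overlay"] = overlay
            stem = href.rsplit(".", 1)[0]
            smil.append(item({"id": overlay, "href": stem + ".smil",
                              "media-type": "application/smil+xml"}))
            ogg_id = f"audio{audio_count + 1:02d}"
            mp3_id = f"audio{audio_count + 2:02d}"
            audio_count += 2
            # OGG is first in the chain and falls back to the core media type MP3.
            audio.append(item({"id": ogg_id, "href": f"audio/{stem}.ogg",
                               "fallback": mp3_id, "media-type": "audio/ogg"}))
            audio.append(item({"id": mp3_id, "href": f"audio/{stem}.mp3",
                               "media-type": "audio/mpeg"}))
        xhtml.append(item(attrs))
    return xhtml + smil + audio
```

Calling `manifest_items([("s001-BookTitlePage-01.xhtml", "btitle"), ("s002-Copyright-01.xhtml", None)])` yields the `s001`, `s002`, `btitle`, `audio01` and `audio02` items in the same form as the listing.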

AZARDI Desktop is based on Mozilla and only understands OGG audio. But the same problem occurs when the audio is required in an all-browser online context. To make the selection and playing of the audio native to a reading application in HTML5, a reading device needs to do something like convert the manifest item into an HTML5 audio element with fallbacks:

    <audio>
      <source src="audio/8-ChristmasCarolStave5.mp3" type="audio/mpeg" />
      <source src="audio/8-ChristmasCarolStave5.ogg" type="audio/ogg" />
      Your reader does not support audio
    </audio>

In AZARDI the SMIL file is XSL processed into a JSON array, converted into an ID-centric text list and then played arbitrarily by the engine.
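To give a feel for what such an ID-centric list might look like, here is a minimal sketch that flattens a SMIL file into a list of entries keyed by fragment ID. It uses Python's ElementTree rather than XSL, the entry shape is an assumption rather than AZARDI's actual format, and it only handles full clock values of the form h:mm:ss.fff.

```python
import json
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"

def clock_to_seconds(value):
    """Convert a full SMIL clock value (h:mm:ss.fff) to seconds."""
    h, m, s = value.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def smil_to_playlist(xml_text):
    """Flatten every <par> into {'id', 'audio', 'begin', 'end'} (assumed shape)."""
    playlist = []
    for par in ET.fromstring(xml_text).iter(f"{SMIL_NS}par"):
        text = par.find(f"{SMIL_NS}text")
        audio = par.find(f"{SMIL_NS}audio")
        playlist.append({
            # The fragment ID after '#' identifies the XHTML element to highlight.
            "id": text.get("src").partition("#")[2],
            "audio": audio.get("src"),
            "begin": clock_to_seconds(audio.get("clipBegin")),
            "end": clock_to_seconds(audio.get("clipEnd")),
        })
    return playlist

sample = (
    '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0"><body>'
    '<par><text src="s006-Chapter-001.xhtml#p001"/>'
    '<audio src="audio/s006-Chapter-001.mp3" clipBegin="0:00:00.000" clipEnd="0:00:02.500"/></par>'
    '</body></smil>'
)
print(json.dumps(smil_to_playlist(sample), indent=2))
```

With a flat list like this, a player can start from any entry by looking up its ID and seeking the audio to `begin`.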

The MP3 audio file name is retrieved from the SMIL file and any fallbacks from the manifest. These are then converted to the HTML5 audio structure and made available to the playlist.
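A sketch of that lookup, under stated assumptions: the manifest is held as a dictionary keyed by item ID, the function names are hypothetical, and the audio file name taken from the SMIL file is present in the manifest. The code finds the fallback chain that contains the file and emits one `source` element per link in the chain.

```python
from html import escape

def fallback_chain(manifest, audio_href):
    """manifest: {id: {'href', 'media-type', optional 'fallback'}}.
    Returns the entries of the fallback chain containing audio_href,
    ordered from the head of the chain to its last fallback."""
    current = next(i for i, e in manifest.items() if e["href"] == audio_href)
    # Map each fallback target back to the item that points at it.
    parent_of = {e["fallback"]: i for i, e in manifest.items() if "fallback" in e}
    visited = {current}
    while current in parent_of and parent_of[current] not in visited:
        current = parent_of[current]
        visited.add(current)
    # Walk forward from the head, guarding against circular fallbacks.
    chain, seen = [], set()
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(manifest[current])
        current = manifest[current].get("fallback")
    return chain

def audio_element(manifest, audio_href):
    """Build HTML5 <audio> markup with one <source> per chain entry."""
    sources = [
        f'  <source src="{escape(e["href"])}" type="{escape(e["media-type"])}" />'
        for e in fallback_chain(manifest, audio_href)
    ]
    return "\n".join(["<audio>", *sources,
                      "  Your reader does not support audio", "</audio>"])

manifest = {
    "audio09": {"href": "audio/s006-Chapter-001.ogg", "media-type": "audio/ogg",
                "fallback": "audio10"},
    "audio10": {"href": "audio/s006-Chapter-001.mp3", "media-type": "audio/mpeg"},
}
print(audio_element(manifest, "audio/s006-Chapter-001.mp3"))
```

The browser then picks the first `source` it can play, so the same markup serves both OGG-only and MP3-only environments.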

In AZARDI the SMIL file is just a definition container that is not actually used in the live playback arena.

Step 4. Wrap the package together

The final step is standard and easy. The ePub3 package is assembled, validated and ready for work. In our packaging all audio files are placed in an audio directory which is a direct child of the main OPS directory that contains the OPF and XHTML files. 
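For readers who have not assembled an ePub by hand, the wrapping step can be sketched with Python's standard `zipfile` module. The one detail that matters is that the `mimetype` entry comes first in the archive and is stored uncompressed. The function name and the file paths in the usage example are hypothetical, with the `OPS/audio/` layout following the description above.

```python
import io
import zipfile

def build_epub(files, out):
    """files: {archive_path: bytes}. Writes an ePub container to the file-like 'out'."""
    with zipfile.ZipFile(out, "w") as zf:
        # 'mimetype' must be the first entry and must not be compressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for path in sorted(files):
            zf.writestr(path, files[path], compress_type=zipfile.ZIP_DEFLATED)

# Usage with placeholder content; real files would be read from disk.
buffer = io.BytesIO()
build_epub({
    "META-INF/container.xml": b"<container/>",
    "OPS/package.opf": b"<package/>",
    "OPS/s006-Chapter-001.xhtml": b"<html/>",
    "OPS/s006-Chapter-001.smil": b"<smil/>",
    "OPS/audio/s006-Chapter-001.mp3": b"\x00",
}, buffer)
```

Validation is a separate step and is not shown here.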


We have a SMIL production environment that works for linear audio books of any size. We still have to address the additional complexities of noteboxes, tables, sidebars and similar author/editor/genre complexities. Those haven't come up as a requirement yet, but we are in a good position (we hope) for when they do.

The system generates ePub3 SMIL audio overlays for audio books of any size. The production system allows the SMIL synchronizing resolution to be at the word, phrase, sentence or paragraph level. The packaging system makes sure everything is ePub3 compliant. The AZARDI reading systems make sure everything works and can be delivered. The SMIL loop is closed. It isn't theory or a kiddies hand-editing playground any more.

This production process + production tool creation + ePub3 reading device development shows that it can all be made to work relatively easily and inexpensively.

Text to speech is coming along fine. It is even fantastic. But for truly expressive and personality based reading, SMIL will deliver a unique experience for a few more years.

ePub3 and SMIL media overlays certainly address accessibility, but they also make rich new entertainment, education, learning and training content with powerful audio features both possible and available everywhere.
