Infogrid Pacific-The Science of Information


SMIL Audio Overlay Production

Preview. This article is not yet finalized. With IGP:Digital Publisher and the IGP:SMIL Toolkit it is possible to create interactive highlighting audio quickly and easily. Updated: 2013-08-21


This article is about creating SMIL files for e-books in general and ePub3 in particular. 

SMIL means Synchronized Multimedia Integration Language. It is an XML language that allows timing, layout, animations, transitions and other things to be scripted and executed by a software application or reading device that understands SMIL XML.

Unfortunately SMIL is a bit old (SMIL 3.0 was released in 2008) and the Internet has moved on. Everything SMIL can do (and more) is now done more easily with Javascript, with ubiquitous browser support available. For example, the AZARDI Interactive Engine supports all of the SMIL capabilities plus a lot more, and it is much easier to construct an event instruction list.

There has not been a lot of mainstream support for SMIL production tools or SMIL presentation tools. Presentation engines are aged and offer limited SMIL functionality. In many respects ePub3 support of SMIL is one of the more significant events in the life of the language.

SMIL is very difficult to author and use, which is probably why there is little or no current browser support. Fortunately the ePub audio overlay requirement is a small subset of what is otherwise a tough and punishing way to create interactive content.

We have started out with full support for SMIL audio overlay in production, and linear audio overlay support in all versions of AZARDI. Apple's minimal SMIL support in the iBooks ePub fixed layout format has also lifted awareness.


Different reading devices will use different methods to handle SMIL. Our approach is to bring the decade-old XML kicking and screaming into 2012 with a heavy dose of Javascript.

If AZARDI recognizes from the manifest properties that a book section contains SMIL, it loads the SMIL XML file and XSL transforms it to a Javascript array. This is very flexible and means we can easily make changes in the future for variations, customizations, and handling additional complexity if and when required.
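As an illustration only, the same flattening idea can be sketched in Python rather than XSL and Javascript. The sketch assumes an ePub3-style overlay in which every par element holds one text and one audio child, and clip times written as plain seconds such as 0.971074s. The function names and the sample file names are hypothetical and are not taken from AZARDI.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "{http://www.w3.org/ns/SMIL}"

# Hypothetical two-cue overlay, used only to exercise the sketch.
SAMPLE_SMIL = """<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="par1">
      <text src="section1.xhtml#azs1"/>
      <audio src="section1.mp3" clipBegin="0.000000s" clipEnd="0.971074s"/>
    </par>
    <par id="par2">
      <text src="section1.xhtml#azs2"/>
      <audio src="section1.mp3" clipBegin="0.971074s" clipEnd="1.738222s"/>
    </par>
  </body>
</smil>"""


def seconds(clock):
    """Convert a plain-seconds clock value such as '0.971074s' to a float."""
    return float(clock.rstrip("s"))


def smil_to_events(smil_xml):
    """Flatten every <par> into one dict: target ID, audio file, begin, end."""
    root = ET.fromstring(smil_xml)
    events = []
    for par in root.iter(SMIL_NS + "par"):
        text = par.find(SMIL_NS + "text")
        audio = par.find(SMIL_NS + "audio")
        events.append({
            "target": text.get("src").split("#", 1)[1],
            "audio": audio.get("src"),
            "begin": seconds(audio.get("clipBegin")),
            "end": seconds(audio.get("clipEnd")),
        })
    return events


if __name__ == "__main__":
    for event in smil_to_events(SAMPLE_SMIL):
        print(event)
```

A flat list like this is easy to reorder, filter or extend, which is the flexibility the paragraph above is describing.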

The Javascript array is then processed into the AZARDI Interactive Engine text-audio event/timeline simple text structure (which humans can edit). This is injected directly into the HTML page and hidden with CSS. The AZARDI interface SMIL audio icons are set to "live". The AZARDI Interactive Engine can immediately access and use the script, and the user can now interact with the audio/text content.

Audio Text Synchronization Level

The primary purpose of SMIL audio overlays in ePub3 is so text can be highlighted in synchronization with audio at any required text granularity.

The primary highlighting granularity options provided in the IGP:SMIL Toolkit are:

  • Word highlighting
  • Phrase highlighting
  • Sentence highlighting
  • Paragraph highlighting
  • Arbitrary or special purpose highlighting

Each option has applicability in certain scenarios, and of course there are times when you want a mix and match of all of them. That just takes more time to produce, and costs increase accordingly.  

Word highlighting

Sequential word highlighting in a reading context is possibly relevant for the young learning reader or second-language learner. It is useful for vocabulary building, but unless the reading is very slow, the interaction can be distracting.

Word highlighting can be used in dictionary and grammar building products as well. Remember there is no requirement for every word to be highlighted. 

0.000000 1.397163
1.397163 2.000738
2.000738 2.514894
2.514894 3.587916
3.587916 4.448568
4.448568 4.884483
4.884483 5.700427
5.700427 6.326356
6.326356 8.818896
8.818896 9.145832
9.145832 9.383350
9.383350 9.592924
9.592924 10.598882
10.598882 10.758158
10.758158 10.945378
10.945378 11.291875
11.291875 11.697052
11.697052 12.580000

This is word highlighting at a slow reading speed.

This is word highlighting at a fast reading speed.
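The 18 timing rows above appear to map one-to-one onto the 18 words of the two demonstration sentences. That pairing is an inference from the counts, not something the listing states. Under that assumption, a minimal Python sketch can pair each word with its clip and compare the two reading speeds. All function names are hypothetical.

```python
SLOW = "This is word highlighting at a slow reading speed."
FAST = "This is word highlighting at a fast reading speed."

# The start and end times listed above, one row per word.
TIMES = """0.000000 1.397163
1.397163 2.000738
2.000738 2.514894
2.514894 3.587916
3.587916 4.448568
4.448568 4.884483
4.884483 5.700427
5.700427 6.326356
6.326356 8.818896
8.818896 9.145832
9.145832 9.383350
9.383350 9.592924
9.592924 10.598882
10.598882 10.758158
10.758158 10.945378
10.945378 11.291875
11.291875 11.697052
11.697052 12.580000"""


def parse_times(block):
    """Turn 'start end' rows into (start, end) float tuples."""
    return [tuple(float(v) for v in line.split())
            for line in block.splitlines() if line.strip()]


def pair_words(words, spans):
    """Attach one (start, end) span to each word, in reading order."""
    if len(words) != len(spans):
        raise ValueError("need exactly one timing row per word")
    return [(word, begin, end) for word, (begin, end) in zip(words, spans)]


def words_per_second(cues):
    """Average reading rate over a run of consecutive cues."""
    return len(cues) / (cues[-1][2] - cues[0][1])


if __name__ == "__main__":
    cues = pair_words((SLOW + " " + FAST).split(), parse_times(TIMES))
    print("slow:", round(words_per_second(cues[:9]), 2), "words per second")
    print("fast:", round(words_per_second(cues[9:]), 2), "words per second")
```

If the pairing holds, the first sentence runs at roughly one word per second and the second at a little over two, which is the kind of pace where word-by-word highlighting starts to flicker.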

Phrase highlighting

This means synchronizing at the minor terminator level. That means commas, dashes, colons and semi-colons. It creates a relatively compelling engagement process with normal reading speeds.

0.000000 0.971074	
0.971074 1.738222	
1.738222 2.816114	
2.816114 3.612394	
3.612394 4.544625	
4.544625 6.059500	
6.059500 6.982020	
6.982020 9.050406	
9.050406 10.817760	
10.817760 13.090073	
13.090073 13.847510	
13.847510 17.081185	
17.081185 21.334488

Wow! Hey, do you like this, I mean this - hang on a minute - phrase based highlighting. I do because: it makes my eyes follow the text; it helps me concentrate; especially with boring content; and, it's like getting a lot of SMS's at the same time. It really fits my 2012 attention span deficit problem.
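Breaking text at minor terminators can be sketched with a single regular expression. This is a minimal sketch, not the IGP:SMIL Toolkit's method: it assumes plain text, treats sentence-ending punctuation as a break too, and expects dashes to be written as a hyphen with a space on each side. The sample is a copy of the demonstration paragraph with its dashes written that way.

```python
import re

# Break after , ; : . ! ? when followed by whitespace, or at a spaced hyphen.
BREAK = re.compile(r"(?<=[,;:.!?])\s+|\s+-\s+")

SAMPLE = (
    "Wow! Hey, do you like this, I mean this - hang on a minute - "
    "phrase based highlighting. I do because: it makes my eyes follow "
    "the text; it helps me concentrate; especially with boring content; "
    "and, it's like getting a lot of SMS's at the same time. It really "
    "fits my 2012 attention span deficit problem."
)


def split_phrases(text):
    """Return the phrases of `text`, split at minor and major terminators."""
    return [piece.strip() for piece in BREAK.split(text) if piece.strip()]


if __name__ == "__main__":
    for number, phrase in enumerate(split_phrases(SAMPLE), start=1):
        print(number, phrase)
```

On this sample the split yields 13 phrases, matching the 13 timing rows listed for the phrase example.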

Here is a screen shot of the Audacity editing window with an IGP:SMIL Toolkit generated label track. The numbers and text cues really help on large files.

The final output delivered a nicely paragraph-aligned and processor-ready label track for final SMIL file updating. The final generated and processed label track file looks something like the example below. This contains the highlight section ID, start-time, end-time and the cue sequence number and text for editing. Currently punctuation and special characters are stripped out. It has been used for Chinese, but only the sequence number survives the process.

#azs1 0.000000 0.971074 1. Wow
#azs2 0.971074 1.738222 2. Hey
#azs3 1.738222 2.816114 3. Do you lik
#azs4 2.816114 3.612394 4. I mean the
#azs5 3.612394 4.544625 5. Hang on
#azs6 4.544625 6.059500 6. phrase bas
#azs7 6.059500 6.982020 7. I do becau
#azs8 6.982020 9.050406 8. it makes m
#azs9 9.050406 10.817760 9. it helps m
#azs10 10.817760 13.090073 10. especially
#azs11 13.090073 13.847510 11. and
#azs12 13.847510 17.081185 12. its like g
#azs13 17.081185 21.334488 13. It really
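A processed label track in the layout shown above is straightforward to read back into records. The following is a minimal Python sketch that assumes whitespace-separated fields in the order ID, start, end, then cue number and text. The Cue type and the function names are hypothetical.

```python
from typing import NamedTuple


class Cue(NamedTuple):
    target: str   # highlight section ID without the leading '#'
    begin: float  # start time in seconds
    end: float    # end time in seconds
    seq: int      # cue sequence number
    text: str     # truncated text cue kept for the human editor


# First lines of the example above, including its stray double space.
SAMPLE = """#azs1 0.000000 0.971074 1. Wow
#azs2  0.971074 1.738222 2. Hey
#azs3 1.738222 2.816114 3. Do you lik"""


def parse_cue(line):
    """Parse one '#id start end N. text' line into a Cue."""
    target, begin, end, rest = line.strip().split(None, 3)
    seq, _, text = rest.partition(".")
    return Cue(target.lstrip("#"), float(begin), float(end),
               int(seq), text.strip())


def find_gaps(cues, tolerance=1e-6):
    """Indexes of cues that do not start where the previous cue ended."""
    return [i for i in range(1, len(cues))
            if abs(cues[i].begin - cues[i - 1].end) > tolerance]


if __name__ == "__main__":
    cues = [parse_cue(line) for line in SAMPLE.splitlines()]
    print(cues)
    print("gaps at:", find_gaps(cues))
```

A gap check like find_gaps is a cheap way to catch a label that was dragged in Audacity without its neighbour following.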

Sentence highlighting

Sentence highlighting is a nice middle ground between phrases and paragraphs. Sentence lengths tend to be relatively uneven but generally do break long paragraphs into relevant parts.

The problem with sentence highlighting is relative paragraph length. Sentence length is highly variable. Even in a single paragraph. Charles Dickens was the master at creating the long, comma separated, noun, adjective, verb, adverb rich sentence, to build a strong mental picture; while keeping the story moving.  So? Bah! Humbug!

Paragraph highlighting

Paragraph lengths in all types of books are highly variable. However a paragraph theoretically does contain the expression of a self-contained theme or idea and has presentation styling rather than punctuation to give it isolation within text. For accessibility this is probably the preferable option, and even for classroom learning it is a highly relevant approach.

Paragraph highlighting is probably the most useful granularity for accessibility, bringing the flow of ideas and interactivity together.

Special purpose highlighting

Language education, learning or training content can be easily enriched with SMIL tagging. It can be used in structures such as vocabularies, glossary words, terms and much more.

There are also other less brutal means of achieving this such as direct Javascript driven interactions.







The tools

IGP:Digital Publisher- primary production

The core content digital production is carried out in IGP:Digital Publisher. This enables simultaneous print, e-book and audio book production from a single master XHTML source. This has to be completed before any audio processing can start, primarily because we need IDs on the content.

IGP:FoundationXHTML has full paragraph IDs by default, so paragraph level audio sync needs no additional work. More granular highlighting options need more processing.

IGP:Digital Publisher also allows the timing information to be directly inserted into the XHTML to allow instant testing, evaluation and quality control.

Audacity-audio production and labels

Audacity is the well known, premier, open source audio editing application. It is relatively easy to learn to use for basic operations.

The main reason for using this application for SMIL production is to create an Audacity label.txt file. This lets you set and fine-tune the text highlight synchronization points to the millisecond. You can also use Audacity for recording your audio and, if required, mixing in a few effects. Or you can spend hours with Audacity and become a budding world class audio engineer.
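Audacity stores an exported label track as plain text with one label per line: start time, end time and label text separated by tabs. A minimal Python sketch of writing and reading that layout follows. The function names are hypothetical, and the "1. Wow" style of label text is simply the cue format used elsewhere in this article.

```python
def format_label_track(labels):
    """Render (start, end, text) tuples as tab-separated label lines."""
    return "".join(f"{start:.6f}\t{end:.6f}\t{text}\n"
                   for start, end, text in labels)


def parse_label_track(data):
    """Read tab-separated label lines back into (start, end, text) tuples."""
    labels = []
    for line in data.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        start, end, text = line.split("\t", 2)
        labels.append((float(start), float(end), text))
    return labels


if __name__ == "__main__":
    track = format_label_track([
        (0.0, 0.971074, "1. Wow"),
        (0.971074, 1.738222, "2. Hey"),
    ])
    print(track, end="")
    print(parse_label_track(track))
```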

IGP:SMIL Toolkit

Creating those complex, annoying SMIL files is the big production issue and must be directly addressed for cost effective AND high quality user experiences. In our system SMIL production is a two-pass process.

Step 1. Generate the Audacity Label track. Edit and fine tune it.

Step 2. Use the Audacity Label track to create the final SMIL file.  

Generating an Audacity Label track file

To alleviate some of the pain from inserting 1-10,000 label points in each Audacity label track for each chapter, we had to write a bit of software. The algorithm works like this.

  1. Set the granularity breaks you want in text. Paragraphs, defined major terminators, defined minor terminators, or words.
  2. Make sure the XHTML files have IDs, and the audio files have the section ID as the file name. Send them to the processor.
  3. The program parses the XHTML and, if the granularity is finer than a paragraph, wraps each unit in a span element with an ID at the chosen SMIL granularity. The otherwise poor-cousin <span> element is suddenly the single most important element in all XML-dom.
  4. Generate a SMIL file, but most importantly an Audacity label track nicely numbered and named for each section.
  5. The label track is created with a two pass algorithm. First it counts all the characters in a file, and creates a count inventory of all the characters in each SMIL ID span tag.
  6. It does a simple character count per span section to timing chart and creates Audacity Label track pass one. This establishes the count of audio sections, and based on character count, a reasonable label offset positioning.
  7. But wait a minute! Audio talent is so damn expressive, and English is such a slurry, "phrasey" language with slippery phonemes, that positioning by character count alone creates a number of problems. The first character count pass doesn't take into account reading pace variation and artistic pauses, accelerations, sound-effects or general nonsense. The character count positions need to be modified by real human reading variations.
  8. Next the (flaky) speech to text algorithm is applied to find key expression points based on phoneme/text extracted positions. A reliability computation creates a series of key-frames. Where the reliability match is sufficient, the character count is recomputed between key-frame expression points. It works most of the time. It fails magnificently from time to time.
  9. Pass all HTML, SMIL and Label tracks back for processing.  
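The character-count idea in steps 5 to 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the toolkit's actual code: the speech to text step is not implemented, so the key-frames are supplied by hand as a mapping from span index to a trusted start time, and all names are hypothetical.

```python
def first_pass(char_counts, duration, start=0.0):
    """Spread `duration` seconds over the spans in proportion to length."""
    total = sum(char_counts)
    if total <= 0:
        raise ValueError("spans must contain at least one character")
    spans, position, seen = [], start, 0
    for count in char_counts:
        seen += count
        end = start + duration * seen / total
        spans.append((position, end))
        position = end
    return spans


def second_pass(char_counts, duration, keyframes):
    """Redo the proportional split separately between trusted key-frames.

    `keyframes` maps a span index to the time at which that span is
    known to start, for example {2: 6.0}.
    """
    anchors = sorted({0: 0.0, **keyframes}.items())
    anchors.append((len(char_counts), duration))
    spans = []
    for (i, t0), (j, t1) in zip(anchors, anchors[1:]):
        spans.extend(first_pass(char_counts[i:j], t1 - t0, start=t0))
    return spans


if __name__ == "__main__":
    counts = [10, 10, 10, 10]
    print(first_pass(counts, 10.0))
    # A slow start: the third span is known to begin at 6.0 seconds.
    print(second_pass(counts, 10.0, {2: 6.0}))
```

With no key-frames the second pass reduces to the first, so a file where the speech matching fails completely still gets the plain character-count positions.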

Step 1. Generate and edit the label track

The HTML and audio files are sent to the processor. It returns the HTML files with granular tagging and generated IDs, plus matching Audacity label tracks.
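What granular tagging means can be shown on a single paragraph. The sketch below is a simplified illustration only: it works on plain text rather than real XHTML with inline markup, breaks only at punctuation, and borrows the azs ID prefix from the label track example in this article. The function name is hypothetical.

```python
import re
from html import escape

# Break after , ; : . ! ? when followed by whitespace.
BREAK = re.compile(r"(?<=[,;:.!?])\s+")


def tag_paragraph(text, prefix="azs", first=1):
    """Wrap each phrase of a plain-text paragraph in a span with an ID.

    Returns the tagged paragraph and the next unused ID number, so the
    numbering can continue into the following paragraph.
    """
    pieces = [piece for piece in BREAK.split(text.strip()) if piece]
    spans = [f'<span id="{prefix}{first + n}">{escape(piece)}</span>'
             for n, piece in enumerate(pieces)]
    return "<p>" + " ".join(spans) + "</p>", first + len(pieces)


if __name__ == "__main__":
    tagged, next_id = tag_paragraph("Wow! Hey, do you like this?")
    print(tagged)
    print("next ID number:", next_id)
```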

The label file is now available for the human-touch fine-tune. It now only takes reasonable effort. The algorithm is moderately obvious, moderately clever, and really needs to grow and evolve. We are on the first step of a thousand mile journey here! Anyone got anything better? We would love to hear about it, see it, test it, etc. We will show you ours if you will show us yours!

Add the human touch

If you are going to do this SMIL production you have to learn to love audio editing. Audacity is brilliant, fantastic, smooth, easy, glorious (OK I have been reading too much Dickens).

Learn Audacity shortcuts, and learn what CTRL-1, CTRL-2 and CTRL-3 do. It's the best damn zooming key combination in the desktop UI business.

With a little practice a chapter can be QC'ed, while applying subtle fine-tuning, in near real time.

Step 2. The final SMIL file

The IGP:SMIL Toolkit requires only that you upload the section HTML file(s), and Audacity label.txt files you edited in Step 1, wait a few seconds and a perfect SMIL file is disgorged. The whole process is nearly fun!
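For readers curious about what comes out of this step, here is a minimal Python sketch that turns edited cues into an ePub3-style overlay with one par element per highlight ID. It is not the IGP:SMIL Toolkit's code. The file names, the par numbering and the choice to write clip times as plain seconds are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_smil(cues, xhtml_name, audio_name):
    """Build overlay XML from (target_id, begin, end) cues, one <par> each."""
    ET.register_namespace("", SMIL_NS)  # serialize SMIL as the default namespace
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for number, (target, begin, end) in enumerate(cues, start=1):
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{number}"})
        # The text element points at the span or paragraph to highlight.
        ET.SubElement(par, f"{{{SMIL_NS}}}text",
                      {"src": f"{xhtml_name}#{target}"})
        # The audio element names the clip that plays while it is highlighted.
        ET.SubElement(par, f"{{{SMIL_NS}}}audio", {
            "src": audio_name,
            "clipBegin": f"{begin:.3f}s",
            "clipEnd": f"{end:.3f}s",
        })
    return ET.tostring(smil, encoding="unicode")


if __name__ == "__main__":
    cues = [("azs1", 0.0, 0.971074), ("azs2", 0.971074, 1.738222)]
    print(build_smil(cues, "section1.xhtml", "section1.mp3"))
```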

Step 3. QC Inspection

For the nervous, there is a Step 3 available. It takes the Step 2 inputs and generates a playable version of the HTML files using the AZARDI Interactive Engine. Once generated they can be played instantly on the desktop.

Just open the XHTML files in a browser of your choice, click through the text to hear the audio, or click play, lean back and just listen, watch or give it to another person to apply a final critical QC eye.

The ePub3 combination is greater than the parts

The generated SMIL files and the audio files are now available in the IGP:Digital Publisher Components directory. The work is done.

You can apply a little personalization on the CSS for your highlighting appearance preferences. You can give the SMIL highlighting effects of your choice book by book or even at fine text granularity.

Make sure you have included all required SMIL metadata.
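As a hedged illustration of what that metadata looks like, the ePub3 package vocabulary defines media:duration (one entry per overlay plus one total) and media:active-class (the CSS class applied to the text currently playing). The Python sketch below formats those entries from durations in seconds. The manifest item ID and the class name are hypothetical, and the function names are not part of any IGP tool.

```python
def clock(seconds):
    """Format seconds as an h:mm:ss.mmm clock value."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def overlay_metadata(durations, active_class="audio-active"):
    """Return package <meta> lines for overlays keyed by manifest item ID."""
    lines = [
        f'<meta property="media:duration" refines="#{item_id}">'
        f"{clock(length)}</meta>"
        for item_id, length in durations.items()
    ]
    # The total duration of all overlays in the publication.
    total = sum(durations.values())
    lines.append(f'<meta property="media:duration">{clock(total)}</meta>')
    # The CSS class the reading system applies to the active text.
    lines.append(f'<meta property="media:active-class">{active_class}</meta>')
    return "\n".join(lines)


if __name__ == "__main__":
    print(overlay_metadata({"section1_overlay": 21.334488}))
```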

Finally. From IGP:Formats on Demand, click the generate ePub3 button. Wait a few seconds and the complete ePub3 audio book package is delivered, validated and ready to start earning its living.


High volume, high quality production of fine-granularity ePubs is non-trivial work whether it is for accessibility, entertainment or learning. Going past a few dozen words or lines in a kiddies' book is not just more of the same. Moving audio book production to the mainstream and making it cost-effective, and as easy as it should be, takes significant tools and attention to detail.

ePub3 has the opportunity to bring audio books out of the Amazon/Audible lock-in, make them much more than an mp3 file, and bring the spoken and written word together in ways that have never been seen before.


Packaging SMIL in ePub3 at APEX@IGP Digital Formats


IGP:Digital Publisher

W3C SMIL 3 Recommendation http://www.w3.org/TR/2008/REC-SMIL3-20081201/

W3C Audio & Video activities information page. http://www.w3.org/AudioVideo/

