Infogrid Pacific-The Science of Information

Part 2

G: Content Genre Structures

The sections in this part detail special tagging for complex content, or content structures that can have significant value added through detailed semantic tagging when required or appropriate.

Genre-tagged content is often the target for additional processing, content extraction or special presentation. It needs to be mobile within a general document, or presentable by itself.

There is a mistaken belief that the best retrodigitization tagging is where everything is tagged in fanatical detail. This is not correct unless budgets are very large, quality control methods are very rigorous, and possibly the exercise is being pursued for academic goals. The Infogrid Pacific technology framework and IGP:FoundationXHTML make it easy to carry out incremental tagging and add value as required by business demands. This helps keep costs and revenues matched.

For example, tagging the preliminaries of a book (the title and copyright pages) in detail for content extraction is probably a technical exercise that is far better handled by creating an excellent separate metadata structure. Title and copyright page content can then be tagged as required for presentation, linking and, if necessary, the content substitution and removal that is generally required for eBook and digital editions.

Excellent retrodigitization tags the content that is needed to deliver business strategies. If the task is extraction of dictionary information into a database, then very detailed tagging of dictionary items may be required.

Even then, detailed tagging may not be the best approach to the problem. Extracting content from continuous flowing text is not always sensible. Keying directly into database fields may be the better strategy.
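Where extraction from tagged text is the chosen route, the dictionary case can be illustrated with a minimal sketch. The markup and the class names used here (entry, headword, pos, definition) are assumptions made up for illustration; they are not taken from IGP:FoundationXHTML. The sketch only shows why each database field needs its own tag before it can be pulled out reliably.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these class names are illustrative only and are
# NOT taken from IGP:FoundationXHTML.
SAMPLE = """
<div class="dictionary">
  <p class="entry"><span class="headword">abate</span> <span class="pos">v.</span> <span class="definition">to become less intense</span></p>
  <p class="entry"><span class="headword">abbey</span> <span class="pos">n.</span> <span class="definition">a monastery or convent</span></p>
</div>
"""

# One database column per tagged span class (assumed field names).
FIELDS = ("headword", "pos", "definition")


def extract_entries(xhtml):
    """Return one dict per tagged dictionary entry, keyed by field name."""
    root = ET.fromstring(xhtml.strip())
    rows = []
    for entry in root.iter("p"):
        if entry.get("class") != "entry":
            continue  # ignore paragraphs that are not dictionary entries
        row = {}
        for span in entry.iter("span"):
            name = span.get("class")
            if name in FIELDS:
                # Join nested text so inline markup inside a field is kept.
                row[name] = "".join(span.itertext()).strip()
        rows.append(row)
    return rows


rows = extract_entries(SAMPLE)
for row in rows:
    print(row)
```

Every field in the output exists only because a keyboarder tagged it, which is the cost the surrounding discussion weighs against keying directly into database fields.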

Detailed tagging of content on the basis that one day we may want to, for example, extract the individual speeches of Othello, should only be done if the output requirement has been determined in advance. Will it ever be used?

The art of retrodigitization tagging design is to minimize the amount of labour a human has to do, and to minimize the number of opportunities for quality defects. The art of front list tagging design is to ensure that the tags applied make the XML typesetters' task easier, not more complex, and that e-book formats drop out easily. In both cases the value of the content is defined by what is doable in a business context.

Having made these polemic statements, there are still customers and situations where detailed tagging is required and useful, or at least being paid for one way or another.

A characteristic of complex content is extensive horizontal layout that is essential to preserve either the author's intent or the interpretation of the content. FX does a very good job of identifying content layout for lineated content. Where the internally defined strategies are insufficient, tabular layout is used.

Another characteristic of complex content is extensive block, paragraph and inline styles. Recipes are a notable example.

 © 2005-2012 Infogrid Pacific. All rights reserved.
