APEX@IGP-FX

Infogrid Pacific-The Science of Information

11

Document Structure

Document Type

In FX the document structure is defined by the assembly of a set of document sections in a coherent manner to make a work that has value.

There is nothing in IGP:FoundationXHTML that describes the document genre or type. This information is carried in metadata only. For example, there is no XML element or selector "Book". This is deliberate: it addresses content for all types of documents, with emphasis on the fact that the content must be instantly reusable, remixable and extensible at the section level in many contexts.

FoundationXHTML View of a document

 At its most basic, text document content is:

  • A sequence of character strings (words or other data forms) forming...
  • A stack of paragraphs  (or objects) forming...
  • Sections of some sort forming...
  • A document.

This is the document primary text flow or the galley. This may or may not reflow, be paginated or have interactivity applied depending on the output presentation context.

The flow is interrupted at certain points by document section structures which have defined stop and start points, e.g. Parts, Chapters, Topics, Sections, Web pages, fixed-layout pages, etc.

The flow is enhanced by content structures which maintain their position in the flow to enable the document to be sensible, e.g. lists, extracts, note references.

The flow is also enhanced by content structures which are defined in the flow but can behave separately from it while remaining related to it. They are generally (but not always) referenced from the flow for positional context, e.g. figures, tables, note text.

The flow can also be modified where appropriate by presentation modifiers such as page and column layout structures for a specific presentation environment.

Next there is processed content, which is referenced from the flow, document structures and content structures and positioned back into the flow, e.g. Table of Contents, Indexes, End Notes, etc.

FX uses standard XHTML elements and attributes, and relies on the class attribute to describe the structure adequately for a processor.

Unique to XHTML documents are IDs. They provide linking references and allow the specific referencing of structures. This is valuable everywhere, but especially when content needs to become interactive.

XML structures are further complicated by the fact that retrodigitized material has a parallel pagination structure from the source document which is ideally maintained to make TOC numbers, index numbers, and external references make sense and provide a linking target.

FX goes a little further. It can also maintain original lines where this information is available through the content conversion process. It is generally available from OCR and PDF extracted content. Preservation of lineation allows facsimile reproductions such as the IGP:LineByLine PDF™ to be generated.

Lineation preservation has considerable uses in text proofing and processing environments. For example, IGP:Proof 21 presents content by line with a picture of the line from the original document scan.

Page breaks and line breaks are expressed as HTML span elements. Generally these are created and maintained by the processing applications.

Document Structure vs. Content Structure

FX discriminates between document structure and content structures. Document structures are the dominant section construction of a document that interrupt the flow based on some universally agreed and understood rules. Any FX document must have at least one document structure.

It is important to understand that publisher headings (or sub-section headings) are not part of document structure as defined here. FX regards them as flow elements.

For example a Chapter in a print book generally starts on a recto (right-hand) page, and a Major section break in a business document may also result in a new page in a print presentation context. In addition there are separately maintainable structures such as document Title, Copyright, Contents, Dedication, Appendix, etc. that signify clear breaks in a document.

Content structure consists of items such as title blocks, headers, noteboxes and other items that describe and qualify the purpose of their enclosed content rather than interrupt the document.

FX leaves nesting of document structures to format processors if required. XML has the problem that it can only describe one structure in full well-formed and/or validated files. Rather than force the source document into a pre-defined structure, this is left to the format processor.

There are many uses for content that do not need or want pre-defined XML nesting. There is a standard set of structural descriptors that handle most common (and some uncommon) document needs.

Document Sections

 FX supports a very wide range of document structures which can be extended at any time and which can be mixed in any manner. Generally they are used together in digitization and authoring, but in reuse scenarios may create documents with significantly different structures. For example a book Chapter and a document Topic may be combined in a new document. This means there must be harmony in the basic XHTML structure and naming conventions across a wide range of content types.

A very real example is eBook structuring where half title pages are removed, the copyright is moved to the back of the book and other changes are made.

Structure is tagged as in-line Processing Instructions. If a processor needs to break a document apart for packaging or any other kind of content use it can use its own rules to create the required output.
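The eBook restructuring mentioned above can be sketched as a small processor rule set. This is only an illustration in Python, assuming the flat galley of named div sections that FX uses for books. The id values and the function name restructure_for_ebook are hypothetical, and a real processor would apply its own rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical galley; the id values are invented for this sketch.
GALLEY = """
<div class="galley">
  <div class="frontmatter-rw HalfTitle-rw" id="d1">...</div>
  <div class="frontmatter-rw Title-rw" id="d2">...</div>
  <div class="frontmatter-rw Copyright-rw" id="d3">...</div>
  <div class="body-rw Chapter-rw" id="d4">...</div>
</div>
"""

def restructure_for_ebook(xhtml):
    """Drop half title sections and move copyright sections to the end."""
    galley = ET.fromstring(xhtml)
    moved = []
    for div in list(galley):  # iterate over a copy while removing children
        classes = div.get("class", "").split()
        if "HalfTitle-rw" in classes:
            galley.remove(div)
        elif "Copyright-rw" in classes:
            galley.remove(div)
            moved.append(div)
    # The class values are left unchanged here; relabelling the moved
    # section as backmatter would be a further, processor-specific rule.
    galley.extend(moved)
    return galley

ebook = restructure_for_ebook(GALLEY)
print([div.get("id") for div in ebook])
```

Because the sections sit side by side in a flat structure, the reordering is a simple list operation and no nested wrappers need to be rebuilt.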

Section Processing Instructions

Most presentation tools display the visual aspect of content without particularly requiring the reader to understand the implied or interpreted underlying structure. Frontmatter, body and backmatter structures are self-contained and do not nest within each other.

The primary book structure, when interpreted in an XML-like way, has nested structures to observe the well-formedness rules.

Retro-digitized books also have a parallel physical structure which needs to be maintained for referencing by page numbers, while for front-list books page numbers are generated only when the book is generated as a PDF. The document structure and page structure conflict and this always has to be explicitly addressed.

FX wraps all major document sections into named <div> elements, but keeps them in a flat structure. Book contents are contained in a single galley div element. The XHTML <body> element is not used.

A book structure is illustrated with this XHTML fragment. Note there are two class values with each statement, allowing a processor to understand the part and its membership. Each document section also has an ID, which is omitted for clarity.

<div class="galley">
  <div class="frontmatter-rw Title-rw">
    ...
  </div>
  <div class="body-rw Chapter-rw">
    ...
  </div>
  <div class="backmatter-rw Index-rw">
    ...
  </div>
</div>
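As a rough illustration of how a processor might read the two class values, the following Python sketch lists each section with its group and type. It assumes the first class value names the group and the second names the section type, as in the fragment above. The id values and the function name list_sections are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the example above; the id values are invented.
FRAGMENT = """
<div class="galley">
  <div class="frontmatter-rw Title-rw" id="s1">...</div>
  <div class="body-rw Chapter-rw" id="s2">...</div>
  <div class="backmatter-rw Index-rw" id="s3">...</div>
</div>
"""

def list_sections(xhtml):
    """Return (id, group, section type) for each section div in the galley."""
    galley = ET.fromstring(xhtml)
    sections = []
    for div in galley.findall("div"):  # direct children only: the structure is flat
        classes = div.get("class", "").split()
        # Assumption: class order is "group type", as in the fragment above.
        group, section_type = classes[0], classes[1]
        sections.append((div.get("id"), group, section_type))
    return sections

for section in list_sections(FRAGMENT):
    print(section)
```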

Generally the start of the body of a document is implied from page numbering. However, this is not always the case. A significant number of books, for example, include the Introduction in the body numbering style. There can also be ambiguity about whether sections such as Appendices belong to the backmatter or to the body of the book.

The objective of FX is to remove these ambiguities in the interest of cost effective, future-proofed retrodigitization, and to use parallel XHTML for front list tagging for sophisticated book and document production.

Document Genres

FX is possibly unusual in that nothing in the basic XHTML is used to define the document genre.  There is no root element definition for book, Journal,

Books

The major book structure components available in IGP:FoundationXHTML are presented here in tabular view. These are then further expanded.

These structures are used in a wide variety of books, over many centuries. Obviously not all structures are required in every book, but they are included especially to assist the correct interpretation of classical, historical and organic works.

Please note this list is subject to continuous change and is representative of current named document structure items only.

Major Components    Frontmatter components    Backmatter components   Specials
----------------    ----------------------    ---------------------   --------
Book-rw             Cover-rw                  Epilogue-rw             Advertisement-rw
frontmatter-rw      Series-title-rw           Afterword-rw            Preview-rw
Body-rw             AboutTheAuthor-rw         Conclusion-rw           AboutTheAuthor-rw
backmatter-rw       HalfTitle-rw              Appendix-rw             Review-rw
volume-rw           Title-rw                  Notes-rw                Excerpt-rw
part-rw             Copyright-rw              Glossary-rw
chapter-rw          Dedication-rw             References-rw
chapter-run-on-rw   Epigraph-rw               Index-rw
section-rw          Acknowledgements-rw       Colophon-rw
article-rw          TableOfContents-rw
topic-rw            ListOfIllustrations-rw
topic-run-on-rw     ListOfFigures-rw
                    ListOfMaps-rw
                    ListOfPlates-rw
                    ListOfDrawings-rw
                    ListOfMedia-rw
                    ListOfTables-rw
                    ListOfAbbreviations-rw
                    Foreword-rw
                    Preface-rw
                    Introduction-rw


 This list can be easily expanded and/or modified for any specific project requirements or to address different types of documents.

Other Document Structures

There are many other types of documents which have their own contextual grammar for section breakdown. The physical layout of these can also be considerably different from a book. Articles in an academic work, a magazine and a newspaper are all quite different in scope, size, complexity and purpose.

Following are some major document types and their document sections. Many of these are significantly different to books and do not represent page breaks.

Periodicals (Magazines & Newspapers)

Periodicals have three primary structures: columns, articles and advertisements. Columns and articles are of many types and often have content blocks that continue on different pages.

They have difficult-to-handle structures such as:

  • Page-rw
  • Section-rw
  • SectionContinued-rw
  • Article-rw
  • ArticleContinued-rw
  • Column-rw
  • ColumnContinued-rw
  • advertisement

This list can be extended a lot further to include specific named sections and content blocks such as departments, classifieds (which should be able to go seamlessly in and out of a database), and of course the cartoon section!

Historical Manuscripts

Manuscripts may or may not be turned into text. Where the source document is maintained as an image, the metadata tools of FX can be used effectively to create preliminary and expanding metadata about a document image. Metadata is very important in this work as often there is no text involved or the text is not easily extracted, and certainly not with OCR. Examples include hand-written manuscripts in many languages and on many mediums including palm leaves.

  • covers
  • folio
  • Item

Business Documents

Commercial documents can have parts, but generally don't have chapters. They contain sections and topics which are often arbitrary, author defined section types.

  • title-page
  • document-control
  • Contents
  • body
  • topic
  • section
  • appendix
  • references

Legal documents

Legal documents do not break down into a significant number of parts. They are similar in style to corporate documents, but have some significant internal structures that can be beneficially maintained in template format. Many contracts can be highly repetitive and strictly maintained (such as the IGP Reseller agreement or various software licenses). Maintaining these in template form is very beneficial and productive, and ensures there is no dilution or unwarranted modification of essential clauses.

  • Parties
  • recitals
  • body
  • signatures
  • Schedules, Appendices

Marketing documents (brochures)

Templated documents may not be ideal for "creative" MARCOM. Templating can, however, be very powerful when used with repetitive MARCOM that uses and reuses similar specification items.

  • outside-front
  • outside-back
  • inside-left
  • inside-right
  • inside
  • branding-block
  • product-block
  • contact-block

Learning Collateral

This is a massive set of information which utilizes features of documents and more formal publications such as books.

  • Topic
  • lesson
  • lesson-plan
  • assignment-requirement
  • assignment
  • Self-study

Social Web

  • Web-page-flat
  • blog-page
  • forum-thread-items
  • wiki-page

Each of these content domains is a specialist area in itself. All of them, however, can be treated with a considered XHTML strategy and appropriate metadata, and still interface with domain-specific standards. Some of the interchange standards that can be easily processed from FX are:

  • NewsML - News Markup Language
  • NITF - News Industry Text Format
  • EAD - Encoded Archival Description
  • METS - Metadata Encoding and Transmission Standard

 

Pagination

For back list converted content, the source document page structure can be maintained in FX. This is useful for "broad" index linking. Page breaks occur at the start of the page and mark off all content and structures on that page. Line breaks occur at the end of a line and, if present, must have a space before them to ensure correct line joining.

<p>**********
     <span class="pagebreak-rw" id="page-seq">iv</span>
</p>

This must always be a <span> element. It must always be inside a block or flow element as the first or last item.

Note the “Real Page Number” is the element text. The ID is the source sequence number. This may be discontinuous, but FX should include:

  1. All source document pages in sequence.
  2. All blank pages.

Page-break tags should not be used where a source document does not have referenced pagination (e.g. an HTML file or an extracted word processor file).
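For illustration only, a processor could collect the page markers into a list pairing each source sequence ID with its real page number, for example as a basis for index linking. This Python sketch assumes ids of the form page-4, which is an invented convention for the example. The function name page_map is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical content; the id format "page-N" is an assumption of this sketch.
SAMPLE = """
<div class="galley">
  <p><span class="pagebreak-rw" id="page-4">iv</span>First words of page iv ... </p>
  <p>Last words of page iv. <span class="pagebreak-rw" id="page-5">v</span></p>
</div>
"""

def page_map(xhtml):
    """Return (sequence id, real page number) pairs in document order."""
    root = ET.fromstring(xhtml)
    pages = []
    for span in root.iter("span"):
        if "pagebreak-rw" in span.get("class", "").split():
            # The element text carries the real page number printed in the source.
            pages.append((span.get("id"), (span.text or "").strip()))
    return pages

print(page_map(SAMPLE))
```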

Lineation

Line break markers are at the end of a line, but never at the end of a paragraph. The form is:

<p>**** 
    <span class="linebreak-rw"></span>
****</p>

In a text string context they look like this:

<p>this is a line of content <span class="linebreak-rw"></span>
this is the second line of content <span class="linebreak-rw"></span>
this is a line of content with a hyphen¬ <span class="linebreak-rw"></span>
ation point. This is the last line of content. </p>

Note that all lines, including the closing line of the paragraph, have end spaces. These must not be removed, as their consistent presence enables the automated joining and splitting of content blocks. Without this consistency, automated string joining behaviour is not predictable. This is a critical issue and all IGP creation processors uniformly enforce this condition.
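The value of the consistent end spaces can be shown with a minimal Python sketch that joins the lines of the sample paragraph into continuous text. It assumes the paragraph contains only text and linebreak spans, and that a "¬" followed by the end space marks a word hyphenated across two lines. The function name join_lines is hypothetical and this is not the IGP implementation.

```python
import xml.etree.ElementTree as ET

PARA = (
    '<p>this is a line of content <span class="linebreak-rw"></span>\n'
    'this is the second line of content <span class="linebreak-rw"></span>\n'
    'this is a line of content with a hyphen¬ <span class="linebreak-rw"></span>\n'
    'ation point. This is the last line of content. </p>'
)

def join_lines(xhtml):
    """Join a lineated paragraph into one continuous string."""
    p = ET.fromstring(xhtml)
    parts = [p.text or ""]
    for child in p:
        # Text after each linebreak span starts on a new source line;
        # drop that newline and rely on the end space of the previous line.
        parts.append((child.tail or "").lstrip("\n"))
    text = "".join(parts)
    # Assumption: "¬ " marks a hyphenation point, so the two word halves rejoin.
    return text.replace("¬ ", "")

print(join_lines(PARA))
```

Because every line ends with exactly one space, plain concatenation is enough. If the end spaces were inconsistent, the join would either fuse words or double the spacing.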

There is a complex condition at the end of a page where both linebreak and pagebreak occur. The linebreak is not shown as it is implied by the pagebreak.

<p>this is a page last line of content <span class="pagebreak-rw"></span>
   that goes forward to the next page.</p>

This should be read by a processor as: There is a page break after this paragraph.

<p>Line of content that goes </p>
<p><div class="pagebreak"></div></p>
<p class="para-continue">onto the next page before ending</p>

A processor can split the content for a format that uses paginated presentation, or join the two paragraphs if the format requires continuous text. There is a complex set of continue statements to cover all bodytext variants, lists, noteboxes and other structures that go across pages. These are processed if the original paginated presentation is required in an output presentation context.
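The split and join behaviour can be sketched in Python as follows. This is a simplified illustration that assumes the paragraph holds only text and a single pagebreak span. The function names split_at_pagebreak and join_continued are hypothetical, and the full set of continue statements is not modelled.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<p>this is a page last line of content '
    '<span class="pagebreak-rw"></span>\n'
    '   that goes forward to the next page.</p>'
)

def split_at_pagebreak(xhtml):
    """Split a paragraph at its pagebreak span for paginated output."""
    p = ET.fromstring(xhtml)
    for child in p:
        if child.tag == "span" and "pagebreak-rw" in child.get("class", "").split():
            before = ET.Element("p")
            before.text = p.text  # keeps the end space of the last line
            after = ET.Element("p", {"class": "para-continue"})
            after.text = (child.tail or "").lstrip()
            return before, after
    return p, None  # no page break inside this paragraph

def join_continued(before, after):
    """Rejoin the two halves for a format that needs continuous text."""
    joined = ET.Element("p")
    joined.text = (before.text or "") + (after.text or "")
    return joined

before, after = split_at_pagebreak(SOURCE)
print(ET.tostring(before, encoding="unicode"))
print(ET.tostring(after, encoding="unicode"))
print(join_continued(before, after).text)
```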
