tools for document interoperability

 

This is the web diary for nfoWorks and realization of the Harmony Principles. Pursuing Harmony tracks nfoWorks research, analysis, specification, and implementation of tools for document interoperability. There is commentary on related activities that address conformance, interoperability, and harmonization of document formats.



2009-06-19

 

Office Shots for Confirmed ODF Interchange Fidelity

The new Officeshots.org service received a fair amount of attention at the recent ODF Interoperability Plugfest.  Taking a page from the “test your site with all browsers” tools that are available, Office Shots will take an uploaded ODF document and show how it renders in different ODF-supporting products.  To deal with the problem of confirming appearance of the document back to the submitter, the rendering by each application is captured in PDF.

This is a fledgling service, currently in limited beta.  It is sponsored by the same Dutch organizations that sponsored the ODF Plugfest.

The power of the service is its user-relevant confirmation of the fidelity with which a document of interest is rendered by different ODF-supporting software/platform combinations.  It is an easy way for evaluators to verify whether their important documents are rendered successfully in interchange among ODF products.  It also allows the subjective determination of success to be left in the hands of the users who know what qualifies as acceptable fidelity in each particular case.

One of the most-difficult situations in interchange of documents is when the receiver is seeing something materially different than what the sender (1) had in mind and (2) expects has been communicated.  For the parties to communicate about a suspected difficulty, they need to use a “channel” that differs from the one that has apparently failed.  Screen shots serve that purpose.  PDF is also valuable in the case where a PDF can be extracted that accurately-enough reflects what is intended and/or what is being seen.

Office Shots provides a way to check proactively, either because a problem is suspected with a local rendition or to ensure that a document and its choice of implementation-supported features are treated consistently by a variety of other implementations/platforms.

One can imagine that, over time, we could see Office Shots support links for troubleshooting specific discrepancies, finding practices for avoiding many of them, and easy reporting of problems to development teams.

Office Shots promises to provide a terrific reality-based approach to confirming the interoperability of ODF implementations as far as presentation fidelity is concerned.  This is also a first-line check on confirming difficulties with round-trip inter-product fidelity preservation.  (Of course, if the goal is solely presentation fidelity, PDF and other final-form formats may be preferable, especially when long-term preservation is also a consideration.)

I look forward to the impetus that Office Shots will provide to user recognition of practical ODF interoperability considerations.  I also think it will provide important stimulus and confirmation for developers who want to improve the interoperable use of their ODF-supporting software.


Besides the Officeshots.org site, there are other discussions of the project and its potential:

  • Glyn Moody: ODF and the Art of Interoperability.  Open Enterprise (blog), ComputerworldUK, 2009-06-19.
  • Sander Marechal: Easily testing ODF compatibility (odp, pdf).  Presentation to the ODF Plugfest, 2009-06-15.  [In this case, the PDF renders more poorly than the ODP on my computer.  I assume the problem is in the production of the PDF via the ODP implementation, yet another Officeshots interoperability case.]
  • Sander Marechal: Officeshots.org.  Product submission, OpenDocumentXML.org, 2009-02-06.


2008-10-15

 

Simplifying Speech-Enabled Applications

Via the Cover Pages: XML Daily Newslink for 2008-10-14, I learn that the W3C has standards for speech-enabled/-enabling web applications.  The addition announced today is the W3C Pronunciation Lexicon Specification (PLS) Standard.  This is an accessibility as well as a convenience feature.  PLS is intended to work with Text to Speech (TTS) and VoiceXML applications. 

The PLS lexicon is an XML document and there is allowance for blended use with other namespaces.  This suggests to me that there is prospective use in interchange of office-productivity documents for various purposes.

I don’t expect that this will fit into any foreseeable level of harmonized features.  I am placing this marker because it may well feature in accessibility provisions at some point, even if accomplished via a public-profile agreement involving foreign elements.

The accessibility angle is an important one to keep an eye on for its interoperability, interchange, and preservation potential.


2008-09-07

 

Document Interoperability: The Web Lesson

"are there alternatives to google groups search for searching old USENET messages? because groups date fielded search is teh broken."

-- Richard Akerman on Twitter, 2008-08-31 

Be prepared for a dramatic shift in the reality of web-site browsing and the honoring of web-page standards.   The pending release of Microsoft Internet Explorer 8 is going to put the reality of web standards and their loose adherence in our faces.  Although Internet Explorer is indicted as the archetypical contributor to disharmony on the web, Internet Explorer 8 is going to challenge all of us to deal with the reality of our mutual contribution to the current state of affairs.

Here is a lesson, probably many lessons, for document interoperability and the way that standards for document formats evolve and harmonize, or not, over time.

The Web as Clinical Science

The movement from loosely-standard pages and their browsing to strictly-standard pages and standards-mode browsing will illustrate every aspect of the same challenge for office-productivity documents and the office suites that process them. 

Web pages are the experimental drosophilae of digital documents.  All aspects of dynamic convergence on standards, themselves evolving, and the forces of divergence, are demonstrated clearly and rapidly.  I expect it to take Internet generations for significant convergence, with no static level of standards adherence anywhere in sight.  It took us almost 20 years to get to this point on the Web; I figure it will take at least five more to dig out of it far enough to claim that there is a standards-based web in existence and in practice.  I'm optimistic, considering that HTML 5, the great stabilization, is not expected to achieve W3C Recommendation status until 2012.

No document-interoperability convergence effort is anywhere close to the promising situation of the web as Internet Explorer 8, HTML5 implementations, and other compatibility-savvy browsers roll out over the next several years.  It is useful to use that situation to calibrate how convergence and interoperability could work for document interoperability.  There are significant technical barriers.  The non-technical barriers are the most daunting.  That should be no surprise.

Versioning in Document Use

I've written on Orcmid's Lair about the IE 8.0 Disruption.  This involves changes in Internet Explorer 8.0 by which web pages are rendered in standards-mode on the assumption that pages are conformant with applicable web standards.  In the past, it was presumed that pages were loosely-standard and browsers, also loosely-standard, made a kind of best effort to present the page.  The consequences have been explained marvelously in Joel Spolsky's post on Martian Headsets.

We are similarly relying on document-format standards as a way to provide for many-to-many interchange and interoperability between different (implementations of versions of) document-format standards and different (implementations of versions of) processors of those digital documents.  That means we have a version of the loosely-standard documents with loosely-standard processing problem.  We can't be strictly standard because the standards can't (and definitely don't) have strict implementations at the moment; and there are many ways that specifications and implementations have been kept loose by design.  Accompanying that looseness by design is the simple fact of immaturity among the contending document-format standards for office applications, particularly as vehicles for interoperable applications.

For office-productivity documents as we know and love them, there are five, count 'em five "official standards." 

The "Official" Public Standards of Office Documents

For Office Open XML Format (OOXML), there is the ECMA-376 specification of December 2006.  There is also the ISO/IEC 29500:2008 Office Open XML File Formats standard once it is made available.  IS 29500 will have some substantive differences from ECMA-376.  We won't have a solid calibration of the differences until the IS 29500 specifications are available and subject to extensive review.

For the OpenDocument Format, there is the Open Document Format for Office Applications (OpenDocument) v1.0 OASIS Standard issued 1 May 2005.  There is also the ISO/IEC 26300:2006 Open Document For Office Applications (OpenDocument) v1.0 standard (also on the publicly-available listing).  IS 26300 is for the same format as the OASIS v1.0 standard, but it is on a completely-separate standards progression.  Appendix E.3 accounts for the differences of IS 26300 from the text of the May 2005 OASIS Standard.  The first page of the IS 26300:2006 document (page 5 of the PDF) identifies its source as Open Document Format for Office Applications (OpenDocument) v1.0 (Second Edition) Committee Specification 1, dated 19 July 2006, derived from document file OpenDocument-v1.0ed2-cs1.odt; this is not another OASIS Standard, however.

The second and latest OASIS Standard for ODF is Open Document Format for Office Applications (OpenDocument) v1.1 issued 2 February 2007.  This document is derived from OpenDocument v1.0 (Second Edition) Committee Specification 1, the same specification that is the source of content for ISO/IEC 26300:2006.  The changes made to arrive at ODF v1.1 from the v1.0 (Second Edition) committee specification are detailed in Appendix G.4.  There are some mildly-breaking changes from ODF v1.0 to ODF v1.1, mostly of a clarification or correction nature.  There are a few additional features that have no down-level counterparts in ODF v1.0.

A third OASIS Standard, ODF v1.2, is under development.  The current drafts, using a very-different organization from v1.1, are available as public documents of the OASIS Open Document TC.

We can expect to see more versions of ODF and of OOXML at their various standards venues.  We'll be watching here on nfoWorks as the situation becomes even more chaotic.  Notice that this diversity ignores the variety of divergent implementations of the various specifications.

Format Versions that Live Forever

It is possible for one document-format specification to officially supplant another, with the older specification deprecated.  That has not been done so far with any of the five-and-growing document-format specifications, any more than it has been done for most of the versions of HTML specifications that have been recommendations of the W3C (and IETF before the development track moved entirely to W3C). 

For example, the last full-up specification for HTML, the HTML 4.01 W3C Recommendation of 24 December 1999, has this to say about its immediate predecessor: "This document obsoletes previous versions of HTML 4.0, although W3C will continue to make those specifications and their DTDs available at the W3C Web site."  This was possible because HTML 4.0 was young and there were important defects that 4.01 cured.

The HTML 4.01 specification continues with the following recommendation: "W3C recommends that user agents and authors (and in particular, authoring tools) produce HTML 4.01 documents rather than HTML 4.0 documents. W3C recommends that authors produce HTML 4 documents instead of HTML 3.2 documents. For reasons of backward compatibility, W3C also recommends that tools interpreting HTML 4 continue to support HTML 3.2 [W3C Recommendation 14 January 1997] and HTML 2.0 [IETF rfc1866 November 1995 and the IETF-obsoleting rfc2854 June 2000] as well." 

The XHTML branch of specifications, originally derived from HTML 4.01, was intended as the basis for a future generation.

Meanwhile, there has been work toward both XHTML 2 and HTML 5.0.

HTML 5.0 is currently intended to exist alongside XHTML 1.x and its newer arrangements while also absorbing XHTML 1.x to some degree (by having an XML form).  The current HTML 5.0 draft specifies legacy processing (in its HTML-syntax form) for variations of over 60 HTML DOCTYPE DTD flavors, extending back to HTML 1.0 and other variants.  The intention is to converge HTML and XHTML 1.x under a consistent HTML 5 processing model with only no-quirks, some-quirks, and quirks modes.  This is also intended to end the variation and extension of HTML (not XHTML) by capturing <!DOCTYPE HTML> for its own and having a concrete HTML syntax that is fully-divorced from both SGML and XML.  It is important to point out that HTML 5 is not going to eliminate the divergence that browser (user-agent) plug-in models, plug-in implementations and scripting systems (especially client side) bring to the mix.
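To make the three-mode idea concrete, here is a deliberately tiny sketch in Python.  It is not the HTML 5 draft's algorithm, which covers far more DOCTYPE flavors and conditions.  The function name `rendering_mode` and the two-entry table are my own illustrative choices, and the sketch uses the label "limited-quirks" for what I called some-quirks above.  Anything the table does not cover is reported as `None` rather than guessed.

```python
import re

# Matches a DOCTYPE only at the very start of the page (simplifying assumption).
DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE\s+([^>]*)>", re.IGNORECASE)

# Illustrative subset only: lower-cased public-identifier prefixes and the
# mode each one selects.  The real list is much longer.
LEGACY_PREFIXES = {
    "-//w3c//dtd html 3.2": "quirks",
    "-//w3c//dtd xhtml 1.0 transitional//": "limited-quirks",
}


def rendering_mode(page):
    """Return 'no-quirks', 'limited-quirks', 'quirks', or None if not covered."""
    match = DOCTYPE_RE.match(page)
    if match is None:
        return "quirks"  # no DOCTYPE at all: legacy page
    declaration = match.group(1).strip().lower()
    if declaration == "html":
        return "no-quirks"  # the bare <!DOCTYPE HTML> form
    public_id = re.search(r'public\s+"([^"]*)"', declaration)
    if public_id:
        for prefix, mode in LEGACY_PREFIXES.items():
            if public_id.group(1).startswith(prefix):
                return mode
    return None  # outside this sketch's small table


print(rendering_mode("<!DOCTYPE html><html></html>"))
print(rendering_mode("<html><body>old page</body></html>"))
```

The point of the sketch is only that the mode is selected mechanically from what the page declares about itself, which is exactly the kind of self-description that office documents currently lack.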

Document-format versions are not easily abandoned.  Even if production of a format is deprecated, consumption of the format may need to continue into the indefinite future, and certainly so long as emitters of deprecated formats have significant usage.  The W3C progression of HTML is at a point where that is fully-recognized and being honored in reaching toward an HTML 5 plateau sometime in the next decade.

Considering this promising stabilization, when would I manage to change all of my web sites and blogs to clean HTML 5 pages?  Not until I know that only a small fraction of visits to those sites come from Internet Explorer versions prior to IE8 (or maybe IE9) and other browsers lacking full-up standards-mode processing.  Fortunately, the HTML 5 specification-effort promises to show me exactly how to do that in a mechanical way.  I am looking forward to automated assistance.  In my case, I'll also have the benefit of my IE 8.0 mitigation effort.  Other web sites may require other approaches, and user browser choice will involve important trade-offs for some time.

I am surprised by the number of people who operate multiple browsers.  Although I operate multiple products for office applications these days, that's mostly to explore their interoperable use, not to ensure ability to interchange documents (well, not until I joined OASIS and the ODF TC).  I've been a serial adopter of Internet Explorer versions since IE 2.0.  As a typical late-adopter, I may finally branch out now just to have a better calibration of the migration to standards-based sites and browsers for them.

This is an important lesson for the management of the expanding variety of specifications of formats for office-application documents, formats of which HTML packagings are sometimes one of the flavors.

Reconciling office-application document-format versions does not promise to be so easy as the current effort to stabilize HTML for the web.

The Looseness of Document Specifications

Of course, OOXML and ODF are not close dialects off a single family tree, as HTML variants might be treated (and HTML 5 demonstrates, if successful).  In addition, the current specifications are not for same-conformance, interchangeable-everywhere documents:

  • There are weak conformance requirements.  It is not necessary to implement any particular amount of the specified format: OOXML or ODF.  This is by design.  I don't expect that to change.  There is also no way to indicate how much or how little is accepted and/or produced.  Well, you could look to see what software produced the document, using ODF as our example:

<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    office:version="1.2">
  <office:meta>
    <meta:generator>
        OpenOffice.org/3.0_Beta$Win32 OpenOffice.org_project/300m3$Build-9328
    </meta:generator>
  </office:meta>
</office:document-meta>

This strikes me as even less appealing than the challenge of sites adjusting for browsers and browsers adjusting to HTML DOCTYPE declarations (and their absence).
   
It is not encouraging that the office:version attribute and <meta:generator> element are both optional.  It is unfortunate that the office:version attribute is generally uninformative about the processing requirements for the document file in hand, serving merely as an automatic claim of one specification the document conforms to.  The document is also likely to conform to earlier versions and probably also later versions, although it is unclear how we can determine that easily for a given document representation.

  • Arbitrary "foreign" elements are allowed.  I'm not clear how IS 29500 for OOXML will allow for this kind of thing, but the ODF specifications are justly-notorious for this provision (ODF 1.1, section 1.5):
      
    "Documents that conform to the OpenDocument specification may contain elements and attributes not specified within the OpenDocument schema. Such elements and attributes must not be part of a namespace that is defined within this specification and are called foreign elements and attributes.
      
    "Conforming applications either shall read documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed before validation takes place, or shall write documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed before validation takes place.
      
    "Conforming applications that read and write documents may preserve foreign elements and attributes."
      

    There are some further wrinkles and this proviso:
      
    "Foreign elements may have an office:process-content attribute attached that has the value true or false. If the attribute's value is true, or if the attribute does not exist, the element's content should be processed by conforming applications. Otherwise conforming applications should not process the element's content, but may only preserve its content. If the element's content should be processed, the document itself shall be valid against the OpenDocument schema if the unknown element is replaced with its content only."
      

    As a developer, I love gimmicks like this.  But, basically, this only works with processors that re-encounter document files that they themselves produced.  Anything more coherent requires that the implementers of different processors form some sort of out-of-band, separate-from-the-standard interoperability agreement on particular foreign elements and handling of office:process-content attributes.  Users, confident that their software is "standard," will have frustrating and inexplicable interchange experiences (unless the usual thing is done and everyone agrees to lock in on the same software [version], surprise, surprise).

    OOXML has a versioning scheme that might provide controlled extensions that degrade usefully when processed by implementations of down-level specification versions.  It is unclear at this point whether this is just a more complicated way to end up with the same interoperability problems.

  • Some features require foreign content.   Both OOXML and ODF have features where content is represented by a binary-data part elsewhere in the package.  There is little (OOXML) or no (ODF) indication of what the format of the binary element is and what MIME types are allowed for such document components.  All use of those features and any interchange agreements about them are beyond the current provisions of the relevant document-format standards.
      
    There are other places where implementation-defined values are expected and are expected to be preserved by other implementations.

  • Some values and default selections are implementation-specific.  I was mining in the ODF specification the other day.  I did not expect to find attributes having text on these patterns:
       
    "The value of this attribute is implementation [or application] specific."
      
    "If this attribute is not present, the application might or might not display [whatever]."

    These are relatively minor considering the amount of variability from the other conditions already mentioned.  What's curious about these is the elevation of particular implementation-specific features as specification-favored. In the case of implementation-specific attribute values, there is also the interesting problem of a processor determining whether such a value is intended to have its implementation-specific interpretation or not.  It appears that the related features will only be useful under tightly-restricted interchange conditions.
      
    I will not be surprised to find similar looseness in the OOXML specification, IS 29500.
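The first two items in the list above can be sketched in Python with the standard library.  This is a rough illustration under stated assumptions, not a conformance tool.  The function names are mine.  Treating every namespace that begins with `urn:oasis:names:tc:opendocument:xmlns:` as "defined within the specification" is a simplification, because real ODF documents also rely on Dublin Core, XLink, and other namespaces.  Dropping a foreign element whose office:process-content is false is one choice the quoted text permits; preserving it would be another.

```python
import xml.etree.ElementTree as ET

ODF_NS_PREFIX = "urn:oasis:names:tc:opendocument:xmlns:"  # simplifying assumption
OFFICE_NS = ODF_NS_PREFIX + "office:1.0"
META_NS = ODF_NS_PREFIX + "meta:1.0"
PROCESS_CONTENT = "{%s}process-content" % OFFICE_NS


def namespace_of(name):
    """Namespace URI of an ElementTree tag or attribute name, or ''."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def declared_version_and_generator(root):
    """Both items are optional, so either result may be None."""
    version = root.get("{%s}version" % OFFICE_NS)
    gen = root.find(".//{%s}generator" % META_NS)
    generator = gen.text.strip() if gen is not None and gen.text else None
    return version, generator


def strip_foreign(elem):
    """Remove foreign attributes and elements beneath elem, in place."""
    for name in list(elem.attrib):
        ns = namespace_of(name)
        if ns and not ns.startswith(ODF_NS_PREFIX):
            del elem.attrib[name]

    kept = []

    def add_text(text):
        # Re-attach loose text after the last kept child, or to elem itself.
        if not text:
            return
        if kept:
            kept[-1].tail = (kept[-1].tail or "") + text
        else:
            elem.text = (elem.text or "") + text

    for child in list(elem):
        elem.remove(child)
        if namespace_of(child.tag).startswith(ODF_NS_PREFIX):
            strip_foreign(child)
            kept.append(child)
        elif child.get(PROCESS_CONTENT, "true") == "false":
            add_text(child.tail)  # drop the element and its content
        else:
            strip_foreign(child)  # replace the element with its content
            add_text(child.text)
            kept.extend(list(child))
            add_text(child.tail)
    elem.extend(kept)


SAMPLE = (
    '<office:document-meta xmlns:office="%s" xmlns:meta="%s" '
    'xmlns:x="http://example.com/hypothetical" office:version="1.1" x:flag="1">'
    "<office:meta><x:wrap><meta:generator> ExampleSuite/1.0 </meta:generator></x:wrap>"
    '<x:private office:process-content="false">opaque</x:private>'
    "</office:meta></office:document-meta>" % (OFFICE_NS, META_NS)
)

root = ET.fromstring(SAMPLE)
strip_foreign(root)
print(declared_version_and_generator(root))
```

Notice what the sketch cannot do: it has no way to tell whether the stripped material mattered to the producer, which is the interoperability problem in a nutshell.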

Prospects for Interoperable Convergence

We already face difficulties with interoperable convergence within the progression of a single standard and its variety of implementations.  This makes the prospect of harmonization between different standard formats rather murky.

Desktop office-application software has more promise with regard to application of Postel's Law, to be liberal in what is accepted and conservative in what is produced.  Unfortunately, the current specifications do not require conservative, interoperable implementations; the current specifications are arguably antagonistic to such an achievement.

I suspect that this is an unintended consequence mixed with some inattention to what it takes for interoperability to be achievable. 

It remains to be seen how our experience and understanding mature.  We are at the beginning, not the finish.  The journey may seem endless.


The process of IE 8.0 mitigation and preparation for a standards-mode approach to web browsing impacts this site and blog as well as every other web page I have ever posted (somewhere over 120MB worth and climbing).

I'm not going to say anything more about IE 8.0 mitigation and HTML harmonization here.  The overall effort will be tracked in that category of Professor von Clueless posts; that's the place to follow along.  The lesson for document interoperability is something that is definitely appropriate for Pursuing Harmony; there'll be much more to say about that.


2008-08-19

 

Agreement on Document Rendering

One problem for harmonization of document-format implementations in the case of ODF and OOXML is the degree to which those specifications may provide inadequate specification of how documents are to be presented in order for implementations to be adequately interoperable.  This situation arises between implementations of the same format as well as between different document formats.

Along with establishing clarity on how much agreement in presentation is required, there is the small matter of being able to somehow confirm that an application achieves whatever that level of conformance and interoperability is.

As I suggested in "Interoperable ODF: Finding Ground Truth," there are many difficulties to be conquered in advancing from the current state of affairs to where there is reliable determination that implementations are substitutable in a particular interoperability setting. 

We don't have good ways to even talk about the multiple, interlocking problems that lurk beneath the simple desire to have interchange of documents in standard formats "just work."

It would be helpful, for starters, to at least have a way to describe what a particular document-processing system does in rendering documents that it accepts.  A checklist on the handling of particular features of an electronic document is also useful in determining conformance and interoperability guidance and, perhaps, eventual mechanical verification criteria.

The XML Daily Newslink for 2008-08-18 reports on a contribution that may be useful in this regard, the "First Edition of ISO/IEC 24754: Minimum Requirements for Specifying Document Rendering Systems."  From the scope:

"This International Standard provides an abstract list of the features that a document rendering system may have, thus providing a frame of reference, against which the user and implementor can compare the  features of a document rendering system. However, this International Standard does not direct how each document rendering system should behave.

"This International Standard provides the minimum requirements to specify the features that a document rendering system which transforms formatting objects to rendering output. It may be used as a frame of reference, against which the user, implementer, or software agent may compare the features of a document rendering system.  According to these requirements, the user may express what he or she expects of a document rendering system, the implementer may describe the functionality and capability of the document rendering system that he or she implements, and the software agent may negotiate a minimum set of functionality and capability that are shared across different document rendering system implementations."

The Final Committee Draft (omitting the example in the informative appendix) is available as a 7-page, 74 kB PDF file.
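The negotiation mentioned in the quoted scope can be pictured as nothing more than a set intersection over feature checklists.  The sketch below is my own illustration: the function name and the feature labels are hypothetical and are not taken from ISO/IEC 24754.

```python
def negotiate_minimum(*declared_features):
    """Return the features that every rendering system declares it supports."""
    if not declared_features:
        return frozenset()
    return frozenset.intersection(*(frozenset(f) for f in declared_features))


# Hypothetical checklists for two rendering systems.
system_a = {"font-substitution", "hyphenation", "footnotes"}
system_b = {"font-substitution", "footnotes", "vertical-text"}

print(sorted(negotiate_minimum(system_a, system_b)))
```

The hard part, of course, is not the intersection but agreeing on what each label means and confirming that an implementation really does what its checklist claims.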

Besides the announcement, here's what attracted my attention:

  • The list of features is abstract.  The descriptions of how individual features are handled would normally be in prose, with possible reference to standards applicable to the feature (e.g., font substitution).
      
  • The document assumes an abstract processing model related to that for SGML and XML.  The presumption is that rendering is specified separately from logical document format and content.  It can, of course, be adapted to the rendering provisions of processors for electronic office documents even though rendering conditions are embedded in the specification for the document format and its elements.  So it is definitely minimal and something to start from.
      
  • This work was initiated in 2005.  The Final Committee Draft download file is dated 2007-10-29; its electronic document was last edited on 2007-01-23.  The standard, ISO/IEC 24754:2008 was published on 2008-08-15.  It costs 96 CHF (about $100 US) and is not (yet?) on the list of publicly-available (i.e., free) standards, last updated on 2008-08-08.


 
This work is licensed under a Creative Commons Attribution 2.5 License.
