tools for document interoperability

 
 

This is the web diary for nfoWorks and realization of the Harmony Principles. Pursuing Harmony tracks nfoWorks research, analysis, specification, and implementation of tools for document interoperability. There is commentary on related activities that address conformance, interoperability, and harmonization of document formats.




2009-12-05

 

ODF Implementation-Support Toolkits and Libraries

I have no appraisal of the relative maturity and quality of the various toolkits that are emerging on the ODF scene (and likewise with regard to OOXML).  However, it is important to have a cataloging of what there is.  This is a random start.  I will add to this post and build an nfoWorks catalog page later:

  • lpOD: languages & platforms OpenDocument Project (also Français).
    Definition of a Free Software API implementing the ISO/IEC 26300 standard.
    Development, for higher-level use cases, of a top-down oriented API in the Python, Perl and Ruby languages.
    Licensing is under Free Software Foundation (FSF) versions.

My interest

An important resource for ways to harmonize document formats involves attention to the libraries and models employed for constructing document-centric software and their applications.  This applies for the development of testing and conformance tools as well as for implementation of format-supporting software products.  Indeed, one might reasonably expect that such tools would be a companion demonstration of implementation-support quality.

In the interesting case of OpenDocument Format, the availability of open-source code bases for implementations is both a risk (in that deviations or omissions in support for the standards are perpetuated through code mimicry) and an opportunity for faster tooling and testing.  Of course, closed-source implementations (and related toolkits) have their own dangers in this regard, while denying public inspection of the code.  I suspect that implementation notes are required in all cases to ensure understanding of intentions and interpretations as well as limitations and the different ways that discretionary matters are handled.

For ODF, the continuing work on toolkits and on independent open-source implementations is providing important diversity.  This can inform the search for a harmonious profile and perhaps suggest adaptations that encourage harmonious implementations.  Diversity across platforms and programming models may also help in the recognition and abstraction of essentials away from implementation incidentals.  That can also be valuable in ensuring that harmonization is on essentials and not accidents of implementation.

I will be reviewing available toolkits, libraries, and APIs as I define my own around interface contracts for abstracted levels of document models and processing support.  I expect some cross-fertilization while adhering to a model that is concentrated on harmony.


2009-06-15

 

ODF Interoperability at The Hague

There’s a great event at The Hague these two days: June 15-16, 2009.  It’s all about OpenDocument Format (ODF) and interoperability.

It is sponsored by a neutral (ODF-supporting) organization. It is attended by major implementers of ODF-supporting products, including IBM, Microsoft, and Sun Microsystems.

In short, all of the right people are in the same room, some for the first time, and I am so envious that I am not among them.  There should be a great deal of creative tension.

I will be watching for materials and progress reports.  There is already Doug Mahugh’s useful pre-event post on how Microsoft tested the ODF implementation in Office 2007 SP2 to ensure that it only produced standard-conforming documents and failed in ways that did not introduce security exploits against the Office System or documents of its users.

I have been meaning to post more about my involvement with ODF and how it is fueled by my interest in the harmonious level at which we can start and expand interoperability based around standard, open formats for office-productivity applications.  I will do that separately.  For now, I just want to register my excitement for the positive stage that participation at this meeting represents.


[Update 2009-06-16-18:56Z There are little odds and ends available from the ODF Plugfest so far, and I will compile some links here for safe-keeping.  I am sure there will be additional blog posts and reports by more attendees after they have had some time for reflection]

[Update 2009-06-17-17:11Z with a few more straggling in]

[Update 2009-06-18-17:51Z as other posts show up]

[Update 2009-06-23-14:55Z with some stragglers]

[Update 2009-06-24-18:55Z and one more interesting appraisal]

  • 2009-06-23 Sven Langkamp: ODF Plugfest.  (blog post) Sven’s Blog.  Useful perspective regarding participation by KOffice, an independent implementation of the ODF specification.

[Update 2009-06-27-21:40Z and the hits keep on coming …]

[Update 2009-07-01-15:25Z wrapping up, with anything more on plugfests in future posts]


2009-06-09

 

ODF and IPR/Licensing Concerns

Here are some apple-orange notions that have come to my attention in an oddly-convergent way.

New OASIS Technical Committee IPR Mode

OASIS has just announced the pending addition of a 4th IPR Mode to the set that technical committees can use as the way intellectual property (mainly essential claims of patents) will be made available to adopters of a TC-produced specification:

  1. RAND Mode, requiring the essential IPR of participants and contributors to be licensable under Reasonable And Non-Discriminatory terms
  2. RF on RAND Terms Mode, a Royalty-Free RAND mode that may have certain limitations
  3. RF on Limited Terms Mode, where the limitations allowed to RF on RAND Terms are not allowed
  4. Non-Assertion Mode, the new mode in which all contributors and participants make a non-assertion covenant with regard to the specifications that obligate them to do so

The ODF TC operates under the RF on Limited Terms Mode, the most-generous mode available until now.  As stated under the OASIS IPR Policy, a TC may not change its IPR Mode without closing and submitting a new charter.  I don’t expect such a shut-down and restart to happen, especially before ODF 1.2 becomes a ratified OASIS Standard.

Many will welcome this new mode.  I know that my willingness to participate in OASIS Technical Committee activities increases exponentially as we move down the list.  The RF on Limited Terms and the new Non-Assertion modes are the only ones that I have no hesitation about. 

The Non-Assertion Mode is comparable to everyone obligated by the IPR mode having automatically made an equivalent of the Microsoft Open-Specification Promise with regard to the specifications produced by the TC during their participation. 

Of course contributors, participants, and anyone else can provide non-assertion covenants with regard to any specification, as Sun Microsystems did for ODF in September, 2005.

Implementation License Models and Interoperability

The licenses under OASIS IPR modes apply to implementations of the applicable specifications, such as ODF.

I have recently been dealing with provisions of the ODF specification that do not seem to be understandable on their own, not even by consulting referenced source materials.  In that case, there is no way to ensure interoperability without consulting an implementation or two.  In complex cases (such as figuring out how to decrypt an ODF document that is encrypted using the approach sketched in the ODF specification), it is actually necessary to inspect code to determine what the missing but essential details might be.  (It would be better to find implementation descriptions that explain how the specification is being satisfied, but too often the code is the only reliable implementation description.)
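As a first step short of reading implementation code, one can at least see what a package declares about its own encryption. The following is a minimal sketch, assuming the package is a ZIP archive whose META-INF/manifest.xml marks encrypted entries with a manifest:encryption-data child element. The function names and the tiny stand-in package are hypothetical, made up for illustration. The sketch only lists which entries are encrypted; it does nothing toward the missing details of key derivation or decryption.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the ODF package manifest.
MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"


def encrypted_entries(package):
    """Return the full-path of every manifest file-entry that has encryption-data."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("META-INF/manifest.xml"))
    paths = []
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        if entry.find(f"{{{MANIFEST_NS}}}encryption-data") is not None:
            paths.append(entry.get(f"{{{MANIFEST_NS}}}full-path"))
    return paths


def demo_encrypted_package():
    """Build an in-memory stand-in package (hypothetical values, illustration only)."""
    manifest = (
        f'<manifest:manifest xmlns:manifest="{MANIFEST_NS}">'
        '<manifest:file-entry manifest:media-type="text/xml"'
        ' manifest:full-path="content.xml" manifest:size="100">'
        '<manifest:encryption-data manifest:checksum-type="SHA1/1K"'
        ' manifest:checksum="AAAA"/>'
        '</manifest:file-entry>'
        '<manifest:file-entry manifest:media-type="text/xml"'
        ' manifest:full-path="meta.xml"/>'
        '</manifest:manifest>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("META-INF/manifest.xml", manifest)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(encrypted_entries(demo_encrypted_package()))
```

Pointing encrypted_entries at a real document saved with a password shows which parts would need the undocumented details before they could be read.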

When the code is available in an open-source implementation, it may be possible to reverse-engineer an implementation-independent interoperable interpretation.  That is what I would look for, assuming that I could master such code well enough to resolve questions the specification leaves open. 

Consulting code works for detective work around clarification and hole-filling of the specification.  If I want to make an implementation based on that interpretation, I must be especially careful about the license on that code.  For example, LGPL and GPL code and other reciprocal-license open-source software is not useful to me in producing software under a license that I prefer (Open BSD, Apache, etc.).   I am cautious about digging around in voluminous code anyhow, but I am particularly wary about risking that I might copy GPL code.

In this case, I am reluctant to rely too strongly on an abstracted interpretation unless the specification itself is updated and issued with an interpretation I can then safely rely on.

In effect, specifications that are sufficient for implementation-independent achievement of interoperability, along with royalty-free licenses or covenants, provide the ultimate clean-room support for achievement of unencumbered independent implementations.

That’s what I’m after.


2009-03-29

 

Open Government Data: Simple Principles

I finally noticed the Open Government Data Principles and associated Open Government Data site and Wiki (via Doc Searls).

It strikes me how much simpler and well-framed this discussion is, contrasted with the over-stated manifesto for document freedom.

Somehow, when it is about simply-described affirmative principles, it becomes simpler to grasp and to imagine the possibilities and opportunities that are afforded.  Here are the key qualities of public government data made open:

  • Complete
  • Primary
  • Timely
  • Accessible
  • Machine-processable
  • Non-discriminatory
  • Non-proprietary
  • License-free

with reviewable compliance.

There is more to be found on the wiki, and anyone can register and add their questions and perspective to the fleshing-out of these notions.

One can splice open documents, especially the public’s documents, into this structure as well.  This puts important context around the technological issues involved in having documents in formats that everyone can use and that are freely implementable in computer software.

This has me thinking of a few other qualities that might matter in both domains, especially around durability/permanence.

You might have some thoughts about this too.  Visit the Open Government Working Group page for more.


2009-03-25

 

2009-03-25: Document Freedom Day


Yesterday, Ada Lovelace Day, I learned that today, March 25, is Document Freedom Day (via Louis Suarez-Potts).

Today’s celebration has the over-the-top theme: “global day for document liberation.”   The thesis is that

“In a world where records are increasingly kept in electronic form, Open Standards are crucial for valuable information to outlive the application in which it was initially generated. The question of Document Freedom has severe repercussions for freedom of choice, competition, markets and the sovereignty of countries and their governments.”

“The Document Freedom Day (DFD) is a global day for Document Liberation with roughly 250 active teams worldwide. It is a day of grassroots effort around the world to promote and build awareness for the relevance of Free Document Formats in particular and Open Standards in general.”

A Little Less Manifesto Please

I fancy the simpler notion of promoting “document formats that can be used by everyone and safely implemented in free software.” 

I must also caution that the existence of such a format does not assure that my computer-maintained documents will be able to survive intact beyond the availability of the specific software that I use to create and present them.  There is no causality here, as much as we would like there to be.  There is, on the face of it, a greater opportunity, but not necessarily one that I can exploit on my own.

Owning My Own Documents

Having said that, here’s what document freedom means to me:

  • There is a public format,
  • The format is safe to support in computer software without any limitation,
  • The stable preservation of the format and the consistent availability of freely-available supporting software are assured into the indefinite future.

That would satisfy me that I am truly the owner of my computer-supported documents.

It takes more to satisfy me that the choice of different platforms and products is a minor concern and there are reliable substitutes.  That would require that the level of interoperable use among (versions of) document-processing products be so high that faithful interchange of our documents and even successful roundtrip collaboration in their development and refinement are assured.

Too Slippery the Slope

I think that is worth striving for.  I don’t think we are close yet.  I don’t think any of the sloganeering and posturing is doing anything to accomplish it.  There are too many mixed agendas:

  • Ideological straitjacketing: Confusing the means with the goal and demanding more of the means than the goal requires, even to a willingness to sacrifice the goal to the means
       
  • Magical thinking: Assuming that a given step is closing the approach toward the goal without any reality check or a way of measuring achievement of the goal and of mileposts along the road
     
  • Negativity: Opposing something rather than creating something, as if the elimination of the negative will assure the imagined positive

The Public’s Documents in Public Formats

There needs to be some serious reality-based assessment and measurability.  That’s what it takes to be secure in the ownership of my documents.  That’s what it will take to be sure that those documents that are the instruments of our civil society are indeed the public’s documents, using the public’s formats.

The lingering question, one to ask on next year’s Document Freedom Day, and then the year after that, and …, is who are the stakeholders and what action will they take to substitute reality for blind flag-following?


[update 2009-03-26T01:06Z: I should simply go to Rick Jelliffe’s blog before I open up my mouth about anything to do with open formats.  If I could ever find the blankety-blank RSS feed I would be so much happier.   Meanwhile, here are some relevant words on the status quo and the sow’s ear:

  • Rick Jelliffe: Does an ‘open format’ provide the benefits it is supposed to?  2009-03-22, um, the very topic we are speaking of
     
  • Rick Jelliffe: Another leap forward for openness?  2009-03-25, a bit of an insider cat fight surrounding the more acute question around who are the stakeholders for securing public ownership of the public’s documents.  Is it really good for open formats that developers are the arbiters of what documents last?  (I’m still reeling from “To me, the phrase ‘If it ain’t broke, don’t fix it’ is seriously anti-innovation and anti-intellectual.”)
     
  • Rick Jelliffe: Master Blaster.  2009-03-20, a compilation of Peter Sefton snippets on how dismal things are when the most successful interoperability is between Microsoft Office and OpenOffice.org using (wait for it) … Office binary formats for round-tripping.  I am beginning to understand Rick’s HTML interoperability as the floor case for ODF Text and OOXML WordProcessing documents.
      
  • Rick Jelliffe: MODUS – Minimum Open Documents Using Standards.  2009-02-26, rather deeper than the other ones, out to identify a technical approach to assured interoperability/interchange in a practically-satisfying way.  The part on Verifiability I must print out and put under my pillow every night.]


2008-10-30

 

Cover Pages: W3C Multimodal Architecture and Interfaces

[update 2008-11-06 I don’t know how I failed to see that the very first sentence didn’t carry the sense I intended for it.]

A current weakness in the open-document standards arena is the poorly-specified and tacit coupling of format provisions to behavior in various document processing contexts (creation, viewing, editing/manipulation, and various “final-form” renderings and, these days, interactive performance governed by the document, whether slide-show or something more elaborate).

We’ll get to that some day, and the ways that such aspects are layered into specifications and their allowance for application innovation and conformance novelty remain to be discovered.

This Cover Pages Daily Newslink item from 2008-10-21 leads to an account of the W3C Technical Report on Multimodal Architecture and Interfaces Fifth Working Draft.  

I’m putting down an nfoWorks marker because of these intriguing passages in the Newslink:

“The main difference from the previous draft is the addition of the rules and guidelines which will allow modality experts to describe the features, capabilities and APIs for specific modality components in sufficient detail so that the components will be interoperable in implementations of the Multimodal Architecture. … The specification describes a loosely coupled architecture for multimodal user interfaces, which allows for co-resident and distributed implementations, and focuses on the role of markup and scripting, and the use of well defined interfaces between its constituents.”

I am hesitant about the following:

“This framework places very few restrictions on the individual components or on their interactions with each other, but instead focuses on providing a general means for allowing them to communicate with each other, plus basic infrastructure for application control and platform services … At runtime, the MMI architecture features loosely coupled software constituents that may be either co-resident on a device or distributed across a network. In keeping with the loosely-coupled nature of the architecture, the constituents do not share context and communicate only by exchanging events.”

There are some wise words about keeping straight the different design-time and run-time considerations.

I suspect that this is not going to bear directly on realization of the Harmony Principles, but it might provide useful conceptual underpinnings for an account of the behavioral aspects that are at least as important in document-mediated interoperability as the standard document format.


2008-10-15

 

DITA for Technical Standards Publishing


From the Cover Pages: XML Daily Newslink for 2008-10-14, there is an announcement that the OASIS Darwin Information Typing Architecture (DITA) Technical Committee has formed a new subcommittee.   The DITA for Technical Standards Subcommittee has the ambitious purpose of furthering and promoting DITA use for the creation, maintenance, and support of technical standards specifications.  The idea is to have a “common standard for the creation and publication of … technical standards specifications:” 

“The first effort will be to assess and define common requirements for the maintenance and publication of technical standards.  This will provide the common requirements for the specific capabilities that DITA should provide.  Finally, the group will create necessary enhancements to DITA standards and deliverables, including the DITA Open Toolkit with a Toolkit for Technical Specifications.”

There is more in the announcement of subcommittee formation on the (semi-official?) DITA online community site.  The official subcommittee operation is to be set up on the OASIS DITA TC page.

I’m not sure that this has any near-term benefit, but it does arouse my interest in another way.  I am finding it very difficult to wrap my head around the current and in-progress OpenDocument Format (ODF) and Office Open XML (OOXML) specifications.  I need some means of working out my understanding so that I can surface a conceptualization of the functions of either, such that their reconciliation at the Harmony Principles level can be grasped and usefully described.

DITA surfaces on my radar from time to time.  It is something I think I should know more about.  I don’t know how to apply it in the context of standard document formats, nor am I clear how it is applicable to the conceptualization and expression of document-format standards.  It does strike me that some help is needed, based on my early efforts in the analysis of ODF specifications.   (At this point, concept-mapping software might be even more useful, and I will look into that as well.)

It is time to dig deeper into DITA to see how it can support a harmonization effort with regard to office document formats and their harmonizable specification.


 

Simplifying Speech-Enabled Applications

Via the Cover Pages: XML Daily Newslink for 2008-10-14, I learn that the W3C has standards for speech-enabled/-enabling web applications.  The addition announced today is the W3C Pronunciation Lexicon Specification (PLS) Standard.  This is an accessibility as well as a convenience feature.  PLS is intended to work with Text to Speech (TTS) and VoiceXML applications. 

The PLS lexicon is an XML document and there is allowance for blended use with other namespaces.  This suggests to me that there is prospective use in interchange of office-productivity documents for various purposes.

I don’t expect that this will fit into any foreseeable level of harmonized features.  I am placing this marker because it may well feature in accessibility provisions at some point, even if accomplished via a public-profile agreement involving foreign elements.

The accessibility angle is an important one to keep an eye on for its interoperability, interchange, and preservation potential.


2008-09-07

 

Document Interoperability: The Web Lesson

"are there alternatives to google groups search for searching old USENET messages? because groups date fielded search is teh broken."

-- Richard Akerman on Twitter, 2008-08-31 

Be prepared for a dramatic shift in the reality of web-site browsing and the honoring of web-page standards.   The pending release of Microsoft Internet Explorer 8 is going to put the reality of web standards and their loose adherence in our faces.  Although Internet Explorer is indicted as the archetypical contributor to disharmony on the web, Internet Explorer 8 is going to challenge all of us to deal with the reality of our mutual contribution to the current state of affairs.

Here is a lesson, probably many lessons, for document interoperability and the way that standards for document formats evolve and harmonize, or not, over time.

The Web as Clinical Science

The movement from loosely-standard pages and their browsing to strictly-standard pages and standards-mode browsing will illustrate every aspect of the same challenge for office-productivity documents and the office suites that process them. 

Web pages are the experimental drosophilae of digital documents.  All aspects of dynamic convergence on standards, themselves evolving, and the forces of divergence, are demonstrated clearly and rapidly.  I expect it to take Internet generations for significant convergence, with no static level of standards adherence anywhere in sight.  It took us almost 20 years to get to this point on the Web; I figure it will take at least five more to dig out of it far enough to claim that there is a standards-based web in existence and in practice.  I'm optimistic, considering that HTML 5, the great stabilization, is not expected to achieve W3C Recommendation status until 2012.

No document-interoperability convergence effort is anywhere close to the promising situation of the web as Internet Explorer 8, HTML5 implementations, and other compatibility-savvy browsers roll out over the next several years.  It is useful to use that situation to calibrate how convergence and interoperability could work for document interoperability.  There are significant technical barriers.  The non-technical barriers are the most daunting.  That should be no surprise.

Versioning in Document Use

I've written on Orcmid's Lair about the IE 8.0 Disruption.  This involves changes in Internet Explorer 8.0 by which web pages are rendered in standards-mode on the assumption that pages are conformant with applicable web standards.  In the past, it was presumed that pages were loosely-standard and browsers, also loosely-standard, made a kind of best effort to present the page.  The consequences have been explained marvelously in Joel Spolsky's post on Martian Headsets.

We are similarly relying on document-format standards as a way to provide for many-to-many interchange and interoperability between different (implementations of versions of) document-format standards and different (implementations of versions of) processors of those digital documents.  That means we have a version of the loosely-standard documents with loosely-standard processing problem.  We can't be strictly standard because the standards can't (and definitely don't) have strict implementations at the moment; and there are many ways that specifications and implementations have been kept loose by design.  Accompanying that looseness by design is the simple fact of immaturity among the contending document-format standards for office applications, particularly as vehicles for interoperable applications.

For office-productivity documents as we know and love them, there are five, count 'em five "official standards." 

The "Official" Public Standards of Office Documents

For Office Open XML Format (OOXML), there is the ECMA-376 specification of December 2006.  There is also the ISO/IEC 29500:2008 Office Open XML File Formats standard once it is made available.  IS 29500 will have some substantive differences from ECMA-376.  We won't have a solid calibration of the differences until the IS 29500 specifications are available and subject to extensive review.

For the OpenDocument Format, there is the Open Document Format for Office Applications (OpenDocument) v1.0 OASIS Standard issued 1 May 2005.  There is also the ISO/IEC 26300:2006 Open Document For Office Applications (OpenDocument) v1.0 standard (also on the publicly-available listing).  IS 26300 is for the same format as the OASIS v1.0 standard, but it is on a completely-separate standards progression.  Appendix E.3 accounts for the differences of IS 26300 from the text of the May 2005 OASIS Standard.  The first page of the IS 26300:2006 document (page 5 of the PDF) identifies its source as Open Document Format for Office Applications (OpenDocument) v1.0 (Second Edition) Committee Specification 1, dated 19 July 2006, derived from document file OpenDocument-v1.0ed2-cs1.odt; this is not another OASIS Standard, however.

The second and latest OASIS Standard for ODF is Open Document Format for Office Applications (OpenDocument) v1.1 issued 2 February 2007.  This document is derived from OpenDocument v1.0 (Second Edition) Committee Specification 1, the same specification that is the source of content for ISO/IEC 26300:2006.  The changes made to arrive at ODF v1.1 from the v1.0 (Second Edition) committee specification are detailed in Appendix G.4.  There are some mildly-breaking changes from ODF v1.0 to ODF v1.1, mostly of a clarification or correction nature.  There are a few additional features that have no down-level counterparts in ODF v1.0.

A third OASIS Standard, ODF v1.2, is under development.  The current drafts, using a very-different organization from v1.1, are available as public documents of the OASIS Open Document TC. 

We can expect to see more versions of ODF and of OOXML at their various standards venues.  We'll be watching here on nfoWorks as the situation becomes even more chaotic.  Notice that this diversity ignores the variety of divergent implementations of the various specifications.
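With this many versions in play, a consumer's first question about a document is which specification it claims to follow. Here is a minimal sketch, assuming an ODF package whose content.xml root element may carry an office:version attribute. The function names and the stand-in packages are hypothetical. As far as I can tell the attribute is optional in the earlier ODF versions, so a missing value tells us nothing, and a present value is only what the producer declared, not a measure of conformance.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the ODF "office" elements and attributes.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def declared_odf_version(package):
    """Return the office:version declared on the content.xml root, or None if absent."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    return root.get(f"{{{OFFICE_NS}}}version")


def demo_package(version=None):
    """Build an in-memory stand-in package (hypothetical, for illustration only)."""
    attr = f' office:version="{version}"' if version else ""
    content = f'<office:document-content xmlns:office="{OFFICE_NS}"{attr}/>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", content)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(declared_odf_version(demo_package("1.1")))
    print(declared_odf_version(demo_package()))
```

Even this trivial check shows the looseness: a processor has to decide what to do when the declaration is missing or names a version it has never seen.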

Format Versions that Live Forever

It is possible for one document-format specification to officially supplant another, with the older specification deprecated.  That has not been done so far with any of the five-and-growing document-format specifications, any more than it has been done for most of the versions of HTML specifications that have been recommendations of the W3C (and IETF before the development track moved entirely to W3C). 

For example, the last full-up specification for HTML, the HTML 4.01 W3C Recommendation of 24 December 1999, has this to say about its immediate predecessor: "This document obsoletes previous versions of HTML 4.0, although W3C will continue to make those specifications and their DTDs available at the W3C Web site."  This was possible because HTML 4.0 was young and there were important defects that 4.01 cured.

The HTML 4.01 specification continues with the following recommendation: "W3C recommends that user agents and authors (and in particular, authoring tools) produce HTML 4.01 documents rather than HTML 4.0 documents. W3C recommends that authors produce HTML 4 documents instead of HTML 3.2 documents. For reasons of backward compatibility, W3C also recommends that tools interpreting HTML 4 continue to support HTML 3.2 [W3C Recommendation 14 January 1997] and HTML 2.0 [IETF rfc1866 November 1995 and the IETF-obsoleting rfc2854 June 2000] as well." 

The XHTML branch of specifications, originally derived from HTML 4.01, was intended as the basis for a future generation. 

Meanwhile, there has been work toward both XHTML 2 and HTML 5.0.

HTML 5.0 is currently intended to exist alongside XHTML 1.x and its newer arrangements while also absorbing XHTML 1.x to some degree (by having an XML form).  The current HTML 5.0 draft specifies legacy processing (in its HTML-syntax form) for variations of over 60 HTML DOCTYPE DTD flavors, extending back to HTML 1.0 and other variants.  The intention is to converge HTML and XHTML 1.x under a consistent HTML 5 processing model with only no-quirks, some-quirks, and quirks modes.  This is also intended to end the variation and extension of HTML (not XHTML) by capturing <!DOCTYPE HTML> for its own and having a concrete HTML syntax that is fully-divorced from both SGML and XML.  It is important to point out that HTML 5 is not going to eliminate the divergence that browser (user-agent) plug-in models, plug-in implementations and scripting systems (especially client side) bring to the mix.
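To make the three-mode idea concrete, here is a heavily simplified sketch of how a DOCTYPE might be sorted into quirks, some-quirks (called "limited-quirks" in later drafts), and no-quirks processing. The function name is hypothetical, and the handful of public identifiers below is a small subset chosen for illustration from my reading of the later HTML specification text, not the 60-plus flavors of the current draft, which may differ in detail.

```python
# Public-identifier prefixes, lower-cased for case-insensitive matching.
# This is an illustrative subset only, not the full table in any draft.
QUIRKS_PREFIXES = (
    "-//ietf//dtd html 2.0//",
    "-//w3c//dtd html 3.2 final//",
)
HTML401_LOOSE_PREFIXES = (
    "-//w3c//dtd html 4.01 frameset//",
    "-//w3c//dtd html 4.01 transitional//",
)
LIMITED_QUIRKS_PREFIXES = (
    "-//w3c//dtd xhtml 1.0 frameset//",
    "-//w3c//dtd xhtml 1.0 transitional//",
)


def rendering_mode(name, public_id=None, system_id=None):
    """Classify a DOCTYPE as 'quirks', 'limited-quirks', or 'no-quirks' (simplified)."""
    # A missing DOCTYPE, or one not named "html", gets legacy processing.
    if name is None or name.lower() != "html":
        return "quirks"
    pub = (public_id or "").lower()
    if pub.startswith(QUIRKS_PREFIXES):
        return "quirks"
    # HTML 4.01 loose DTDs depend on whether a system identifier is present.
    if pub.startswith(HTML401_LOOSE_PREFIXES):
        return "quirks" if system_id is None else "limited-quirks"
    if pub.startswith(LIMITED_QUIRKS_PREFIXES):
        return "limited-quirks"
    return "no-quirks"


if __name__ == "__main__":
    print(rendering_mode("html"))  # the bare <!DOCTYPE HTML> form
    print(rendering_mode("html", "-//W3C//DTD HTML 3.2 Final//EN"))
```

The point of the sketch is that legacy consumption is table-driven and mechanical; nothing comparable exists yet for the office-document formats.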

Document-format versions are not easily abandoned.  Even if production of a format is deprecated, consumption of the format may need to continue into the indefinite future, and certainly so long as emitters of deprecated formats have significant usage.  The W3C progression of HTML is at a point where that is fully-recognized and being honored in reaching toward an HTML 5 plateau sometime in the next decade.

Considering this promising stabilization, when would I manage to change all of my web sites and blogs to clean HTML 5 pages?  Not until I know that only a small fraction of visits to those sites come from Internet Explorer versions prior to IE8 (or maybe IE9) and from other browsers lacking full-up standards-mode processing.  Fortunately, the HTML 5 specification-effort promises to show me exactly how to do that in a mechanical way.  I am looking forward to automated assistance.  In my case, I'll also have the benefit of my IE 8.0 mitigation effort.  Other web sites may require other approaches, and user browser choice will involve important trade-offs for some time. 

I am surprised by the number of people who operate multiple browsers.  Although I operate multiple products for office applications these days, that's mostly to explore their interoperable use, not to ensure ability to interchange documents (well, not until I joined OASIS and the ODF TC).  I've been a serial adopter of Internet Explorer versions since IE 2.0.  As a typical late-adopter, I may finally branch out now just to have a better calibration of the migration to standards-based sites and browsers for them.

This is an important lesson for the management of the expanding variety of specifications of formats for office-application documents, formats of which HTML packagings are sometimes one of the flavors.

Reconciling office-application document-format versions does not promise to be so easy as the current effort to stabilize HTML for the web.

The Looseness of Document Specifications

Of course, OOXML and ODF are not close dialects off a single family tree, as HTML variants might be treated (and HTML 5 demonstrates, if successful).  In addition, the current specifications are not for same-conformance, interchangeable-everywhere documents:

  • There are weak conformance requirements.  It is not necessary to implement any particular amount of the specified format: OOXML or ODF.  This is by design.  I don't expect that to change.  There is also no way to indicate how much or how little is accepted and/or produced.  Well, you could look to see what software produced the document, using ODF as our example:

<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    office:version="1.2">
  <office:meta>
    <meta:generator>
        OpenOffice.org/3.0_Beta$Win32 OpenOffice.org_project/300m3$Build-9328
    </meta:generator>
  </office:meta>
</office:document-meta>

This strikes me as even less appealing than the challenge of sites adjusting for browsers and browsers adjusting to HTML DOCTYPE declarations (and their absence).
   
It is not encouraging that the office:version attribute and <meta:generator> element are both optional.  It is unfortunate that the office:version attribute is generally uninformative about the processing requirements for the document file in hand, serving merely as an automatic claim of one specification the document conforms to.  The document is also likely to conform to earlier versions and probably also later versions, although it is unclear how we can determine that easily for a given document representation.
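
To make the point concrete, here is a minimal sketch of what "looking to see what software produced the document" amounts to for the sample above.  It assumes the meta.xml text has already been extracted from the package, and the function name describe_producer is hypothetical.  Because both items are optional, either result can be None, and a consumer has to cope with that.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def describe_producer(meta_xml):
    """Return (version, generator) from the text of an ODF meta.xml.

    Hypothetical helper. Either item is None when the document omits it,
    which the specification permits.
    """
    root = ET.fromstring(meta_xml)
    # ElementTree keys namespaced names as "{namespace-uri}local-name".
    version = root.get("{%s}version" % OFFICE_NS)
    generator = root.findtext("{%s}meta/{%s}generator" % (OFFICE_NS, META_NS))
    if generator is not None:
        # Treat an empty or whitespace-only generator as absent.
        generator = generator.strip() or None
    return version, generator


SAMPLE = """<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    office:version="1.2">
  <office:meta>
    <meta:generator>
        OpenOffice.org/3.0_Beta$Win32 OpenOffice.org_project/300m3$Build-9328
    </meta:generator>
  </office:meta>
</office:document-meta>"""

version, generator = describe_producer(SAMPLE)
```

Even when both values are present, all a consumer learns is a claimed specification version and a product string.  Deciding what to do differently for a given generator is exactly the kind of out-of-band knowledge the format itself does not supply.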

  • Arbitrary "foreign" elements are allowed.  I'm not clear how IS 29500 for OOXML will allow for this kind of thing, but the ODF specifications are justly-notorious for this provision (ODF 1.1, section 1.5):
      
    "Documents that conform to the OpenDocument specification may contain elements and attributes not specified within the OpenDocument schema. Such elements and attributes must not be part of a namespace that is defined within this specification and are called foreign elements and attributes.
      
    "Conforming applications either shall read documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed before validation takes place, or shall write documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed before validation takes place.
      
    "Conforming applications that read and write documents may preserve foreign elements and attributes."
      

    There are some further wrinkles and this proviso:
      
    "Foreign elements may have an office:process-content attribute attached that has the value true or false. If the attribute's value is true, or if the attribute does not exist, the element's content should be processed by conforming applications. Otherwise conforming applications should not process the element's content, but may only preserve its content. If the element's content should be processed, the document itself shall be valid against the OpenDocument schema if the unknown element is replaced with its content only."
      

    As a developer, I love gimmicks like this.  But, basically, this only works with processors that re-encounter document files that they themselves produced.  Anything more coherent requires that the implementers of different processors form some sort of out-of-band, separate-from-the-standard interoperability agreement on particular foreign elements and handling of office:process-content attributes.  Users, confident that their software is "standard," will have frustrating and inexplicable interchange experiences (unless the usual thing is done and everyone agrees to lock in on the same software [version], surprise, surprise).

    OOXML has a versioning scheme that might provide controlled extensions that degrade usefully when processed by implementations of down-level specification versions.  It is unclear at this point whether this is just a more complicated way to end up with the same interoperability problems.

  • Some features require foreign content.   Both OOXML and ODF have features where content is represented by a binary-data part elsewhere in the package.  There is little (OOXML) or no (ODF) indication of what the format of the binary element is and what MIME types are allowed for such document components.  All use of those features and any interchange agreements about them are beyond the current provisions of the relevant document-format standards.
      
    There are other places where implementation-defined values are expected and are expected to be preserved by other implementations.

  • Some values and default selections are implementation-specific.  I was mining in the ODF specification the other day.  I did not expect to find attributes having text on these patterns:
       
    "The value of this attribute is implementation [or application] specific."
      
    "If this attribute is not present, the application might or might not display [whatever]."

    These are relatively minor considering the amount of variability from the other conditions already mentioned.  What's curious about these is the elevation of particular implementation-specific features as specification-favored. In the case of implementation-specific attribute values, there is also the interesting problem of a processor determining whether such a value is intended to have its implementation-specific interpretation or not.  It appears that the related features will only be useful under tightly-restricted interchange conditions.
      
    I will not be surprised to find similar looseness in the OOXML specification, IS 29500.
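
The foreign-element provision quoted above (ODF 1.1, section 1.5) can be sketched as a pre-validation pass.  This is only an illustration under stated assumptions: the caller supplies the set of namespace URIs to treat as defined by the specification, attributes in no namespace are left alone, and the names strip_foreign, namespace_of, and append_text are hypothetical.  A foreign element is replaced by its content unless it carries office:process-content="false", in which case it is dropped along with its content.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
PROCESS_CONTENT = "{%s}process-content" % OFFICE_NS


def namespace_of(name):
    """Return the namespace URI of an ElementTree tag or attribute name."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def append_text(parent, index, text):
    """Attach text just before position index among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def strip_foreign(element, known):
    """Remove foreign attributes and elements beneath element, in place.

    known is the caller's set of namespace URIs regarded as defined by
    the specification (an assumption of this sketch).
    """
    for name in list(element.attrib):
        if namespace_of(name) and namespace_of(name) not in known:
            del element.attrib[name]
    index = 0
    while index < len(element):
        child = element[index]
        if namespace_of(child.tag) in known:
            strip_foreign(child, known)
            index += 1
            continue
        # Foreign element: keep its content unless process-content="false".
        keep = child.get(PROCESS_CONTENT, "true") == "true"
        element.remove(child)
        if keep:
            append_text(element, index, child.text)
            for offset, grandchild in enumerate(list(child)):
                element.insert(index + offset, grandchild)
            append_text(element, index + len(child), child.tail)
        else:
            append_text(element, index, child.tail)
        # Do not advance: re-examine whatever now occupies this position.
```

Note what the sketch cannot do: it says nothing about what a foreign element meant to the producer.  Stripping makes the document validate; it does not make two processors agree about the material that was stripped.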

Prospects for Interoperable Convergence

We already have before us difficulties with interoperable convergence of individual progression of a single standard and its variety of implementation.  This makes the prospect of harmonization between different standard formats rather murky.

Desktop office-application software has more promise with regard to application of Postel's Law, to be liberal in what is accepted and conservative in what is produced.  Unfortunately, the current specifications do not require conservative, interoperable implementations; the current specifications are arguably antagonistic to such an achievement.

I suspect that this is an unintended consequence mixed with some inattention to what it takes for interoperability to be achievable. 

It remains to be seen how our experience and understanding mature.   We are at the beginning, not the finish.  The journey may seem endless.


The process of IE 8.0 mitigation and preparation for a standards-mode approach to web browsing impacts this site and blog as well as every other web page I have ever posted (somewhere over 120MB worth and climbing).

I'm not going to say anything more about IE 8.0 mitigation and HTML harmonization here.  The overall effort will be tracked in that category of Professor von Clueless posts; that's the place to follow along.  The lesson for document interoperability is something that is definitely appropriate for Pursuing Harmony; there'll be much more to say about that.


2008-08-19

 

Agreement on Document Rendering

One problem for harmonization of document-format implementations in the case of ODF and OOXML is the degree to which those specifications may provide inadequate specification of how documents are to be presented in order for implementations to be adequately interoperable.  This situation arises between implementations of the same format as well as between different document formats.

Along with establishing clarity on how much agreement in presentation is required, there is the small matter of being able to somehow confirm that an application achieves whatever that level of conformance and interoperability is.

As I suggested in "Interoperable ODF: Finding Ground Truth," there are many difficulties to be conquered in advancing from the current state of affairs to where there is reliable determination that implementations are substitutable in a particular interoperability setting. 

We don't have good ways to even talk about the multiple, interlocking problems that lurk beneath the simple desire to have interchange of documents in standard formats "just work."

It would be helpful, for starters, to at least have a way to describe what a particular document-processing system does in rendering documents that it accepts.  A checklist on the handling of particular features of an electronic document is also useful in determining conformance and interoperability guidance and, perhaps, eventual mechanical verification criteria.

The XML Daily Newslink for 2008-08-18 reports on a contribution that may be useful in this regard, the "First Edition of ISO/IEC 24754: Minimum Requirements for Specifying Document Rendering Systems."  From the scope:

"This International Standard provides an abstract list of the features that a document rendering system may have, thus providing a frame of reference, against which the user and implementor can compare the  features of a document rendering system. However, this International Standard does not direct how each document rendering system should behave.

"This International Standard provides the minimum requirements to specify the features that a document rendering system which transforms formatting objects to rendering output. It may be used as a frame of reference, against which the user, implementer, or software agent may compare the features of a document rendering system.  According to these requirements, the user may express what he or she expects of a document rendering system, the implementer may describe the functionality and capability of the document rendering system that he or she implements, and the software agent may negotiate a minimum set of functionality and capability that are shared across different document rendering system implementations."

The Final Committee Draft (omitting the example in the informative appendix) is available as a 7-page, 74 kB PDF file.

Beside the announcement, here's what attracted my attention:

  • The list of features is abstract.  The descriptions of how individual features are handled would normally be in prose, with possible reference to standards applicable to the feature (e.g., font substitution).
      
  • The document assumes an abstract processing model related to that for SGML and XML.  The presumption is that rendering is specified separately from logical document format and content.  It can, of course, be adapted to the rendering provisions of processors for electronic office documents even though rendering conditions are embedded in the specification for the document format and its elements.  So it is definitely minimal and something to start from.
      
  • This work was initiated in 2005.  The Final Committee Draft download file is dated 2007-10-29; its electronic document was last edited on 2007-01-23.  The standard, ISO/IEC 24754:2008 was published on 2008-08-15.  It costs 96 CHF (about $100 US) and is not (yet?) on the list of publicly-available (i.e., free) standards, last updated on 2008-08-08.
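
The negotiation mentioned in the scope statement can be pictured, very roughly, as finding the features that every participating rendering system declares.  The sketch below is only that picture: the feature names are hypothetical, ISO/IEC 24754 is not assumed to define any such identifiers, and the function name shared_features is made up for illustration.

```python
def shared_features(*declared):
    """Return the features declared by every rendering system."""
    if not declared:
        return frozenset()
    sets = [frozenset(features) for features in declared]
    return frozenset.intersection(*sets)


# Hypothetical feature names, for illustration only.
system_a = {"font-substitution", "hyphenation", "footnotes"}
system_b = {"font-substitution", "footnotes", "vertical-text"}

common = shared_features(system_a, system_b)
```

Since the standard expects feature handling to be described in prose, a real agent would need far more than name matching to decide that two systems treat a feature compatibly.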


2008-08-18

 

Interoperable ODF: Finding Ground Truth

Jesper Lund Stocholm has found his files from the Microsoft Document Interoperability Initiative ODF Workshop.  His post, "DII ODF Workshop - the good stuff", shares the nitty-gritty on-the-ground experience of transferring ODF documents from OpenOffice.org to Microsoft's pre-beta Office 2007 SP2 implementation and back again.  There's a download of eleven test files, each in two forms, along with PDFs of how they render.  There's an OpenOffice.org version of each document.  Then there's the Microsoft Office 2007 SP2 pre-beta ODF saving of the same document.  This is enough to discern how the two applications handle application-specific features from other applications and express application-specific features of their own.

There are some great lessons becoming available with regard to interoperable use of document formats.  Here's what I see in terms of the Microsoft Office and OpenOffice.org implementations of ODF:

  • Being standard is not the same as being interoperable. 
    Lund Stocholm points out, "The result of the validation is that all files generated by Microsoft Office 2007 SP2 are valid ODF 1.1-files."  The validation is essentially syntactical and that is not going to deal with all of the tolerated implementation variability, semantic bugs, and need for out-of-band agreements where the specification is (purposely and perhaps valuably) left wishy-washy.
       
  • There's a tremendous amount of binary information packaged in OO.o 2.4 and Office 2007 ODF document implementations.
    This information is carried in outside-of-ODF namespaces and MIME types for which there is no mutual agreement.  This can be reconciled among the different implementations, and we might expect more harmony before Office 2007 SP2 ships, assuming there are no intellectual-property difficulties not covered by existing non-assertion covenants.  This is a tricky area with socio-political and competition-law ramifications (illustrated by how no one seems to be bothered by the amount of binary material used in OO.o's implementation of ODF).
      
  • ODF-specification versioning is going to bother us for years, if not forever.
    Version churn is going to be a serious problem until those able to insist on demonstrable interoperability among applications compel some rational process for dealing with specification and implementation incompatibilities and defects.  The stakes are now raised for achieving useful up- and down-level accommodation of specification and (deviating but widespread) implementation versions.  Although I can see no way the ODF spreadsheet-formula problem could have been avoided, in particular, we must face two painful situations:
    • XML namespaces for ODF are not dealt with as contracted interfaces with explicit discrimination of additions and changes between versions of the specifications.
    • Requiring private agreement on spreadsheet formulas through at least ODF 1.1 is going to force dealing with at least three versions in the future, something like
      - a Microsoft Excel formula namespace (better: an ECMA-376 or IS 29500 one),
        xmlns:msoxl="http://schemas.microsoft.com/office/excel/formula"
      - an OpenOffice.org formula namespace,
        xmlns:oooc="http://openoffice.org/2004/calc"
      - the default ODF OpenFormula namespace when finally introduced into ODF
      - versions of the above with their individual defects and incompatible implementations
        
  • It's the application [stupid?]
    People don't deal with formats and the nuances of format versions, allowed options, and private agreements.  People deal with software and the quality (and fidelity) of the electronic document that the software provides.  Expecting individual users to be self-consciously attentive to limitations on conformance and interoperability is even more hopeless than demanding meticulous adherence to security policies and practices in ordinary office work.  What people do want is for their interoperability case (however articulated) to just work.  In reality, even "Save as ..." is asking too much.
    • The first part of this lesson is going to involve recognition of the degree to which end-users are going to address interoperability by choosing specific software and believing interoperability is achieved, the ever-popular solution.
    • The second part of this lesson is recognition of the distance between the current state and one with broader interoperability and confident substitution of alternative software choices.  The differences among major ODF implementations will reveal how easy it is to lose interoperability while conforming to the current specifications.
    • Ultimately, we may have to accept that we are unwilling to pay the price for significant interoperability assurance except under extraordinary circumstances.  The "cost of interoperability" debate is ahead of us.
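
The formula-namespace situation in the versioning item above can be made concrete.  The sketch assumes the convention in which the value of a table:formula attribute begins with a namespace prefix followed by the formula proper (for example, oooc:=SUM([.A1:.A2])).  The two namespace URIs are the ones listed above; the function name formula_dialect and the dialect labels are hypothetical, and the prefix test is a simplification of XML name rules.

```python
FORMULA_DIALECTS = {
    "http://schemas.microsoft.com/office/excel/formula": "Microsoft Excel",
    "http://openoffice.org/2004/calc": "OpenOffice.org Calc",
}


def formula_dialect(formula, prefixes):
    """Split a table:formula value into (dialect, expression).

    prefixes maps the namespace prefixes in scope to namespace URIs.
    dialect is None when the formula carries no prefix, or when the
    prefix names a namespace this processor has no agreement about.
    """
    prefix, separator, expression = formula.partition(":=")
    # Simplification: accept only plain alphanumeric prefixes.
    if not separator or not prefix.isalnum():
        return None, formula
    uri = prefixes.get(prefix)
    return FORMULA_DIALECTS.get(uri), "=" + expression


in_scope = {
    "oooc": "http://openoffice.org/2004/calc",
    "msoxl": "http://schemas.microsoft.com/office/excel/formula",
    "x": "urn:example:other",
}
calc = formula_dialect("oooc:=SUM([.A1:.A2])", in_scope)
excel = formula_dialect("msoxl:=SUM(A1:A2)", in_scope)
unknown = formula_dialect("x:=1+1", in_scope)
```

Recognizing the dialect is the easy part.  Each entry in the table stands for a separate formula grammar and function library that a consumer must implement, version by version, defects included.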

I don't foresee the Harmony Principles alleviating this situation in any way.  At best, I expect them to help us appreciate the cost of interoperability and its improvement over time.


2008-08-12

 

Ramping Up

This is an initial post for ramping up the Blogger creation of Pursuing Harmony. The blog will replace the manually-created diary that was first used with nfoWorks. This initial page is created to cause population of the blog, its archives, and the RSS feed.

The blog starts out using Jason Sutter's "Sand Dollars" format, one of the supplied Blogger templates. I will customize that to achieve my preferred style once operation of the blog is confirmed.

Customization happens after blog editing and posting is confirmed to be functioning.

My first problem was attempting to use SFTP. I reverted back to plain-old FTP to make sure the FTP log-in could succeed. This wasn't working the same as previously, so I tried a different path name.  The problem may have been a lag in set-up of the account that I created for Blogger to use in posting to the web site.  After one false-start, it all started working.

1. Update 2008-08-12T21:55Z After succeeding with this initial post, I decided not to use the Blogger NavBar.  If I want to add a search panel for searching the blog and/or the site, I will do that later.

On getting Windows Live Writer working with the site, I attempted to establish an upload directory for images included in the blog.  Live Writer reported that it could not confirm the URL.  I will make sure the new directory is established and give it a construction structure.  I should then be able to have Live Writer use it.

I am also going to move the archive directory out of the top-level blog directory, so that backup and maintenance is easier. 

Live Writer reminded me to add tags to this post, and I will do that now too.  I see that the categories work differently and are independent of tags.  I must not be so careless in choosing categories.

2. Update 2008-08-12T23:34Z The archive is moved and the image directory was pre-populated with an index.htm so that Live Writer's concern for the folder was removed.  It is now time to demonstrate that picture uploading works:

3. Update 2008-08-13T00:03Z After I figured out what time it is here (for adjusting to GMT in these little update items), I also managed to correct the URL for where the images are so that, not only are they uploaded, they are also correctly-reachable from the blog post.

My next effort is to make sure that I have a back-up procedure for mirroring the blog content on my development machine along with the other parts of nfoWorks.

After that has concluded, I will need to start working on the template.

[Image: Simon's American cousin, the romantic Edwin]

4. Update 2008-08-13T00:12Z Well, the Neighborhood Rose images are on the blog site, but Live Writer fails to notice that I corrected the URL in my FTP settings.  So I am linking in the above picture manually.  I will introduce an additional image so that it can upload with the new settings.

Now I can introduce the backup procedure and then start adjusting the template.

5. Update 2008-08-13T00:53Z One concern I had is the automatic creation of subfolders to hold images from separate posts.  I have verified that no subfolder index is available on the server and there is no way to explore the images but via image elements on the blog pages.  That is satisfactory.

6. Update 2008-08-13T20:25Z I have carried out one manual backup.  Before taking the customization to the next level, I have made a couple of simple touch-ups. 

  • The title is changed from "Pursuing Harmony" to "nfoWorks: Pursuing Harmony."
      
  • The little e-mail option down near the comments and links line after each blog is removed.  The e-mail content is so ugly, devoid of formatting and any images, that I prefer people use other means off of the RSS feed or by using their browser to clip and forward pages.

The next step is to construct the safety net and fire-drill procedure for the blog.  It has been some time since a blog has been corrupted, but that is no reason to start operating without a net.


 
Construction Structure (Hard Hat Area)
Creative Commons License You are navigating nfoWorks.
This work is licensed under a
Creative Commons Attribution 2.5 License.

template created 2008-08-13-18:06 -0700 (pdt)
$$Author: Orcmid $
$$Date: 13-11-11 19:13 $
$$Revision: 5 $