tools for document interoperability
.xlsx, and ODF .ods formats. Using Office 2003 and OpenOffice.org 2.4, the OOXML and ODF formats took longer to load than the native binary formats of the corresponding applications. Oddly, even though Office 2003 uses a compatibility pack to load the OOXML version of the test spreadsheet, it loads that spreadsheet (which is larger than the corresponding ODF on disk) faster than OpenOffice.org loads the ODF version.
But the horror story was the loading time of the ODF version into Excel 2003 via the ODF Translator plug-in. This took almost 10 minutes in contrast to the 14 seconds with OpenOffice.org.
Although this may tell us far more about the implementation of the ODF Translator than anything else, Weir argues that it also substantiates an inherent limitation of translators relative to filters (which go to a useful intermediate memory structure rather than completely into another recorded format).
This is a useful experiment, and we need to understand it better, creating similar performance tests that apply to nfoWorks approaches.
The test files were obtained from a 2005 George Ou article (cache).
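The comparison invites simple, repeatable measurement. As a hedged sketch (not the methodology used in the cited tests), both .xlsx and .ods are Zip packages whose XML parts can be timed directly with standard-library tools; the function name and the file/member names are placeholders of my own:

```python
import time
import xml.etree.ElementTree as ET
import zipfile

def time_parse(path, member):
    """Wall-clock seconds to extract and parse one XML part of a
    Zip-based spreadsheet package (.ods or .xlsx)."""
    start = time.perf_counter()
    with zipfile.ZipFile(path) as pkg:
        data = pkg.read(member)  # decompress the package item
    ET.fromstring(data)          # parse it as XML
    return time.perf_counter() - start

# An .ods keeps its content in "content.xml"; an .xlsx spreads it
# across parts such as "xl/worksheets/sheet1.xml".
# print(time_parse("test.ods", "content.xml"))
```

Of course, real application load times include layout, formula recalculation, and much else; this only isolates the extract-and-parse component that the package formats themselves impose.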
This sort of detail should be eye-opening for those who think strictly in terms of left-to-right writing of texts based on a Roman alphabet.
As an interoperability concern, I find conformance of layout to be particularly daunting and something that it would be great to be able to ignore in terms of harmonization of document formats. That is, of course, not possible if I am to take the Harmony Principles seriously. There is an impact on the applications and on how users see the proper layout of what they and others have written. I remain daunted.
This is exactly the level that someone wanting to make section-level ToCs needs to understand to accomplish this in a document. I have wanted to be able to do exactly this, so the mini-tutorial caught my eye at once. It also illustrates a number of considerations around cross-product/format document interoperability.
First, there is no discussion of how this ends up being expressed in OOXML. This is not, of course, the user's principal concern.
Second, there is no discussion of how this might be handled in an interchange and collaboration situation (even when the collaborators are all using the same software product [version]).
Third, this particular magic is accomplished by a combination of particular, independent features. It is easy to see how the developers of another product might not have done what is necessary to anticipate this case, even when also accepting OOXML.
Fourth, the translation to a different document format (i.e., ODF) of this particular document might only be possible by resolving model incompatibilities. That is, what is being done (and visible only through the OOXML being interpreted) needs to be abstracted out enough so that the same effect in a different model is achieved when it is not achievable by some naive one-for-one feature translation. This is the kind of thing that involves global analysis, inference/backtracking techniques, and recognition of particular use cases that we tend to expect of expert systems using various AI-programming techniques.
Fifth, the prospect of accomplishing the kind of transformation described in the fourth item in a round trip back to the original format is nearly inconceivable.
Finally, I don't expect the Harmony Principles to have this kind of reach in any near-term future. At the same time, I wonder how easy it will be to detect this sort of thing in a harmonizing document processor, or whether we will be left with ham-handed yet still complex rules (e.g., no multiple ToCs in documents) that operate above the level of individual feature occurrences -- there must be some contextual knowledge of other aspects of the document in order to recognize a disharmony in a benign-appearing local element.
So, here's already more food for thought than I thought this interesting example would provide.
I hadn't realized that Daisy involves its own XML format and that documents need to be translated to the Daisy XML. This raises another interesting harmonization case and I will have to look into it more deeply.
There is also Daisy translation for ODF and that will be an interesting basis for comparison and broader understanding of how all of this is intended to work.
I have been operating under the obsession that the backlog should be rolled forward from its oldest to the newest, so that the accession numbers that I use on files (see the cache catalog) will be chronological.
The "duhhh" moment is reminding myself that the whole principle behind using accession numbers is the ease of claiming current material and then filling in (accessioning, if you will) the backlog as time permits. The idea is to close off the backlog so that more is not created, and the backlog is cleaned up in any order that works while staying current with the latest.
So, slap on the head, and I will now get over that and start with current material, filling in the backlog at my leisure. The accession numbers don't have to be chronological with the date of the source material; they are chronological with the sequence of accession, and that can be out of time sequence to overcome an obsessive blockage (inappropriate perfectionism, perhaps).
I intend to switch from Hard Hat Beginnings to Hard Hat Era once I eliminate the backlog and have more useful content in the notes compilations. So we are extending into May. ...
There are some glitches in Spaz, but I like the ease with which I was able to install and set it up. I also like the look of the application. This makes AIR another consideration for test applications, demonstrations, and nfoWare fixtures along with standalone Silverlight.
One comment on the post asks for a lower-level implementation that ports to Linux, as a libODF.
It is interesting that LGPL v3 is claimed as the license. I could not find a license statement in a quick look at the JavaDoc download, nor is there a license file included in the runtime (.jar) package. There is an XML comment in an identity.xsl resource file that is from OpenOffice.org; it mentions LGPL v3 and refers to an absent LICENSE file.
Meanwhile, the availability of this model and a Java API for it is extremely valuable as a basis for exploring the prospects of a harmonized DOM.
The debate strikes me as a demonstration of the kind of back-and-forth that occurs in solidifying our understanding of what a standard specification requires and how implementers can disagree until interoperability considerations require some mutual accommodation and community-wide clarification of what the requirement is. A specification may also be revised to eliminate the particular misunderstanding, however it is resolved.
There will be further intense scrutiny of DIS 26300 (ODF) and DIS 29500 (OOXML) separately and in terms of their harmonization. Meanwhile, setting aside the bickering and indignation that is expressed in the exchange, the blog entries and their comments are worthy of review for how issues of this kind are sorted out or at least taken to a point where the dispute is abandoned for now. This is also a demonstration of how important public conversations and analyses are to the strengthening of understanding and ultimately the specifications:
- Rob Weir: Achieving the impossible (cache). 2008-05-07 post demonstrating executions of two schema-validation programs using the ODF Schema and the ODF Specification. This seems to be the end of the discussion at this point.
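For readers who want to try this sort of check themselves, here is a minimal sketch of validating an XML document against a RELAX NG schema (the schema language ODF uses), built on the third-party lxml library. The helper name and file paths are my own illustrations, not taken from Weir's demonstration:

```python
from lxml import etree

def relaxng_errors(xml_path, rng_path):
    """Validate a document against a RELAX NG schema; return a list of
    error strings (empty if the document is valid)."""
    schema = etree.RelaxNG(etree.parse(rng_path))
    doc = etree.parse(xml_path)
    if schema.validate(doc):
        return []
    return [str(err) for err in schema.error_log]

# e.g. relaxng_errors("content.xml", "OpenDocument-schema-v1.0.rng")
# (schema file name is illustrative; obtain it from the OASIS TC site)
```

Different validators can disagree at the edges, which is exactly what the debate above was about; a second opinion from another RELAX NG implementation (such as Jing) is cheap insurance.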
This mechanism and its support by "dark mirrors" would be a valid alternative to the private embargoed collections that I include as part of the caching of cited material here.
This article explains the motivation, the mechanism that is used, and the relationship to other approaches that are in use.
I'm not changing the cache approach for nfoWorks just yet (one motivation being the preservation of an off-line copy that I can always access). It is useful to have WebCite for explaining the motivation for the caching of material and also for the embargo of that material for which redistribution has not been explicitly granted.
There is also a backlog of material since 2008-04-17 that merits some diary entries. I am starting that today.
This is a way to do some rapid tooling of document fixtures and tests where the .NET or Mono runtime is usable. It would appear that the application essentially runs an embedded server that hosts Silverlight locally. Still, that's potentially useful. It should even run Popfly applications, methinks.
I mention this to put down a marker for prototyping and demonstration prospects. If I can minimize GUI development as part of establishing nfoWorks libraries and fixtures, I certainly will.
The GullFOSS blog provides regular accounts of development activity for OpenOffice.org releases. There are periodic "development at a glance (cache)" posts that provide a snapshot of OO.o development activities, including those related to OpenDocument format features and planned synchronization with (anticipated/pending/committee-approved) updates and additions to the ODF specification. This is where I go to learn that there is something new to find on the OpenDocument TC site.
Michael Brauer (cache), currently the co-chair of the OASIS OpenDocument TC (cache), contributes here.
Mathias Bauer, another OO.o developer, has just posted a great "ODF Enhancements for OpenOffice.org (cache)" article that describes how ODF changes make their way into OO.o, with an interesting guide to how public suggestions make their way into the ODF specification.
Robert Weir (cache), the other OASIS OpenDocument TC co-chair, follows up on Mathias Bauer's post with further guidance on "Suggesting ODF Enhancements (cache)." There is a careful explanation of the measures taken to assure that suggestions are usable by OASIS and don't create intellectual-property problems for the specification.
The information on ODF discussions and contributions is valuable to have before contributing and exploring the available ODF discussion lists.
The idea is that so long as attribution requirements are satisfied there are no copyright-based constraints on the use, re-use, and derivative use of nfoWorks deliverables. These licenses are compatible with the GPL, the GNU General Public License (but not vice versa). These licenses are compatible with the creation of proprietary, closed-source software (but not vice versa).
Attribution is important as part of the provenance and accountability that is expected as part of responsibly-built software that incorporates or derives from nfoWorks deliverables. It is important to identify the dependency so that any announcements of defects and security flaws in any version of nfoWorks deliverables can be checked against the dependencies in works that incorporate those deliverables. It should be a simple matter to always know what dependencies there are on particular versions of other software, and to demonstrate that by making the dependencies known in a way that is available for inspection.
The same conditions are honored in any incorporation of other software in nfoWorks deliverables.
Because the GPL is not compatible with the BSD Template license and its brethren, examination of GPL'd source code is avoided. Although there is no aversion to relying on GPL'd utilities and tools, GPL'd source code is not touched except for the unlikely case of submitting defect reports and repair suggestions back to the authors. Proprietary, closed-source programs, even if used in the nfoWorks laboratory, are not redistributed, and only freely-available software is required for making use of nfoWorks source code.
Finally, these license considerations apply to copyright on software. In the event that a patent covenant is known to apply and might be unavailable for a different use of the code, precautionary notices will be attached to the software and incorporated in the source code.
This policy has been expressed in various forms from time to time. It seems like a good idea to clear the air and summarize the considerations in this one place. -- dh
There is a tension between the desire for rapid development, easy experimentation, and having higher-level ways to use components versus dealing with performance, code footprint, and platform portability while traveling light. I don't mind throw-away demonstrations and exercises, and these are useful for others to see how to start getting their arms around a technology, too. I do mind having to do big resets and do-overs that lose the benefits of progressive improvements and knowledge-building. I don't mean to exclude refactoring born from experience, but refactoring is on code that is worthy of keeping. The tension I feel around saying how it will go in advance will probably dissipate when I get into the work, so I will now wait to see how I can shape this in practice.
I think support for rapid use, and the use of higher-level development stacks (Java, .NET, XML transformation tools, etc.) is valuable. I'd even consider EcmaScript and languages like Python in this picture. These are great for demonstration, rapid trials, tests, and even reference implementations.
At the same time, I envision progressive refinement of lighter-weight, lower-level solutions that probably involve C, C++, and binary COM interfaces, the latter as a way to provide contracted interfaces that are easily coordinated properly beneath high-level wrappers, whether Java, .NET, or something else (Gnome, for all I know).
This reminds me of a conversation back in the '70s on a project where we were inventing "middleware" for ourselves. We started thinking of segregation into underware, middleware, and outerware. I want to support all three, possibly with different tool sets, and my attention is mostly on the first two. It is entirely understandable that users of nfoWorks results will be working mainly in the third category. -- dh
I suspect this will not be so worrisome when I sit down and compile sources and resources, so that may be the next nfoWare Note folio that I initiate.
There seem to be birth pangs of some sort, and I am not clear why there's not much discussion. It may be the setting, it could be attitudes that people have about Microsoft, and it could just be that people who are seriously interested in interoperability (at least two of us) haven't found their way here. There are other possibilities.
I would like to see these forums thrive and be lively, but it may depend on who is willing to speak up and find value in forum-based conversations. -- dh
Today, there are new blog posts on recent events and the current state of nfoWorks. If you have comments about anything you see here, those posts are a good place to leave them:
OOXML + ODF: ISO Steps In, Orcmid's Lair, 2008-04-11 (cache)
nfoWorks: What Are those Harmony Principles, Again? Professor von Clueless in the Blunder Dome, 2008-04-11 (cache)
1. The main SC34 web site (hosted in Japan by the SC34 Secretariat)
2. The page for accessing public SC34 documents (rewards exploration)
3. A reminder that TrueType is connected with the ISO/IEC Open Type standard, maintained in SC34 (and relevant for nfoWorks) with a hidden reminder that getting Asian scripts right is probably one of the best demonstrations of harmonization going.
4. A useful sketch of SC34's interests and responsibilities
5. Another reminder of my own armchair critic status, something I am working to alter
6. A discussion of the criticality of accessibility considerations and the resources that apply in the work of SC34; a topic that it will be essential to address with regard to harmonization
7. An injunction to become involved and where to do that (OASIS, W3C, Ecma TC45, the national mirror of SC34 in your neighborhood, etc.)
8. Links to the DIN NIA-34 update on the harmonization investigation (PDF file, cache), great work that nfoWorks should align with
9. An interesting side comment about the use of topic maps to present ODF-OOXML mappings (although DIN is focused on translations, not mappings, because of a number of issues that translation surfaces, including round-trip degradation in collaboration scenarios)
10. Another side comment on how the concern for synchronizing ECMA versions and SC34 versions of OOXML might be extended to the case of OASIS and ODF as well.
SC34 proposes to create ongoing activities to carry out its responsibilities:
1. IS 29500 (OOXML) maintenance
2. IS 26300 (ODF) maintenance (pending OASIS agreement)
3. Harmonization (with a proposed work-item expected from the DIN NIA activity)
To start things off, two ad hoc working groups have been created.
The first ad hoc working group will propose how IS 29500 maintenance should proceed, producing a proposal by 2008-09-01, one month prior to the next SC34 meeting. This ad hoc group is chaired by Alex Brown who will lead a two-day meeting in London this July. Participation is from SC34 member bodies and I take it that ECMA TC45 members are invited to chime in.
The second ad hoc working group is being created to capture technical comments on IS 29500 and make sure existing analysis is not lost. Within 90 days (by July 2) there will be a mechanism in operation "to compile a list of comments on ISO/IEC 29500 received from NBs, liaisons, and the general public" and then to "publish the on-going list as an open document on the SC 34 website."
In the resolutions from the meeting (cache), I note that the final text of DIS 29500 has already been created. SC 34 requests distribution to its members no later than May 1. I don't know what the delay will be before publication as IS 29500:2008 happens, and I'll beg a copy of the final DIS 29500 before that just to make sure I don't step into some element of harmonization that is impacted by BRM-approved changes (especially the various conformance statements that are new in the final text). Also, to make any contributions to identification of defects, it is important to reference the most-authoritative available documents.
Here's what it looks like for intercepting DIS/IS 29500 activity:
1. Usable final text available in May for provisional use (if it can be obtained) until official IS 29500 editions are issued
2. Mechanism for receiving defects and related comments on IS 29500 operating in July.
3. In September, 2008, SC 34 meets in Korea and takes next steps, with meetings every six months (figure March 2009 in Prague, September 2009 in U.S., then 2010 meetings in Sweden, then South Africa).
Working groups that will be doing the technical work are yet to be set up and they will have their own meetings, conference calls, and mailing lists as well as ones synchronized with SC34.
(I have a current passport with lots of room for visa stamps. Now I just need a sponsor for expenses/subsistence and a national body to nominate me to a committee. Hint, hint.) --dh
This comes under the umbrella of supporting organizational activities that intersect with standards for document formats. This entry is a placeholder before there is organized material on those activities. When the time comes, I will need to draw some fine line between pure-advocacy activities versus constructive development and adoption support for interoperability.
Focus is on addition of value via composite, collaborative applications that integrate with the Microsoft Office System as a platform. Featuring OOXML involves some specific support for the format, its automatic use, and, desirably, reliance on custom-content and niceties for all of this that are part of Office 2007. There is a "Featuring OOXML" mark being encouraged for adoption in product materials.
I imagine that one measure of harmony will involve consistent presentation rendering of documents by allegedly harmonious processors. The "renderance (cache)" example on Dean Allen's Textism blog is an example of the not-always-subtle practical difficulty of determining rendering fidelity across platforms and processors. There is much to grapple with in terms of specified behavior and in terms of implementation glitches, identifying the deviants in the game, and profiling around the pot holes.
My first thought is that the ability to produce identically-rendered PDF output is one kind of test that might be mechanically verifiable, bolstered with some experience-based, craftily-composed test cases. This is a second-order kind of fidelity, tied to printing models and their pipelines, and we still have to deal with differences of screen appearance even when printing seems to come out "right" (i.e., harmonious-enough). But PDF, now itself standardized, seems like the most viable stake in the ground for the moment. (Ultimately, we might throw a standardized XPS into the mix as a kind of honesty check between available final-form renditions and how harmonious products achieve agreeable fidelity.)
This is not an easy problem and it won't be swallowed whole. There will be serious temperance through reliance on an extensive progression of the least things that could possibly work at each stage. I think that applies to the strictest fidelity that can possibly be verified, too. --dh
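To make the mechanical core of such a test concrete, here is a hedged sketch: given raw pixel buffers for corresponding pages of two renditions (the rasterization of the PDFs themselves, by whatever tool, is assumed to happen elsewhere), compare them exactly or against a tolerance. All names here are illustrative, not part of any actual test suite:

```python
import hashlib

def page_digest(pixels: bytes) -> str:
    """Stable digest of one rendered page's raw pixel buffer."""
    return hashlib.sha256(pixels).hexdigest()

def renditions_identical(pages_a, pages_b) -> bool:
    """True iff two renditions match page-for-page, byte-for-byte."""
    return (len(pages_a) == len(pages_b) and
            all(page_digest(a) == page_digest(b)
                for a, b in zip(pages_a, pages_b)))

def mismatch_ratio(a: bytes, b: bytes) -> float:
    """Fraction of differing bytes between two equal-size page buffers;
    useful for 'harmonious enough' thresholds rather than strict equality."""
    if len(a) != len(b):
        return 1.0  # different dimensions: treat as a total mismatch
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)
```

Strict byte equality will almost certainly be too brittle across font rasterizers and anti-aliasing settings, which is why the tolerance form is likely the workable one.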
For example, ECMA-376 for Office Open XML File Formats was the same as DIS 29500, the draft that was proposed for standardization under ISO/IEC JTC1. But IS 29500 will be a very different beast, altered from ECMA-376 as a result of the Ballot Resolution Meeting that preceded achievement of approval.
As another example, the OASIS Standard for Open Document Format (OpenDocument) v1.0 of 2005-05-01 was submitted as DIS 26300 and approved via the PAS process at ISO/IEC JTC1. But the issued ISO/IEC standard, IS 26300:2006, is the OASIS Standard for Open Document Format (OpenDocument) v1.0 (Second Edition) Committee Specification 1 of 2006-07-19.
To illustrate the complexity of the kinds of meanderings that occur, ANSI/X9 X9.100-181-2007 Specifications for TIFF Image Format for Image Interchange is a special adaptation of TIFF for the exchange of the images of bank checks among financial institutions. The images are in Group 4 bi-level encoding and the TIFF 6.0 specification is referenced in the abstract. That makes it sound something like TIFF/F, which was standardized by CCITT. I don't know what the references in X9.100-181 are, but to my knowledge TIFF 6.0 is not under the management of any standards organization. The specification was issued on June 3, 1992 by Aldus Corporation, which held the copyright until it was inherited by Adobe in a subsequent acquisition. [The Adobe rebranding of the document preserves the technical content and even the cover date, but the front-matter is modified, including elimination of the names of external contributors, a matter of some personal interest. --dh].
So that dependencies can be tracked and the correct materials understood for references from other specifications, I propose to catalog and capture specifications in the following ways:
1. There will be a separate sequence of pages (a folio in the nfoWare organization of web materials) for the progression of specification versions under a single authority (e.g., OASIS, ISO/IEC JTC1, Ecma International, IETF, W3C, consortia, and proprietary authorities such as Sun Microsystems and Adobe). The dependencies can be cross-referenced among the different sequences for specifications of interest for nfoWare.
2. There are standards development activities that one might want to track for relevant background. It is not of interest for nfoWare to provide a historical account. Interest is in the authoritative editions of specifications. However, the development activities may provide important resources for questions, discussions of points of concern and clarifications that may be important to achievement of harmonious interoperability. [added 2008-04-07: We certainly want to know about errata and there may be current items of work that need to be looked at with anticipation.] A separate folio is used for any tracking of an individual standard development activity, contact information, availability of archives and discussions, etc.
3. Available resources for implementations of a given specification, including test suites, conformance-verification tools, samples, reference implementations, and translation/conversion aids will be catalogued in one or more separate folios, probably by platform as well as particular standard.
This is still sketchy. We'll try it out first with ODF materials and then with OOXML materials and refine things as we go. --dh
1. File streams and their manipulation at the octet-data, binary level
2. Containers, such as Zip files, with their items, directories, and the various uses of compression, digital checks and signatures, and encryption
3. Character encodings using single- and double-byte encodings, Unicode
4. XML and its processing and transformation technologies
5. Other, similar technologies and formats that are employed with documents but are not confined to documents (e.g., image formats)
These are not directly about ODF or OOXML. They are relied upon instrumentally, but they are not a central focus of nfoWorks. They are, however, relevant more generally as nfoWare (and DMware, in the case of XML).
Attention to foundation elements will be jointly covered in a way that the use on behalf of document harmonization is featured here, with the general treatment under nfoWorks.
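As a small illustration of items 1, 2, and 4 working together, both ODF and OOXML documents are Zip containers that can be distinguished with nothing but standard-library tools. This is a sketch of my own, not an nfoWorks deliverable:

```python
import zipfile

def sniff_package(path):
    """Crude sniff of a Zip-based document container: ODF packages carry
    a 'mimetype' item naming the document type; OOXML packages carry a
    '[Content_Types].xml' manifest."""
    with zipfile.ZipFile(path) as pkg:
        names = set(pkg.namelist())
        if "mimetype" in names:
            # ODF stores the media type as the uncompressed first item
            return pkg.read("mimetype").decode("ascii", "replace")
        if "[Content_Types].xml" in names:
            return "OOXML/OPC package"
    return "unknown Zip package"

# An .odt yields "application/vnd.oasis.opendocument.text";
# a .docx yields "OOXML/OPC package".
```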
I could use del.icio.us, but that actually takes longer, needing to leave my Outlook list of unread posts to go to the actual web pages. Also, that doesn't get the information closer to where it belongs, which is somewhere here on nfoWare.
I could put a del.icio.us feed here, and I might do that, but it isn't a way to capture the material. I need a way to capture, here on the site, and also have a way to publish it as a feed for anyone who cares. (It would be useful to do that with this diary too, and I have been short-sighted about that.)
I could also finally bite the bullet and put up a wiki. The hosting service for nfoWare has a setup for MediaWiki, my favorite. I will consider that, especially as a way to be more inviting for community participation.
Thinking about it right now, I do fancy the idea of editing directly into an RSS feed so that it is web-presentable and can be subscribed to.
There are more cases to consider, including use of AtomPub, applying Windows Live Writer to the task, and so on. My pending infrastructure update and use of a Windows Home Server may also provide further opportunities. There may be an application of social software here.
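As a sketch of how simple the edit-into-a-feed idea could be, here is a minimal RSS 2.0 emitter using only the standard library. The channel details are placeholders, and a real feed would also want publication dates, GUIDs, and more careful escaping:

```python
import xml.etree.ElementTree as ET

def build_feed(title, link, items):
    """Build a minimal RSS 2.0 document; items is an iterable of
    (item_title, item_link, description) tuples."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for item_title, item_link, description in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
        ET.SubElement(item, "description").text = description
    return ET.tostring(rss, encoding="unicode")

# build_feed("nfoWare Diary", "http://example.org/diary",
#            [("A note", "http://example.org/diary/1", "Text of the note.")])
```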
Still pondering ... --dh
I have learned that Thomson Scientific operates Techstreet, a "World Standards Marketplace" promising access to the world's largest collection of industry codes and standards.
Techstreet is apparently a reseller. I am not that thrilled about purchasing standards in this way, but the Techstreet site and their e-mail newsletters do provide useful information about newly-published standards around the world. Featured recently,
- ANSI/X9 X9.100-181-2007 Specifications for TIFF Image Format for Image Exchange. This covers a specific application of TIFF used in the exchange of check images among financial institutions.
- ISO/IEC 12207:2008 and 15288:2008 Set: Systems and Software Engineering - Life Cycle Processes
There are ones that will be pertinent to nfoWorks, and I'll need to find much more economical access to those. Techstreet is a useful way to learn of specifications as they become available. --dh
Opponents of OOXML have extensive compilations of defects. I am not interested in the attitude and point-of-view that leads to these materials, but I do want to keep an eye on them as pot holes to look out for on the highway to Harmony. Here are a representative few that I have in my collection. Their occurrence here does not mean that I concur with the claims about defects nor with the significance of the alleged defects:
1. Reuven Lerner: OOXML: Why Is It Bad, and What Can We Do? Blog entry, OStatic, 2008-04-02 (cache)
2. Rob Weir: OOXML's (Out of) Control Characters. An Antic Disposition (web log), 2008-03-24 (cache)
3. Rob Weir: How Many Defects Remain in OOXML? An Antic Disposition (web log), 2008-03-18 (cache)
[update 2008-04-03: Erika Ehrli's April 2 "Happy News for Open XML Developers (cache)" provides an additional list of resources and links, including ones not listed by Mahugh.]
If there are no formal appeals from any national bodies in the next two months, we should expect to see availability of IS 29500:2008 sometime this summer, providing a clean version with all approved changes.
In the meantime, nfoWorks will rely on ECMA-376, the ECMA Standard for Office Open XML File Formats. I'll keep an eye out for areas that are likely changed in IS 29500. There is much to do before that becomes a serious concern. --dh
For here, the sticking point was the unfinished repaving. I also noticed that I needed it to become April, so I could easily create more folios on nfoWorks technical matters. There was a silly obstacle. In my web-site development methodology, it is easy to use pages in one folder as boilerplate for pages in another folder. It is harder to use pages in the same folder as boilerplate for more pages. Now that I am starting the notes folder for April 2008, I can use material from the March 2008 folder. It is that simple.
Of the dependencies noticed in my 2008-03-30 entry, there is one more of an infrastructure and plumbing nature: I am also going to upgrade the SOHO development systems and add a shared server for backing up and then archiving all nfoWorks and other development materials.
I am now more empowered to start collecting materials and cataloging the resources needed to begin Harmony Principles experimentation. --dh