The XHTML 2.0 documents processed by xhtml2to1 are linked together differently from documents based on traditional SGML DTDs (such as DocBook); they are linked more like HTML Web pages.
This section explains the rationale for this design.
For comparison, let us take an example DocBook book, containing some chapters with sections within them. We might organize our DocBook XML files something like this (the layout and file names below are illustrative):
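    <!-- book.xml: the master document pulls in each
         chapter as an external entity -->
    <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
      "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
    <!ENTITY chapter1 SYSTEM "chapter1/chapter.xml">
    <!ENTITY chapter2 SYSTEM "chapter2/chapter.xml">
    ]>
    <book>
      <title>Example Book</title>
      &chapter1;
      &chapter2;
    </book>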
So everything is conceptually under one XML document tree, which happens to have its fragments spread over multiple external entities.
By contrast, Web pages do not work that way. Instead, each file, say chapter1/sect1.html, is a complete XML document in itself. Hyperlinks to the other sections of the book are made with URIs.
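For example, a cross reference from one section file to another is just an ordinary hyperlink (the file names here are illustrative):

    <!-- In chapter1/sect1.html -->
    <p>The output format is described in
      <a href="../chapter2/sect1.html">another chapter</a>.</p>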
The chief advantage of the DocBook model is that it is simple and straightforward. If you want to make a cross reference from one part of the book to another, you simply attach an ID to the target element and refer to that ID from other parts of the book.
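In DocBook this looks something like the following (the ID and title are illustrative):

    <chapter id="output-formats">
      <title>Output Formats</title>
      ...
    </chapter>

    <!-- Anywhere else in the same document tree: -->
    <para>See <xref linkend="output-formats"/> for details.</para>

The xref generates its label text from the target's title and number, so the author never types it by hand.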
But there are problems with the DocBook model as well. What if you want to link to an external document, something you did not write? You can no longer use IDs. This situation will certainly occur in your documents; nothing exists in a vacuum.
DocBook can link to external documents with its olink facility. Unfortunately, understanding how to use olink requires an entire chapter in Bob Stayton’s DocBook XSL: The Complete Guide. That is clearly too hard, considering that in HTML Web pages, if the document you are linking to exists on the Web, all you have to do is copy its URL and put it in an HTML a tag. It is true that olink is a more powerful facility than the a element, but we probably do not need all of that complexity most of the time. It is better to start with the basic HTML a element, and then build more optional features into it.
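For comparison, here is the same cross-document reference written both ways (the document and target names are illustrative, and the olink version additionally requires a target database to be generated and configured):

    <!-- DocBook olink: indirect, resolved through a target database -->
    <para>See <olink targetdoc="userguide" targetptr="config"/>.</para>

    <!-- HTML: the URL itself is the whole address -->
    <p>See <a href="http://example.com/userguide/config.html">the
      configuration section</a>.</p>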
Another problem with using external entities or XInclude to incorporate sections of a book into one document is that the document can become very large. That is undesirable because XSLT processors must load the entire document into memory before processing, and processing is likely to be slow.
With modern computing capabilities this may seem a moot issue, but consider serving XML documents directly to Web browsers (slowly becoming a real possibility). Obviously, we want to avoid loading a one-megabyte DocBook XML source just to display one page of that document.
Or consider writing programs by XML-based literate programming. The program may be on the scale of ten thousand lines of code or more. If the literate programming document is all in one piece, then everything has to be processed just to build the program. But we expect to be able to change a few lines in a program source, recompile it quickly, and see our results immediately.
Programmers have had incremental compilation and dynamic linking facilities for a long time; we certainly do not want to give up that ability when writing in XML. In fact, the usual practice of not processing large XML documents incrementally ought to be considered an anomaly.
Unfortunately, we cannot answer all the problems with DocBook-style ID-based linking simply by saying “just use URLs and the XHTML a element”. If we use the a element alone, we are basically forced to key in all the links and cross-reference labels manually, and we lose automatic numbering of sections. These are things we naturally expect a computer program to do for us.
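For example, with a bare a element the author must hard-code both the target and the label, and keep them consistent by hand (the names here are illustrative):

    <!-- If the target section is later renumbered or retitled,
         this hard-coded label silently becomes wrong -->
    <p>See <a href="chapter2/sect3.html">Section 2.3,
      "Output Formats"</a>.</p>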
As the author found out while trying to write xhtml2to1, a solution to these problems is not at all trivial. At this point, we would be wise to question whether we should continue to use ID-based linking despite its problems.
Thus, the questions we are to consider for the ID-based model are: whether it can link to documents outside the author's control, and whether it can cope with large documents that must be processed incrementally.
To resolve the problems just raised, while still remaining within the ID-based model, we could consider using XML databases. They can store many large document trees very efficiently, and can retrieve and manipulate small parts of a tree efficiently too.
An obvious problem with XML databases is that they are too demanding for many users. Documents start out small, and many stay small (e.g. less than one megabyte in plain-text representation); requiring an XML database, with all its complexity, seems to be overkill. Many users (including this author) may want to write their XML documentation with vi, emacs or a graphical XML editor, or generate the documentation from comments in the source code, and most of these tools do not support XML databases. Many users may also want to publish their documentation as plain files, without depending (too much) on a fancy XML database and Web server being up and running.
The other problem, which even XML databases do not solve, is isolating dependencies within a big document tree. An XML database knows how to “patch” a big document in small pieces, but it cannot know how those patches affect other parts of the document, because the effects depend on markup semantics. For example, if the writer changes the title of a section, the display system needs to know that all the cross references to that section occurring elsewhere in the document must change their labels. If the display system naïvely uses the XML database without keeping track of the cross-reference markup, it still has to process the whole document again just because one section title changed. That is obviously no better than storing XML in flat text files.
So, in summary, we would like a system that: links to arbitrary documents by URL as easily as HTML does; still numbers sections and generates cross-reference labels automatically; can process documents incrementally, in small pieces; and stores documents as plain files, without requiring an XML database.
It is a given that xhtml2to1 needs to support ID-based linking as well — for IDs are obviously used for linking within a sub-document. But xhtml2to1 does not intend to include features specifically for working with large monolithic documents.