Comments on Standards for Web Documents

Pre(r)amble

[Image: the first page of the score to Stravinsky's The Rite of Spring]

This image is provided as a humble reminder that, as markup languages go, HTML is really not that tough.

In a former career path as a music student, I learned this story. It is apocryphal, but I'm going to share it with you.

A European orchestra is making a concert tour of North America, with a program whose first half features a symphonic overture and a piano concerto. It's an ambitious schedule: let's say seventeen cities in nineteen days. For the most part, things go pretty smoothly. They are travelling every day, but they usually arrive in town in time to get to the concert hall in mid-afternoon for a light rehearsal. The orchestra can get a feel for the acoustics of the hall and the pianist can get a feel for the actual keyboard he will be playing. Then they have some time to relax, and then it's 8:00 and time for the concert.

About half-way through their tour, though, disaster strikes. They have travel problems: their flight is delayed, they miss a connection. For one reason or another, they don't arrive in town until 6:30pm. They only have time to rush to the auditorium, throw on their concert attire, warm up, and start to play.

The performance starts out alright. The overture goes well, the soloist comes out for the concerto, they go through an orchestral introduction, the soloist raises his hands over the keyboard for a big entrance, brings them down…

…and all hell breaks loose. The piano is crashingly out of tune. People in the audience gasp out loud, the soloist recoils in horror, and the orchestra actually has to stop in mid-performance. It's the musical equivalent of a train wreck.

What happened? The piano was actually in perfect tune. Like all concert instruments, it was tuned to a benchmark pitch: concert A, the A above middle C. But therein lies the problem. What frequency is concert A? Being a good North American piano, this one was tuned to the good North American convention of 440 Hertz. But many European orchestras tune to the equally good European convention of 435 Hertz; some ensembles tune to A432, and some, particularly original-instruments groups, tune as low as A430. The difference between A440 and A430 is almost a quarter step, enough to loosen the fillings of the most tone-deaf person.
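
For the skeptical, the arithmetic holds up. In equal temperament a half step is 100 cents, so a quarter step is 50 cents, and the gap between A440 and A430 works out to

$1200 \log_2(440/430) \approx 39.8$ cents

which is indeed just shy of a quarter step.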

This is, of course, a parable about interoperability.

As I say, this story is apocryphal, and almost certainly could never happen. That's because concert musicians have been playing this game for too long. They decided hundreds of years ago that the system in which they wanted to operate would be defined by ambiguous standards. (By the way, I usually use the term standard pretty loosely.) That gives them the maximum amount of flexibility in reinterpretation. But in such an ambiguous system, you need two things for groups to perform together, or interoperate: a system of constant feedback, so that each performer knows what interpretation every other performer is using at any moment, and a mechanism to establish whose interpretation takes precedence in the inevitable event of conflict. This is pretty high systems overhead. For a symphony orchestra and conductor, the training required to lay down these systems may represent tens or hundreds of person-years.

In the information and networking biz, we usually decide we can't devote that kind of resources to on-the-fly standards interpretation, so we go the other route. We define our standards with a very low degree of ambiguity. We make them as specific as possible, to the point where being in compliance with the standard is a pretty good guarantee that your system will work with any other system that complies with the standard.

Both of these approaches will work, though both have limitations that must be worked around. In fact, about the only approach that will not work is to define an unambiguous standard and claim to adhere to it without actually doing so. That way you give up the unambiguous guarantee of interoperability without having the feedback system needed for on-the-fly reinterpretation of the standard.

This, of course, is the state of the web today.

I am here, ultimately, because, when Netscape issued the first public beta of their browser, I took one look at it and instantly formed a deep and lasting dislike for it. I disliked it for three reasons. First, my desktop workstation at the time was an X terminal running off a DEC Ultrix machine, and Ultrix was one flavor of Unix Netscape never supported. So I couldn't use the program on my own workstation. Second, and most petty, I simply didn't like its interface: its navigational icons seemed somehow dark and depressing, configuration options never showed up in the menu where I expected them, and so on.

My third reason was perhaps the most substantial complaint I had. I already had a browser I liked a lot, NCSA Mosaic, and I used it successfully on both Unix and Windows computers. Mosaic gave me, the user, a remarkable amount of control over the presentation of documents I read; it let me set presentational styles for almost every element in HTML, and to switch quickly between sets of styles I had, for example, for onscreen display and for printing.

So for better or worse, Netscape has never been my primary browser. Whenever I receive a request to add a page to the Libweb lists of library home pages, I always view the page to make sure its URL is correct and reachable. And so, in 1994 and 1995, and on into 1996, I was using a non-Netscape browser to view an increasing number of pages that had been created only with Netscape's current version in mind. If a page looked right in that version of Netscape, it was presumed to be fit for general use. The result was a lot of pages that looked ugly, looked wrong, or were totally invisible in other browsers.

The Problem

When is a standard not a standard?

Many of the problems plaguing any effort to make today's web work in a standards framework involve two events from 1994. First, with HTML 2.0 already badly behind the times, the W3C put together a draft of HTML 3.0 that the developer community flatly declined to support. HTML 2.0 described the state of the art as of 1994; the draft of HTML 3.0 expired in September 1995, and HTML 3.2 wasn't finalized until January 1997. Basically, for about two years, there was no complete description of what HTML actually was. There were no rules to the game.

This also happened to be when Netscape really took off.

I am about to make several comments that will sound like I'm picking on Netscape. Let me say up front that I don't think Netscape's intentions have been either much better or much worse than other Internet software makers. But their effect has been more noticeable, because Netscape got started before there was an installed mass with its own inertia. From the release of Netscape 1.0 to the release of Internet Explorer 3.0, the Web was almost, but not quite, what Netscape said it was.

So the <CENTER> element got adopted, and was even grudgingly written into the specification for HTML 3.2. <BLINK> gets used. Microsoft's <MARQUEE> element and WebTV's <BLACKFACE> element generally do not.

Let me offer a few comments on the CENTER element. As I remember, the first versions of Netscape had people talking about two cool features: it started displaying the beginning of a long document while it was still loading the end, and it let authors center their text onscreen. When you look at a list of other HTML elements, you should notice a common factor:

H1       First-level header
P        Paragraph
UL       Unordered list
STRONG   Strong emphasis

These are all nouns. They refer to parts of a document, and for better or worse they don't describe how those parts should be presented to a user.

CENTER, on the other hand, is a verb, a command, a page layout instruction. Describing a presentation is all that it does; it doesn't contribute to any description of what a page actually is. Netscape's explanation of CENTER, when it was unveiled, was that they were providing a feature that designers and developers were constantly requesting.

Obviously, it's Netscape's duty as a company to respond to the needs of their users. But what design artist stood up and said, "The Web is an impoverished design environment and browsers are too limited in their display options. All I need to fix that is a way to center text and images." Isn't it telling that there was never a RIGHT element, and certainly never an element for aligning text to a point 33% across the screen, and so forth?
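
For contrast, here is how the same effect reads when the command is kept out of the document structure. This is only a sketch, assuming a CSS1-capable browser; the class name banner is my own invention for illustration:

<!-- The CENTER way: a layout command baked into the markup -->
<CENTER><H1>Welcome</H1></CENTER>

<!-- The structural way: the element names the part of the document;
     a CSS1 rule, placed in the document's HEAD, handles presentation -->
<STYLE TYPE="text/css">
  H1.banner { text-align: center }
</STYLE>
<H1 CLASS="banner">Welcome</H1>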

The underground rumor about BLINK is that it was an internal joke at Netscape—a gag to see how obnoxious you could make text onscreen—and that it got out into a public beta by mistake and was actually adopted by web authors before Netscape could take it out. I find myself wondering if CENTER was also an internal trial balloon that escaped to the wild.

"Every Major Version of Netscape…"

It isn't as if Netscape is without a history of letting bad ideas, or at least bad implementations, out the door. There is a sort of mantra in the Usenet HTML authorship newsgroups: every major release of Netscape has supported and tacitly endorsed invalid HTML that broke in the following major release.

That is, authors building their sites according to what Netscape supported have repeatedly found that the beta 1 release of Netscape's next browser was partially or completely unable to display pages that had displayed as intended under the previous release. Specifically:

Netscape 1.1 allowed authors to use multiple <TITLE> and <BODY> tags. For example:

<TITLE>M</TITLE>
<BODY BGCOLOR="#FFFFFF"> 
<TITLE>My</TITLE> 
<BODY BGCOLOR="#CCCCCC"> 
<TITLE>My P</TITLE> 
<BODY BGCOLOR="#999999"> 
<TITLE>My Pa</TITLE> 
<BODY BGCOLOR="#666666"> 
<TITLE>My Pag</TITLE> 
<BODY BGCOLOR="#333333"> 
<TITLE>My Page</TITLE> 
<BODY BGCOLOR="#000000"> 
<IMG SRC="my-white-logo.gif">

This would spell out the title "My Page" one letter at a time and fade the background from white to black. In its day, this was a darling of the Kewl D00d school of design. Starting with version 1.2, Netscape honored only the first TITLE and BODY tags, leaving this page with the title "M" and a white background, on which the white logo image would be invisible.

Netscape 1.2 still supported two very serious authorship mistakes. It was very confused about comments, and allowed authors to be very confused about them, and it was far too eager to fix problems with quotation marks.

First the comments: Netscape behaved as though it didn't understand where HTML comments come from, and in version 1.x just treated them as a somewhat unusual tag:

<!--                 This started a comment in Netscape 1.x
This is a comment    None of this appeared in the browser window until we got to…
>                    This closed the comment. Text following this did appear.

The trick is that comments are not just a tag: that's why they don't look like other tags. Comments are defined only within SGML declarations; since we don't usually fiddle with the various declarations that comprise HTML, we generally don't recognize the bracket-exclamation mark construct.

<!                   Start of declaration
--                   Start of comment
This is a comment    None of this appears in the browser window until we get to…
--                   End of comment
>                    End of declaration

Consider this document:

<HTML>
<HEAD><TITLE>Hello</TITLE></HEAD>
<BODY>
<!--Last edited December 31, 1994>
<H1>Hello...

This document would have worked fine in Netscape 1.x, but would have been completely invisible in Netscape 2.0. (By the way, I should point out that it is still very possible to confuse Netscape, Internet Explorer, and even Opera and Lynx with valid, but strangely formatted, comments.)
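
To make that concrete: SGML permits more than one comment, each delimited by its own pair of double hyphens, inside a single declaration. Markup like the following is perfectly valid, but I would not bet on every browser handling it (a contrived example for illustration):

<!-- one comment -- -- a second comment in the same declaration -->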

Similarly, 1.x tried to make intelligent guesses when confronted with missing quotation marks. Given this markup:

<A HREF="foo.html>Go to foo</A>
<A HREF="bar.html>Go to bar</A>

Netscape would assume the author was making two separate links, to foo.html and to bar.html, and would act as though the closing quotes were there. Starting with version 2, Netscape would extend the first HREF attribute to the next quotation mark, making a single link with the strangely formatted URL foo.html>Go to foo</A> <A HREF=
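
The repaired markup, for the record, just closes each quote:

<A HREF="foo.html">Go to foo</A>
<A HREF="bar.html">Go to bar</A>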

From this came the great Netscape 2.0 Comments 'n' Quotes Debacle, in which it turned out that many authors were not closing their comments and quotes, and in which many authors discovered their pages were broken or partially or entirely invisible. For several months, Usenet newsgroups were awash in vitriolic complaints that the new browser was badly broken and Netscape should never have released something so buggy.

Whereas versions 1.2 and 2.0 threw authors for a loop by removing previous misbehaviors, version 3.0 actually added a new misbehavior. This one is pretty subtle, and I can tell you that I for one have not completely worked around it.

All versions of Netscape through 3.0x tried to make guesses about character entities that lacked their closing semi-colon.

Entities Seen     Perceived To Be…     Rendered As…
2&lt3             2&lt;3               2<3
4&times5          4&times;5            4×5
Z&uumlrich        Z&uuml;rich          Zürich
&sect&para&copy   &sect;&para;&copy;   §¶©

Netscape 3.0 extended this one step further than most authors wanted. Netscape 3.0 also performed these substitutions in URLs:

<a href="foo.cgi?book=1&section=A&paragraph=3&copyright=yes">

became

<a href="foo.cgi?book=1§tion=A¶graph=3©right=yes>

Netscape 4.0 reversed this behavior, no longer accepting any ambiguous character entities. That is, starting with version 4.0, Netscape correctly requires entities to end either with a semicolon or with a character not allowed in a character entity, such as a space.

Given this markup:

The S&ampP 500 stock index was up today…

Netscape 3.x renders it as: The S&P 500 stock index was up today…
Netscape 4.x renders it as: The S&ampP 500 stock index was up today…

Early reports from developers and alpha testers suggest that Netscape 5.0 might set new records for breaking invalid constructs that appeared to work alright in previous versions. You have been warned.

Who gains from validation?

How many of you run a web site? How many of you are charged, either explicitly or implicitly, with providing material on the web that is:

In my opinion, treating this as five separate tasks will break your back. On the other hand, you can achieve much or all of this simply by sensitive adherence to relevant standards: HTML 4.0 Transitional, CSS level 1, and the W3C's forthcoming accessibility guidelines.
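
Adherence begins with saying which standard you claim to follow. Here is a minimal sketch of a document that validates against HTML 4.0 Transitional; the title and body text are placeholders:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
  "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD><TITLE>A Minimal Valid Page</TITLE></HEAD>
<BODY>
<P>Content goes here.</P>
</BODY>
</HTML>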

The Players

Authors

I would like to recreate a post I have made a couple of times to the Web4Lib mailing list.

The genesis for this was an unfortunate coincidence: Libweb received a rash of new pages that, at the markup level, were essentially gibberish at the same time that people on Web4Lib started another group rant about the role of librarians on the web. We are the organizers of information; by organizing, we preserve access to information that would otherwise be washed away in the currents of time; and so ours is the profession that should guide the development of the virtual digital online wired cybrary of the future, and so on.

I am actually not unsympathetic to this viewpoint, but it often comes across as more self-important than insightful. And as I added yet another viciously ugly, incomprehensible library home page to Libweb, I found myself asking this question: exactly how many web sites in this profession adhere to the information standard that nominally controls them? Or, as the question boiled down, how many library home pages validate? So, in February 1997, I ran a few hundred URLs through a validator. Since HTML 3.2 had just come out, I waited a couple of months and repeated the process in April 1997. I repeated it again in January 1998, shortly after HTML 4.0 was announced, and then again in July 1998. In September 1998, I tried a slightly different tack: instead of looking at home pages, I took fifty home pages at random and followed their links to further pages on the same sites.

DATE                     February 97   April 97   January 98   July 98   September 98
No. of Pages             624           1114       1389         1661      435
Percent Valid            3.85          6.91       5.90         4.82      5.29
Percent under 4 Errors   15.54         21.18      21.53        18.42     19.08
Percent over 80 Errors   2.24          5.15       5.04         4.27      10.80
Median No. of Errors     13            13         13           14        10
Avg. No. of Errors       20            24         23           23        32

This does not give me encouragement. In web terms, this is a pretty long-term longitudinal study, and it shows that adherence to the HTML standard is poor and not improving.

What has gone up is the number of authors who are turning over what should be the prosaic task of doing the HTML markup to editor programs. In April of 1997, 22% of the pages I looked at identified some editor program in use; that number was above 36 percent in July of this year. This tells me that these programs, whose core function is to mark up HTML, are incapable of doing so in any way defined by open standards.
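
For those wondering how a page identifies its editor: most of these programs stamp their work with a META tag in the document head, along these lines (the CONTENT value shown is a typical example, not taken from any particular page):

<META NAME="GENERATOR" CONTENT="Mozilla/4.05 [en] (Win95; I) [Netscape]">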

Editors

Last year, I was trying to identify HTML editors for OhioLINK staff members to use for documents they were adding to our site. Since most of the programs I considered likely candidates were either shareware or at least had a downloadable evaluation version, I was able to try them out in a little experiment.

I wrote a relatively simple HTML 3.2 document, and then used each of the editors in turn simply to open the document and save it to a new file name. How much damage could that do?
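
The test file itself is not reproduced here, but it was along these lines; this is a reconstruction for illustration, not the actual document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD><TITLE>Editor Round-Trip Test</TITLE></HEAD>
<BODY>
<H1>Editor Round-Trip Test</H1>
<P>A paragraph with <STRONG>strong emphasis</STRONG> and
a <A HREF="foo.html">relative link</A>.</P>
<UL>
<LI>A list item
</UL>
</BODY>
</HTML>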

Editor                                       Errors
Adobe PageMill 2.0                           3
Claris Home Page 2.0                         8
HoTMetaL Pro 3.0                             0
Netscape Composer (from Communicator 4.0)    1
Microsoft FrontPad (from MSIE 4.0b)          0
Microsoft Word 97                            13

Perhaps the problem was in asking these editors to work with an imported file. My next step was to write a best possible duplicate of the original document in each editor.

Editor                 Errors
Adobe PageMill 2.0     6
Claris Home Page 2.0   8
HoTMetaL Pro 3.0       0
Netscape Composer      2
MS FrontPad            0
MS Word 97             26 (and could not complete the assignment)

A year has come and gone, some editors have been upgraded, and at least a few more have appeared. Time to download a few more eval copies…

Editor                                              Errors (import/save as / from scratch)
Adobe PageMill 3.0                                  3 / 2
Claris (now FileMaker) Home Page 3.0                12 / 12
Composer (from Communicator 4.5b1)                  0 / 1
FrontPad (now FrontPage Express, from MSIE 5.0b1)   1 / 1
Symantec Visual Page 2.0                            2 / 1

Looking at the source of the resulting files, I'm convinced that the errors these editors introduce are the indirect result of two things: just plain goofy markup, created in an attempt to make HTML documents look and act like word-processing documents, and an insistence on writing for circa-Netscape 2.0 browsers whether you want that or not.

Here are some of the sillier tagging examples provided by programs which ought to know better.

Editor Lowlights: Not Mistakes, Just Goofiness

My Markup                     As Converted By Editor

<P>Text</P>                   <P>Text
<OL>                          <BR>&nbsp;
<LI>                          <BR>&nbsp;
</OL>                         <BR>&nbsp;
                              <OL>
                              <LI>
                              </OL>

<DIV ALIGN=CENTER>            <DIV ALIGN=CENTER><CENTER>
</DIV>                        </CENTER></DIV>

<IMG SRC=foo.gif alt=>        <IMG SRC=file:///D/dir/foo.gif>

<TABLE BORDER=0>              <TABLE WIDTH=1 BORDER=0>

<TD>                          <TD WIDTH=50%>

<TD ALIGN=RIGHT>              <TD><P ALIGN=RIGHT>
</TD>                         </P></TD>

<BR CLEAR=ALL>                <P><BR CLEAR=LEFT></P>

<TABLE BORDER=1>              <TABLE BORDER>
</TABLE>                      </TABLE>

Conclusions

Many authors and webmasters tend to tune out of discussions on HTML validation, saying, "On my site, the important thing is the content." But I think it's a little naïve at this point to believe that markup and content, or syntax and semantics, are so cleanly separable that you can concentrate on just one without affecting the other. Content suggests to us, as authors, sensible ways to mark up our pages; markup suggests to browsers, and by extension to users, how the content is structured. If your markup prevents your pages from being indexed, found, understood, or used by someone who should be using them, then for that user your site has no content.

If that verges on philosophy, my apologies. Here's something more pragmatic. It's another parable, but this one really happened. One afternoon about five years ago, I was at the reference desk of the University of Washington Engineering Library. I got a phone call from a cataloger who was adding notes to the catalog records for a number of theses. She was following up on a 1987 electrical engineering thesis that seemed to include a floppy disk. Could I go look at the thesis and see if it was a 5¼ inch or 3½ inch disk, a PC or Mac disk, etc.? It turned out that the disk required, not a PC or a Mac, but a Commodore 128. When I looked at that thesis, it was only about six years old, and for all intents and purposes, because of one poor format choice, the important part of its research was gone.

At some point—and some of us may already have reached this point without knowing it—we are going to place on our web sites documents whose archival life span will reach that six-year mark. Try to imagine what the browser environment will be like in six years. You can't. But you can guess that six years from now it will be very hard to get hold of a copy of Netscape 4.0, but much easier to get a copy of the HTML 4.0 specification. Using that as a Web Rosetta Stone, a user six years from now at least has access to something that will explain what your document was trying to do and trying to mean.

So, if you need that document to be accessible six years from now, you have two choices. You can commit yourself today to revisiting that document—and every other document on your web site—every six months or so and retagging it to meet the current browser hacks, or you can make sure the document exists in a standard format that will be available for the long term, and never touch it again.

In closing, my question is: which of those two approaches really frees you up to play with content, and which forever shackles you to the tedious job of fixing tags?
