]>
<article>

<title>The HTML Validation HOWTO
<author>Keith M. Corbett, <htmlurl url="mailto:kmc@specialform.com" name="kmc@specialform.com">
<date>v0.2, 29 October 1995

<abstract>
This document explains how to use the &nsgmls; parser to validate HTML
documents for conformance with the HTML 2.0 document type definition, or
"DTD". This DTD is the most commonly accepted SGML-based definition of
HTML, and thus defines a subset of current practice in HTML markup that is
likely to be portable to a wide range of HTML user agents (browsers).
</abstract>

<toc>

<sect>Introduction

<p>
This is a guide to using the &nsgmls; parser to validate and process HTML
documents.

<sect1>Costs and benefits

<p>
Using the full features of SGML markup will enrich your HTML documents.
However, validating your documents against the HTML DTD has certain
cost/benefit tradeoffs, basically because you are dealing with a more
circumscribed dialect of HTML than is currently in vogue. The "official"
HTML rules for enforcing document structure, and the SGML rules for data
content markup, are more restrictive than current practice on the Web.

<p>
The main issue you must consider is that valid HTML is restricted to a
standard set of element tags. There isn't an accepted DTD that accurately
reflects "browser HTML" as understood by many client browser programs.

<p>
For the most part, the HTML 2.0 DTD reflects tags and attributes that were
commonly in use on the Web around June 1994. Various efforts to define a
more advanced HTML+ or HTML 3.0 DTD have gotten somewhat bogged down. And
none of the DTDs in circulation will recognize all of the tags that have
been popularized recently by browser vendors such as Netscape and
Microsoft.

<sect1>Getting started

<p>
Contrary to popular opinion, working with SGML does not have to cost a lot
of time and money.
It is possible to build a robust development environment consisting
entirely of software that is freely available on a wide range of
platforms, including Linux, DOS, and most Unix workstations. Thanks to a
few very dedicated folks, all the tools you need to work with SGML have
been made publicly available on the Internet.

<p>
Setting up your environment (the parser and supporting program libraries)
takes a bit of work, but not nearly as much as one might think.

<p>
You may also want to peruse an introductory SGML text such as "SGML: An
Author's Guide to the Standard Generalized Markup Language" by Martin
Bryan, or "Practical SGML" by Eric van Herwijnen.

<sect>Tools

<sect1>The <tt/HTML Check Toolkit/ package

<p>
If you want a completely self-installing, canned package, check out the
HalSoft <it/HTML Check Toolkit/ at URL:
<htmlurl url="http://www.halsoft.com/html-tk/index.html" name="http://www.halsoft.com/html-tk/index.html">

<p>
The only disadvantage of using the HalSoft kit is that it uses the older
<tt/sgmls/ parser, which produces error messages that are sometimes (even)
more cryptic than those from &nsgmls;.

<p>
I've used &nsgmls; on Linux and Windows (3.x and NT); it is supposed to
work on many other platforms as well.

<sect1>The &nsgmls; parser

<p>
James Clark has built a software kit called <tt/sp/ which includes the
validating SGML parser, &nsgmls;. (This is the successor to the <tt/sgmls/
parser, which has long been considered the reference parser.) For
information on the <tt/sp/ kit, see URL:
<htmlurl url="http://www.jclark.com/sp.html" name="http://www.jclark.com/sp.html">

<p>
You can download the kit directly from:
<htmlurl url="ftp://ftp.jclark.com/pub/sp/" name="ftp://ftp.jclark.com/pub/sp/">

<p>
You may be able to pick up &nsgmls; executable files for your platform.
Otherwise, download the source kit and follow the directions in the
<tt/README/ file for running <tt/make/.

<p>
Consider creating a high-level public directory that will contain
SGML-related files.
For example, on my Linux PC I have various SGML-related directories
including:
<list>
<item>/usr/sgml/bin
<item>/usr/sgml/html
<item>/usr/sgml/sgmls
<item>/usr/sgml/sp
</list>

<sect1>Download the HTML specification materials

<p>
The draft standard for HTML 2.0 includes the SGML definition files you
need to run the parser, namely the DTD (Document Type Definition), SGML
Declaration, and entity catalog. To obtain the HTML 2.0 public text, see
URL:
<htmlurl url="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/" name="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/">

<p>
Download and install the following files:
<list>
<item>DTD <it/html*.dtd/
<item>SGML declaration <it/html.decl/
<item>Entity catalog <it/catalog/
</list>

<p>
You can add two entries to the HTML entity catalog for ease of use with
&nsgmls;:
<tscreen><code>
  -- catalog: SGML Open style entity catalog for HTML --
  -- $Id: catalog,v 1.2 1994/11/30 23:45:18 connolly Exp $ --
        :
        :
  -- Additions for ease of use with nsgmls --
SGMLDECL "html.decl"
DOCTYPE HTML "html.dtd"
</code>
</tscreen>

<p>
Alternatively, you can create a second catalog containing these entries;
you will then have to pass this catalog to &nsgmls; as an argument with
the <tt/-m/ switch.

<sect>Parsing an HTML document

<p>
The following is a "cookbook" for validating a single document. Simply
invoke the &nsgmls; parser and pass it the pathnames of the HTML catalog
file(s) and the document:
<tscreen>
<verb>
% nsgmls -s -m /usr/sgml/html/catalog <test.html
</verb></tscreen>

<p>
The <tt/-s/ switch suppresses the parser's output; see below.

<sect1>Parser input

<p>
Your document must conform to SGML, which means, among other things, that
the document type must be declared at the beginning of the input. (You can
fudge this by prepending the information to the document instance on the
&nsgmls; command line.)
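<p>
As a rough sketch of the "fudge" mentioned above: the snippet below prepends the HTML 2.0 document type declaration to a file that lacks one and feeds the result to the parser on standard input. The helper name <tt/with_doctype/ and the file name <tt/fragment.html/ are hypothetical, and piping through standard input is just one assumed way to do the prepending; the catalog path follows the directory layout used in this document.

```shell
# Hypothetical helper (not part of the sp kit): print the HTML 2.0
# document type declaration, then the contents of the file named by $1.
with_doctype() {
    printf '%s\n' '<!doctype html public "-//IETF//DTD HTML 2.0//EN">'
    cat "$1"
}

# Assumed usage: validate a hypothetical fragment.html that has no
# declaration of its own. The guard skips the run when nsgmls is not
# installed or the file does not exist.
if command -v nsgmls >/dev/null 2>&1 && [ -f fragment.html ]; then
    with_doctype fragment.html | nsgmls -s -m /usr/sgml/html/catalog
fi
```

The file on disk is left untouched, so this is only a stopgap for checking documents you cannot edit; adding the declaration to the document itself remains the cleaner fix.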
<p>
Here's a simple HTML document that can be parsed correctly using the
scheme I've outlined:
<tscreen><code>
<!doctype html public "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>Simple HTML document.</title>
</head>
<body>
<h1>Test document</h1>
<p>This is a test document.</p>
</body>
</html>
</code></tscreen>

<sect1>Parser output

<p>
The standard output of &nsgmls; is a digested form of the SGML input that
processing systems can use as a lexer for navigating the structure of the
document. For the purpose of validation, you can throw the standard output
away and rely on the error output.

<p>
If you do want the full output, omit the <tt/-s/ switch and redirect
standard output to a file:
<tscreen>
<verb>
% nsgmls -m /usr/sgml/html/catalog <test.html >test.out
</verb></tscreen>

<sect1>Parser messages

<p>
Error and warning messages from &nsgmls; can be very cryptic, and invalid
markup may produce a great many of them.

<p>
To send messages to a file, use the <tt/-f/ switch:
<tscreen>
<verb>
% nsgmls -s -m /usr/sgml/html/catalog -f test.err <test.html
</verb></tscreen>

<sect1>Return status

<p>
The parser indicates whether the input document conforms to the HTML DTD
in two ways:
<list>
<item>Return code - the parser returns a 0 exit status on success,
non-zero otherwise.
<item>Output - if the document conforms to the DTD, the last line of
standard output will consist of a single <tt/C/ character.
</list>

<sect>Resources

<p>
The HalSoft <it/HTML Check Toolkit/ is at URL:
<htmlurl url="http://www.halsoft.com/html-tk/index.html" name="http://www.halsoft.com/html-tk/index.html">

<p>
James Clark's page on <tt/sp/ is at URL:
<htmlurl url="http://www.jclark.com/sp.html" name="http://www.jclark.com/sp.html">

<p>
The W3C page on the HTML specification is at URL:
<htmlurl url="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/" name="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/">

<p>
Feel free to contact me via email:
<htmlurl url="mailto:kmc@specialform.com" name="kmc@specialform.com">.

</article>