Re: HTML parser in Yacc form???

uid#15033@dxal18.cern.ch
Fri, 24 Mar 1995 10:15:22 +0500


In article <AA76@cernvm.cern.ch> you write:
|>>|> I was wondering if there exists a specification of HTML in yacc
|>>|>(or BNF) form. It has probably been done, as constructing such a parser is
|>>|>much easier this way than with a traditional C subroutine.
|>>
|>>Don't think about it. HTML is not an LR(1) grammar and so trying to use yacc
|>>is only going to cause pain. The best way of parsing SGML is with a top down
|>>recursive descent parser. Try to use yacc and you will end up in all sorts of
|>>troubles, especially with error reporting.
|>
|>Phill is technically correct (that one cannot parse SGML and hence
|>HTML using YACC et al).
|>
|>If one limits oneself to a subset of SGML, it is quite possible to
|>produce a YACC grammar. Dan Connolly has produced such a grammar for
|>HTML by hacking DTD2HTML, and the TEI folks have produced an
|>*excellent* and very *useful* subset of SGML, and the grammar is
|>available at:
|>
|> ftp://ftp-tei.uic.edu/pub/TEI
|>
|>While these can accept some documents that are not quite legal SGML,
|>99.9% of documents I've seen would be both legal within the TEI
|>grammar, and within SGML.

But why bother?

Parsing SGML with a top-down recursive descent parser based on an FSR is
by far the simplest approach to implement and also produces correct code.
Why would anyone want to use an inappropriate tool which does the job less
well and is more difficult to use?
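
To make the comparison concrete, here is a minimal sketch of the top-down
approach in Python. It is an illustration only, not a real SGML parser: the
toy grammar (plain tags with no attributes, every element explicitly closed)
and the names tokenize, Parser and parse are assumptions made up for this
example. Each grammar rule becomes one method, and an error is reported at
the exact point where the input stops matching the rule.

```python
import re

# Toy token pattern (an assumption for this sketch): <name>, </name>, or text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def tokenize(text):
    """Yield ('open', name), ('close', name) or ('text', data) tuples."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.group(3) is not None:
            yield ("text", m.group(3))
        elif m.group(1):
            yield ("close", m.group(2).lower())
        else:
            yield ("open", m.group(2).lower())
        pos = m.end()


class Parser:
    """One method per grammar rule; the call stack tracks open elements."""

    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", None)

    def parse_content(self):
        # content := (text | element)*   stops at a close tag or end of input
        children = []
        while True:
            kind, value = self.peek()
            if kind == "text":
                self.pos += 1
                children.append(value)
            elif kind == "open":
                children.append(self.parse_element())
            else:
                return children

    def parse_element(self):
        # element := '<name>' content '</name>'
        _, name = self.tokens[self.pos]
        self.pos += 1
        children = self.parse_content()
        kind, value = self.peek()
        if kind != "close" or value != name:
            raise SyntaxError(f"expected </{name}>, found {kind} {value!r}")
        self.pos += 1
        return (name, children)


def parse(text):
    parser = Parser(text)
    tree = parser.parse_content()
    kind, value = parser.peek()
    if kind != "eof":
        raise SyntaxError(f"unmatched </{value}>")
    return tree
```

Calling parse("<p>Hello <b>world</b></p>") returns a nested list of
(name, children) tuples, and a mismatched close tag raises a SyntaxError
that names the element the parser was expecting to close.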

Yacc is OK if you actually have an LR(1) grammar. But it's best to steer well
clear of it otherwise. In addition, error handling was never really thought out
properly for yacc. I've never seen anyone successfully use the error
productions without coming a cropper.

HTML 2.0 is just about parsable with yacc, but HTML 3 is pretty awful,
especially the maths extensions, since they use some of the character set
shifting functions. This part is distinctly non-LR(1), and the best, most
compact definition of the grammar is produced using a push-down automaton.
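
As a rough illustration of why shifting is awkward for a fixed token
grammar, the sketch below keeps a stack of lexical modes in Python. It
assumes, purely for the example, that "shifting" means certain characters
(here ^ and _) count as markup only inside a math element and are ordinary
data elsewhere. The element name, the mode names and the tokenize function
are hypothetical and are not taken from the HTML 3 draft.

```python
import re

# Tags without attributes only; an assumption to keep the sketch short.
TAG = re.compile(r"<(/?)([A-Za-z]+)>")


def tokenize(text):
    """Tokenize text, treating ^ and _ as markup only while in math mode."""
    modes = ["text"]   # stack of lexical modes; the last entry is active
    tokens = []
    buf = []           # pending run of ordinary data characters
    pos = 0

    def flush():
        if buf:
            tokens.append(("data", "".join(buf)))
            buf.clear()

    while pos < len(text):
        m = TAG.match(text, pos)
        if m:
            flush()
            closing, name = m.group(1), m.group(2).lower()
            if name == "math":
                if closing:
                    if modes[-1] != "math":
                        raise SyntaxError("</math> outside math mode")
                    modes.pop()            # shift back to the previous mode
                else:
                    modes.append("math")   # shift into math mode
            tokens.append(("close" if closing else "open", name))
            pos = m.end()
        elif modes[-1] == "math" and text[pos] in "^_":
            flush()
            tokens.append(("shift", text[pos]))
            pos += 1
        else:
            buf.append(text[pos])
            pos += 1
    flush()
    return tokens
```

With this sketch, the underscore in "a_b" outside the math element stays
part of a data token, while the caret in "<math>x^2</math>" comes out as a
separate shift token. The same character yields different tokens depending
on the mode stack, which is the kind of context a hand-written scanner
handles easily.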

I think the problem lies in comp sci classes being taught that bottom-up
parsing is `better' and the students not asking why. Goldfarb would not know
an LR(1) grammar if one bit him on the nose. If he had, SGML might not fall
into the "much wailing and gnashing of teeth" category, which it does.

PS: I have discovered that the correct pronunciation of "ASN.1" is "assassin 1".

--
Phillip M. Hallam-Baker

Not Speaking for anyone else.