Robert Maas, see http://t Guest
Fri Apr 20, 2007 12:48 pm
Post subject: Writing HTML parser wasn't as hard as I thought it'd be

For years I've had needs for parsing HTML, but avoided writing a
full HTML parser because I thought it'd be too much work. So
instead I wrote various hacks that gleaned particular data from
special formats of HTML files (such as Yahoo! Mail folders and
individual messages) while ignoring the bulk of the HTML file.
But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all flakey/bugridden, I finally decided to
<cliche>bite the bullet</cliche> and write a genuine HTML parser.
Yesterday (Wednesday) I started work on the tokenizer, using one of
my small Web pages from years ago as the test data:
<http://www.rawbw.com/~rem/WAP.html>
As I was using TDD (Test-Driven Development), I discovered that the
file was still using the *wrong* syntax <p /> to make blank lines
between parts of the text, so I changed them to use valid code.
With that fixed, my HTML tokenizer worked on the whole file, which
is as far as I got last night.
Then I switched to using the Google-Group Advanced-Search Web page
as test data, and finally got the tokenizer working for it after a
few more hours' work today (Thursday).
Then I wrote the routine to take the list of tokens and find all
matching pairs of opening tag and closing tag, replacing each pair
with a single container cell that includes everything between the tags.
For example (:TAG "font" ...) (:TEXT "hello") (:INPUT ...) (:ENDTAG "font")
would be replaced by (:CONTAIN "font" (...) ((:TEXT "hello") (:INPUT ...))).
I single-stepped it at the level of full collapses, all the way to
the end of the test file, so I could watch it and get a feel for
what was happening. It worked perfectly the first time, but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> that were opened but never closed, and
also lots of <p> <p> <p> that weren't closed either. Even some
unclosed elements of tables.
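For anyone who wants to play with the collapsing step described above, here is a minimal sketch of the idea in Python rather than Lisp. It is not the code from this post: the tuple layout, the name `collapse`, and the choice to pair each closing tag with the nearest earlier unmatched opening tag of the same name (compared case-insensitively) are all assumptions made for illustration.

```python
def collapse(tokens):
    """Replace each matching TAG ... ENDTAG run with a single CONTAIN cell.

    Assumed token shapes (hypothetical, loosely modelled on the post):
      ("TAG", name, attrs), ("TEXT", string), ("ENDTAG", name)
    Result cells look like ("CONTAIN", name, attrs, [children]).
    """
    out = []
    for tok in tokens:
        if tok[0] != "ENDTAG":
            out.append(tok)
            continue
        name = tok[1].lower()
        # Search backwards for the nearest still-unmatched opening tag.
        for i in range(len(out) - 1, -1, -1):
            if out[i][0] == "TAG" and out[i][1].lower() == name:
                _, tag_name, attrs = out[i]
                children = out[i + 1:]
                del out[i:]
                out.append(("CONTAIN", tag_name, attrs, children))
                break
        else:
            # No opening tag found: keep the stray closing tag as-is.
            out.append(tok)
    return out
```

Opening tags that never find a partner simply stay in the list as TAG tokens, which is how an unclosed <br>, <b>, <font> or <p> would show up in the output.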
Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
<http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt>
Any place you see a :TAG, that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the w3c validator says about the HTML?
<http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fadvanced_group_search%3Fhl%3Den>
Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors are because the doctype
declaration claims it's XHTML transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder, if all the tags were changed to lower case, how many
fewer errors would show up in w3c validator? Modified GG page:
<http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html>
<http://validator.w3.org/check?uri=http%3A%2F%2Fwww.rawbw.com%2F%7Erem%2FNewPub%2Ftmp-ggadv.html>
Result: Failed validation, 693 errors
Hmmm, this validation error concerns me:
145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire WebPage source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.

Toby A Inkster Guest
Fri Apr 20, 2007 1:46 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

Robert Maas, see http://tinyurl.com/uh3t wrote:
| Quote: | But since I have a whole bunch of current needs for parsing various
kinds of HTML files, and I don't want to have to write a separate
hack for each format, all flakey/bugridden, I finally decided to
<cliche>bite the bullet</cliche> and write a genuine HTML parser.
|
Congratulations. Real parsers are fun.
But wouldn't it have been a bit easier to reuse one of the many existing
parsers? e.g. http://opensource.franz.com/xmlutils/xmlutils-dist/phtml.htm
--
Toby A Inkster BSc (Hons) ARCS
http://tobyinkster.co.uk/
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux
* = I'm getting there!

Tim Bradshaw Guest
Fri Apr 20, 2007 2:57 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On Apr 20, 8:48 am, rem6...@yahoo.com (Robert Maas, see http://tinyurl.com/uh3t)
wrote:
| Quote: | but I saw
an awful lot of bad HTML in the Google-Groups Advanced-Search page,
such as many <b> and <font> that were opened but never closed, and
also lots of <p> <p> <p> that weren't closed either. Even some
unclosed elements of tables.
|
Depending on the version of HTML (on the DTD in use), omitted closing
tags may be perfectly legal. SGML has many options to allow omission
of tags, both closing and opening. This is one of the things that XML
did away with, as it makes it impossible to build a parse tree for the
document unless you know the DTD. So obviously they are not omissible
for any document claiming to be XHTML, I think.
P, for instance, has omissible close tags in HTML 4.01.
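To make the "you need the DTD" point concrete, here is a rough Python sketch of how a parser might put back the </p> tags that HTML 4.01 allows authors to leave out. The token tuples, the function name `add_implied_p_ends` and the `BLOCK_LEVEL` set are assumptions for illustration only; the set is a small hand-picked subset of block-level elements, and a real parser would take that information from the DTD.

```python
# Hypothetical, incomplete subset of HTML 4.01 block-level element names.
BLOCK_LEVEL = {"p", "div", "ul", "ol", "table", "pre", "blockquote",
               "h1", "h2", "h3", "hr", "form"}

def add_implied_p_ends(tokens):
    """Insert the ("ENDTAG", "p") tokens that the author omitted.

    Assumed token shapes: ("TAG", name, attrs), ("TEXT", string),
    ("ENDTAG", name). A P element holds only inline content, so a
    block-level start tag (including another <p>) ends any open P.
    """
    out = []
    p_open = False
    for tok in tokens:
        kind = tok[0]
        name = tok[1].lower() if kind in ("TAG", "ENDTAG") else None
        if kind == "TAG" and name in BLOCK_LEVEL and p_open:
            out.append(("ENDTAG", "p"))  # implied close of the open <p>
            p_open = False
        if kind == "TAG" and name == "p":
            p_open = True
        elif kind == "ENDTAG" and name == "p":
            p_open = False
        out.append(tok)
    if p_open:
        out.append(("ENDTAG", "p"))  # end of input also closes it
    return out
```

This is deliberately simplified: it ignores the case where the end tag of an enclosing element closes the paragraph, and it knows nothing about other elements with optional end tags. The point is only that the table of which tag closes which comes from the DTD, not from the document.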
--tim

John Thingstad Guest
Fri Apr 20, 2007 5:05 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On Fri, 20 Apr 2007 09:48:18 +0200, Robert Maas, see
http://tinyurl.com/uh3t <rem642b@yahoo.com> wrote:
| Quote: |
Anyway, after spending an hour single-stepping it all, and finding
it working perfectly, I had a DOM (Document Object Model)
structure, i.e. the parse tree, for the HTML file, inside CMUCL, so
then of course I prettyprinted it to disk. Have a look if you're
curious:
http://www.rawbw.com/~rem/NewPub/parsed-ggadv.dat.txt
Any place you see a :TAG, that means an opening tag without any
matching close tag. For <br>, and for the various <option> inside a
<select>, that's perfectly correct. But for the other stuff I
mentioned, such as <b> and <font>, that isn't valid HTML and never
was, right? I wonder what the w3c validator says about the HTML?
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.google.com%2Fadvanced_group_search%3Fhl%3Den
Result: Failed validation, 707 errors
No kidding!!! Over seven hundred mistakes in a one-page document!!!
It's amazing my parser actually parses it successfully!!
Actually, to be fair, many of the errors are because the doctype
declaration claims it's XHTML transitional, which requires
lower-case tags, but in fact most tags are upper case. (And my
parser is case-insensitive, and *only* parses, doesn't validate at
all.) I wonder, if all the tags were changed to lower case, how many
fewer errors would show up in w3c validator? Modified GG page:
http://www.rawbw.com/~rem/NewPub/tmp-ggadv.html
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.rawbw.com%2F%7Erem%2FNewPub%2Ftmp-ggadv.html
Result: Failed validation, 693 errors
Hmmm, this validation error concerns me:
145. Error Line 174 column 49: end tag for "br" omitted, but OMITTAG
NO was specified.
My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire WebPage source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.
|
As an ex-employee of Opera I can say that writing a Web browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple,
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
SGML is more difficult to parse. Then there is the fact that many
sites rely on errors in the HTML being handled just like in
Internet Explorer. I can't count the number of times I heard that Opera
was broken, only to find that it was an HTML error on the web site that
Explorer got around.
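The difference between "abort on the first error" and "carry on regardless" is easy to see with two parsers from Python's standard library. This is only an illustration, not how Opera or any browser works: `xml.etree.ElementTree` stands in for a strict XML parser, `html.parser.HTMLParser` for a forgiving one, and the names `SLOPPY`, `strict_parse`, `TagLogger` and `lenient_tags` are made up for the sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical tag soup: <br> and <b> are never closed.
SLOPPY = "<p>one<br><b>two</p>"

def strict_parse(markup):
    """Return True if markup is well-formed XML, False if the parser aborts."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

class TagLogger(HTMLParser):
    """Lenient event parser: records start tags and never rejects input."""

    def __init__(self):
        super().__init__()
        self.started = []

    def handle_starttag(self, tag, attrs):
        self.started.append(tag)

def lenient_tags(markup):
    """Feed markup to the lenient parser and return the start tags it saw."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.started
```

Here `strict_parse(SLOPPY)` reports failure because the XML parser stops at the mismatched </p>, while `lenient_tags(SLOPPY)` happily reports every start tag it met. Note that the lenient parser only emits events; deciding what tree the author meant is the hard part, and it is left entirely to the caller.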
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

John Thingstad Guest
Fri Apr 20, 2007 5:14 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On Fri, 20 Apr 2007 14:05:02 +0200, John Thingstad
<john.thingstad@chello.no> wrote:
| Quote: | My guess is some smartypants at Google thought it'd make good P.R.
to declare the document as XHTML instead of HTML, without realizing
that the document wasn't valid XHTML at all, and the DTD used was
totally inappropriate for this document. Does anybody know, from
eyeballing the entire WebPage source, which DOCTYPE/DTD
declaration would be appropriate to make it almost pass
validation? I bet, with the correct DOCTYPE declaration, there'd
be only fifty or a hundred validation errors, mostly the kind I
mentioned earlier which I discovered when testing my new parser.
|
Oh, I should mention: try the HTML 4.0 Transitional DTD.
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Thomas F. Burdick Guest
Fri Apr 20, 2007 6:03 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On Apr 20, 2:05 pm, "John Thingstad" <john.things...@chello.no> wrote:
| Quote: | As an ex-employee of Opera I can say that writing a Web browser is hard!
It is not so much the parsing of correct HTML as the parsing of incorrect
HTML that poses the problem. Let's face it: it could be simple,
if we all used XHTML and the browser aborted with an error message
when an error occurred. Unfortunately that is hardly the case.
|
This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.

dpapathanasiou Guest
Fri Apr 20, 2007 8:12 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

| Quote: | This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.
|
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.

Pascal Costanza Guest
Fri Apr 20, 2007 8:34 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

dpapathanasiou wrote:
| Quote: | This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.
|
If early browsers had rejected incorrect html, the web would have never
been that successful.
What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).
Stupid error messages from stupid technology is a hindrance, not an enabler.
Pascal
--
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/

Tim Bradshaw Guest
Fri Apr 20, 2007 8:46 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On Apr 20, 4:34 pm, Pascal Costanza <p...@p-cos.net> wrote:
| Quote: |
If early browsers had rejected incorrect html, the web would have never
been that successful.
What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).
Stupid error messages from stupid technology is a hindrance, not an enabler.
|
Well said.

Ben C Guest
Fri Apr 20, 2007 9:10 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On 2007-04-20, Tim Bradshaw <tfb+google@tfeb.org> wrote:
| Quote: | On Apr 20, 4:34 pm, Pascal Costanza <p...@p-cos.net> wrote:
If early browsers had rejected incorrect html, the web would have never
been that successful.
What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).
Stupid error messages from stupid technology is a hindrance, not an enabler.
Well said.
|
But completely wrong.
If the stupid technology tells you at once what the error is you fix it,
and then you are less confused four hours later when something doesn't
display the way you were expecting and you eventually track it down to a
missing closing tag somewhere.
It's not as if the authors _want_ to use incorrectly nested tags.
They're just careless mistakes that we all make and that are trivial to
fix if they're pointed out at once, but that take hours if you have to
work back from their eventual consequences. Fixing them sooner rather
than later helps the author more than anyone else.
I can only see a case for not reporting errors where it is close to
certain that they will not have consequences. In most systems such
errors are classified as "Warnings" and can be turned off.

Jonathan N. Little Guest
Fri Apr 20, 2007 9:18 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

Ben C wrote:
| Quote: | On 2007-04-20, Tim Bradshaw <tfb+google@tfeb.org> wrote:
On Apr 20, 4:34 pm, Pascal Costanza <p...@p-cos.net> wrote:
If early browsers had rejected incorrect html, the web would have never
been that successful.
What's important to keep in mind is that those who create the content
are end-users. It must be easy to create content, and shouldn't require
any specific skills (or not more than absolutely necessary).
Stupid error messages from stupid technology is a hindrance, not an enabler.
Well said.
But completely wrong.
If the stupid technology tells you at once what the error is you fix it,
and then you are less confused four hours later when something doesn't
display the way you were expecting and you eventually track it down to a
missing closing tag somewhere.
|
Agree totally. That is why IE is abominable for debugging. It is "so
good" at second-guessing intent when junk is thrown at it that it fails
miserably when it gets valid markup...
--
Take care,
Jonathan
-------------------
LITTLE WORKS STUDIO
http://www.LittleWorksStudio.com

Thomas A. Russ Guest
Fri Apr 20, 2007 9:49 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

dpapathanasiou <denis.papathanasiou@gmail.com> writes:
| Quote: | This is unfortunate why? Because of the high correlation between
people who have something to say worth reading and those who can write
XML without screwing it up? Face it, HTML is a markup language
historically created directly by humans, which means you *will* get
good content with syntax errors by authors who will not fix it.
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec, and parsing html would not be the nightmare it
is today.
|
On the other hand, it could also be argued that, especially early on,
before web authoring tools existed, such laxity contributed to the
widespread adoption of html. By making the renderer not particularly
picky about the input, it made it easier for authors to hand create the
html pages without the frustration of having things get rejected and not
appear at all.
That provided a nicer development environment (somewhat reminiscent of
Lisp environments), where things would work, even if not every part of
the document were well-formed and correct. The author could then go
back and fix the places that didn't work. That would be true even if
correct rendering were strict, but I do think that laxness in
enforcement of the standards helped the spread of html in the early
days.
--
Thomas A. Russ, USC/Information Sciences Institute

Andy Dingley Guest
Fri Apr 20, 2007 10:34 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

On 20 Apr, 16:12, dpapathanasiou <denis.papathanas...@gmail.com>
wrote:
| Quote: | But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec,
|
We'd also still be using HTML 1.0, as the legacy problems would stifle
any change to the standard.

Paul Wallich Guest
Fri Apr 20, 2007 11:06 pm
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

Andy Dingley wrote:
| Quote: | On 20 Apr, 16:12, dpapathanasiou <denis.papathanas...@gmail.com>
wrote:
But this problem was entirely preventable: if Netscape and early
versions of IE had rejected incorrectly-formatted html, both people
hacking raw markup and web authoring tools would have learned to
comply with the spec,
We'd also still be using HTML 1.0, as the legacy problems would stifle
any change to the standard.
|
Remember that originally no one was supposed to write HTML. It was
supposed to be produced automagically by design tools and transducers
operating on existing formatted documents.
You know, the same way that no one is supposed to write in assembler.
paul

Robert Maas, see http://t Guest
Fri Apr 27, 2007 2:30 am
Post subject: Re: Writing HTML parser wasn't as hard as I thought it'd be

| Quote: | From: "Thomas F. Burdick" <tburd...@gmail.com>
Face it, HTML is a markup language historically created directly
by humans, which means you *will* get good content with syntax
errors by authors who will not fix it.
|
I'm not talking about occasionally crappy HTML in personal Web
pages. I'm talking about bugs in software that generates the same
crappy HTML millions of times per day, every time anyone anywhere
in the world asks Google to perform a search, the same crappy
mistake in *every* copy of the form emitted by Google's search
engine. Also, the toplevel forms to invoke Google's search engines,
which are fetched via bookmarks or links millions of times per day.
A teensy bit of effort to fix those forms and form-generating
software would fix many millions of Web pages delivered per day.

All times are GMT