Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
On Wed, 4 Jun 2003, Tony Lewis wrote:

 Adding this function to wget seems reasonable to me, but I'd suggest that it
 be off by default and enabled from the command line with something
 like --quirky_comments.

why not just have the default wget behavior follow comments explicitly
(i've lost track whether wget does that or needs to be amended) /and/
have an option that goes /beyond/ quirky comments and is just
--ignore-comments ? :)

/a


Re: Comment handling

2003-06-05 Thread Tony Lewis
Aaron S. Hawley wrote:

 why not just have the default wget behavior follow comments explicitly
 (i've lost track whether wget does that or needs to be amended) /and/
 have an option that goes /beyond/ quirky comments and is just
 --ignore-comments ? :)

The issue we've been discussing is what to do about things that almost
follow the rules for HTML comments, but don't quite get it right. By
default, wget ignores legitimate HTML comments.

Tony



Re: Comment handling

2003-06-05 Thread Larry Jones
Tony Lewis writes:
 
 The issue we've been discussing is what to do about things that almost
 follow the rules for HTML comments, but don't quite get it right. By
 default, wget ignores legitimate HTML comments.

I think the point of the suggested option was to not even try to
identify HTML comments and thus treat them as ordinary text.

-Larry Jones

I kind of resent the manufacturer's implicit assumption
that this would amuse me. -- Calvin


Re: Comment handling

2003-06-05 Thread George Prekas

 Tony Lewis writes:
 
  The issue we've been discussing is what to do about things that almost
  follow the rules for HTML comments, but don't quite get it right. By
  default, wget ignores legitimate HTML comments.

 I think the point of the suggested option was to not even try to
 identify HTML comments and thus treat them as ordinary text.

I think that this may be a solution. Any comments (!) ?? A problem would be
if the comment comments out HTML code that the page creator _really_ wants
to comment out :-)
Now, Tony Lewis, you are right. As far as I have seen, each browser handles
comments in its own way.
I guess that we could use Mozilla's code, or at least the idea behind it.
They have probably seen lots of invalid comments.
When I started this discussion I thought it would be a piece of cake to
handle comments correctly, but now I have changed my mind.

[...]



Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
i suppose my proposal should have been called --disobey-comments (comments
are already ignored by default).

i'm just saying what's going to happen when someone posts to this list:
My Web Pages have [insert obscure comment format] for comments and Wget
is considering them to (not) be comments.  Can you change the [insert
Wget comment mode] comment mode to (not) recognize my comments?

i think the idea of quirky comment modes is cool, but is it the better
solution?
/a

On Wed, 4 Jun 2003, Aaron S. Hawley wrote:

 why not just have the default wget behavior follow comments explicitly
 (i've lost track whether wget does that or needs to be amended) /and/
 have an option that goes /beyond/ quirky comments and is just
 --ignore-comments ? :)

 /a


Re: Comment handling

2003-06-05 Thread George Prekas
[...]

 i suppose my proposal should have been called --disobey-comments (comments
 are already ignored by default).

I suppose that this is a good idea, since it won't be enabled by default and
someone could enable it if the page he wants to download is very buggy
concerning the comments.


 i'm just saying what's going to happen when someone posts to this list:
 My Web Pages have [insert obscure comment format] for comments and Wget
 is considering them to (not) be comments.  Can you change the [insert
 Wget comment mode] comment mode to (not) recognize my comments?

 i think the idea of quirky comments modes are cool, but is it the better
 solution?

Do you think that the current algorithm shouldn't be improved, even a
little bit, to handle the common mistakes?

[...]

P.S. Aaron Hawley sorry about the personal email :-(



Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
On Wed, 4 Jun 2003, George Prekas wrote:

 snip

  i think the idea of quirky comments modes are cool, but is it the better
  solution?

 Do you think that the current algorithm shouldn't be improved? Even, a
 little bit to handle the common mistakes?

i think Wget's default behavior should be improved where reasonable.  i
know people had profiled Wget's current behavior and profiled proposals
for more reasonable behavior, but i can't find a web archive of those
posts.

/a


Re: Comment handling

2003-06-05 Thread Tony Lewis
Aaron S. Hawley wrote:

 i'm just saying what's going to happen when someone posts to this list:
 My Web Pages have [insert obscure comment format] for comments and Wget
 is considering them to (not) be comments.  Can you change the [insert
 Wget comment mode] comment mode to (not) recognize my comments?

One way to implement quirky comments is to allow the user to add their own
comment format to the wgetrc file.

Tony



Re: Comment handling

2003-06-04 Thread George Prekas
[ ... ]

I have downloaded Mozilla's source. It was 30MB! Now, I searched where
Mozilla handles comments and found mozilla/htmlparser/src/nsHTMLTokens.cpp.
Inside it, there are two functions: ConsumeStrictComment and
ConsumeQuirksComment. The first one follows the rules, the second one tries
to handle even invalid comments and it uses an algorithm like this:
1. Looks for <!--
2. If it finds it, it looks for >
3. If it finds it, it checks whether the > is preceded by either -- or --!
4. If it is, it quits with an OK, otherwise back to step 2.
5. Now, if it finds EOF while doing steps 2 and 3, it uses as the close tag of
the comment the first > found in the procedure.

It will recognise the following comments:
<!-- my comment -->
<!-- my comment >
<!-- my comment > -->
<!-- my comment > --!>
<!---> arbitrary number of dashes
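
[Editorial sketch] The five steps above can be written out in Python roughly as follows. This only illustrates the steps as described in this message; it is not Mozilla's ConsumeQuirksComment code. The function name find_quirks_comment_end and its return convention are made up for the sketch, and it assumes that the dashes checked in step 3 may overlap the dashes of the opener, which the description does not spell out.

```python
def find_quirks_comment_end(html, start=0):
    """Return the index just past the first comment found at or after start.

    Returns None when there is no "<!--" opener, or when the input ends
    without any ">" after the opener. Hypothetical helper for illustration.
    """
    # Step 1: look for the comment opener.
    open_at = html.find("<!--", start)
    if open_at == -1:
        return None

    first_gt = None          # remembered for the end-of-input fallback (step 5)
    pos = open_at + 4
    while True:
        # Step 2: look for the next ">".
        gt = html.find(">", pos)
        if gt == -1:
            # Step 5: input ended; fall back to the first ">" seen, if any.
            return None if first_gt is None else first_gt + 1
        if first_gt is None:
            first_gt = gt
        # Step 3: is the ">" preceded by "--" or "--!"?
        # Assumption: the dashes of the opener itself may count here.
        if html.endswith(("--", "--!"), open_at + 2, gt):
            # Step 4: the comment is closed; quit with an OK.
            return gt + 1
        # Step 4, otherwise: back to step 2, searching after this ">".
        pos = gt + 1
```

With this reading, a ">" inside the comment is skipped as long as a later "-->" or "--!>" exists, and a comment that is never closed ends at its first ">".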

My thoughts on the subject:
Before looking at Mozilla, I made my own algorithm. It is based on the
following thought: every comment ends at the >, unless this > is inside the
comment. The hard part is to decide when it is a comment. Well, it has to
start with --. But is this enough? I mean, just look at the last comment
above. In my opinion, for a comment to be a comment it must start with -- and
the next nonblank character should not be - or >.

That's all for now. Please give me some feedback with your thoughts and tell me
if you would like the comment handling mechanism of Wget to change. By the
way, who wrote the current one? Maybe he can help us with his
experience.

Regards,
George Prekas.



Re: Comment handling

2003-06-03 Thread Tony Lewis
Georg Bauhaus wrote:


 I don't think so. Actually the rules for SGML comments are
 somewhat different.

Georg, I think we're talking about apples and oranges here. I'm talking
about what is legitimate in a comment in an SGML document. I think you're
talking about what is legitimate as a comment in an SGML declaration.

At any rate, I decided to do some more poking around. I wrote a web page
(see http://www.exelana.com/comments.html) with the following variations on
comments:
!-- Comment --
!-- -- --
!
!

The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of
them. I also tried the W3C Markup Validation Service at
http://validator.w3.org/

It reported that the last one is not valid:

Line 22 column 8: comment started here
!

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.exelana.com%2Fcomments.html&doctype=HTML+2.0&charset=us-ascii+%28basic+English%29

The moral of the story: one cannot evaluate an HTML document solely on what
any browser (or even all of them) do with it.

Tony



Re: Comment handling

2003-06-03 Thread Georg Bauhaus
  So in the example !- there are 5 hyphens, the first two
  of which can be interpreted as a comment delimiter, as can
  the second two. But then there is something else following the
  second two, namely a '-'. So this piece of text is as invalid
  as !z.
 
 What's your opinion, then, about comment handling? I mean, should a comment
 finish at the > or not?

Maybe the following will work sufficiently well?
We have a candidate for a declaration, that is, we have seen <!,
and it looks like this is not a DOCTYPE, ENTITY, and so on
declaration. If it is, apply the usual processing for these declarations,
if any. Now, for supposed comment declarations, change the algorithm
slightly, as outlined below.

(Does someone have empirical data about what comments typically
might look like? I have seen --!> at ends of comments, but that could
be rare...)

Samples:
<!-- Please don't use <font>, use stylesheets instead --> (valid, 1)

(It doesn't stop at the first > here, if I understand your outline
of the algorithm correctly, fine.)

<!-- I should have written a > b here, but it is too late --> (valid, 2)
<!-- I should have written --a > b here, but it is too late --> (invalid, 2a)

(Looks tricky to me at first sight, but the presence of a behind --, which
is neither white space nor dashes, could trigger resumption of looking for the
second pair of dashes. With a bit of luck, there is one, and > neither was
a typo and shouldn't be there at all, nor was < forgotten before a.)

<!-- Next: From London - Paris --> (valid, 3)
<!-- Next: From London -- Paris --> (valid, 3a, but probably surprising)

(I think detecting that would be mind reading magic in the general case?)

<!-- There are apples, and oranges--and no other fruit. --> (invalid, 4)

(Just another illustration of 2a)

<!-- hidden URL: http://mumbo.jumbo.jam/see--here/ --> (invalid, 5, but
maybe important)

(Yet another illustration of 2a, might be useful to get this right,
for example, for extracting URLs from commented JavaScripts.)

<!-- some text --  > (valid, 6)

space (separators) before >

 The code does the following:
 1. Looks for the ! immediately after <, otherwise it aborts.
 2. Looks for whitespace or dash.
 3. If it finds a dash, it looks for one more, otherwise it aborts.
 4. When inside the comment, it looks for a dash.
 5. If it finds a dash, it looks for one more, otherwise it aborts.
 6. Finally, it looks for a > to quit gracefully.

This is what I have tried, leaving out EOF. Basically the algorithm is quite
tolerant and, after <!, either looks for '[[:space:]]*>' or for the next
--[[:space:]]*>. This will include some very invalid comments, but so what? I
thought it might blend well with typical wget use. It doesn't handle <!----->.

1'. Looks for the ! immediately after <, otherwise it aborts.
2'. Skips white space.
3'. Looks for >, if it finds one, quits gracefully.
4'. If it finds a dash, it looks for one more. Otherwise, i.e.
if it does not find a first dash, the rest of input is an incomplete
comment, or else another type of declaration, which was precluded.
4a'. If it doesn't find a second one, the search is restarted at 4'
from where it looked for the second one.
5'. (there is a second dash) Move ahead.
5a'. Either there is '>', quit gracefully.
5b'. Or there is white space, go to 5'.
5c'. Or goto 4'.
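
[Editorial sketch] One possible reading of steps 1' to 5c' in Python. This is an interpretation, not code from the thread or from wget: the name tolerant_comment_end is invented, step 4' is read as "scan forward to the next dash", and the caller is assumed to have already ruled out DOCTYPE, ENTITY and similar declarations. The later correction about allowing extra dashes before the closing > is deliberately not included.

```python
def tolerant_comment_end(text, start=0):
    """Return the index just past the declaration that starts at text[start],
    or None if the steps cannot find an end. Hypothetical helper."""
    # Step 1': "<!" must be right here, otherwise abort.
    if not text.startswith("<!", start):
        return None
    n = len(text)
    i = start + 2
    # Step 2': skip white space.
    while i < n and text[i].isspace():
        i += 1
    # Step 3': an immediate ">" ends the (empty) declaration.
    if i < n and text[i] == ">":
        return i + 1
    while True:
        # Step 4': look for a dash (assumed: scanning forward from here).
        dash = text.find("-", i)
        if dash == -1:
            return None              # incomplete comment
        # Step 4a': no second dash, restart where the second one was expected.
        if dash + 1 >= n or text[dash + 1] != "-":
            i = dash + 1
            continue
        # Step 5': there is a second dash, move ahead.
        i = dash + 2
        # Step 5b': skip white space after the pair of dashes.
        while i < n and text[i].isspace():
            i += 1
        # Step 5a': a ">" here ends the comment.
        if i < n and text[i] == ">":
            return i + 1
        # Step 5c': anything else, go back to step 4'.
```

Under this reading a doubled dash followed by ordinary text (sample 2a) simply resumes the search, and a run of five dashes before > is not accepted, which matches the limitation stated above.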


So this just disregards the 4k requirement, because that isn't known
enough to be useful anywhere outside validating parsers anyway?

If I may suggest code reuse, not intending offence, I think Mozilla
does a fairly good job at handling malformed comments, from what I
see (in the browser); could that be used as a source of inspiration?


-- Georg





Re: Comment handling

2003-06-03 Thread Georg Bauhaus
 
 This is what I have tried, leaving out EOF. Basically the algorithm is quite
 tolerant and, after <!, either looks for '[[:space:]]*>' or for the next
 --[[:space:]]*>. This will include some very invalid comments, but so what? I
 thought it might blend well with typical wget use. It doesn't handle <!----->.

And, darn, I have forgotten to allow any number of dashes in addition
to the white space before >.



-- Georg




Re: Comment handling

2003-06-03 Thread Georg Bauhaus
 Georg, I think we're talking about apples and oranges here. I'm talking
 about what is legitimate in a comment in an SGML document. I think you're
 talking about what is legitimate as a comment in an SGML declaration.

Ah, yes, o.K., I was reacting to valid SGML comments, where legitimate
is not defined. Should be different for wget, indeed. I hope my other
letter explains.
(And, to be nitpicky, an SGML declaration is another defined term
which refers to the character sets, capacities, markup minimization,
etc, of an SGML parser. :-)

 At any rate, I decided to do some more poking around. I wrote a web page
 (see http://www.exelana.com/comments.html) with the following variations on
 comments:
 !-- Comment --
 !-- -- --
 !
 !
 
 The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of
 them. I also tried the W3C Markup Validation Service at
 http://validator.w3.org/
 
 It reported that the last one is not valid:
 
 Line 22 column 8: comment started here
 !

Which, incidentally, is a confusing error message, as this comment
is, in itself, correct. (which you can see removing the middle dashes
two lines above it. It's the 4k issue that George Prekas has written
about.)

http://validator.w3.org/check?uri=http%3A%2F%2Fhome.knuut.de%2Fbauhaus%2Fh2.html&charset=utf-8+%28Unicode%2C+worldwide%29&doctype=%28detect+automatically%29


-- Georg




Re: Comment handling

2003-06-02 Thread Georg Bauhaus
 After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am
 convinced that <!-----> is a valid SGML (and therefore HTML) comment.
 Therefore, I believe it is a bug if wget does not recognize such a comment.

I don't think so. Actually the rules for SGML comments are
somewhat different. First, a comment need not be part of
a comment declaration, but may as well appear in markup
declarations, e.g. in the role of parameter separators.

Example (from HTML 4 strict):

<!ATTLIST BR
  %coreattrs;  -- id, class, style, title --
  >

There is at least one comment here, namely between the first
visible comment delimiter (-- before id) and the second -- at
the end of the second line. (The coreattrs entity itself has
some more comments in its value's text.)

In addition, a declaration may contain only comments, and nothing
else. This is what is usually referred to as comment in web pages'
HTML text.

Example of a declaration that contains nothing but comments:

<!-- a tree --
  -- on mars? --
  >

This comment declaration has two comments and a few separators
in it.

The comment declaration rules are numbered 91, and 92 in the SGML
standard.

A comment declaration [91] is a markup declaration open (<!), optionally
followed by a comment (see below) which might be followed by any number
of separator-or-comment; the declaration is terminated by
markup declaration close (>).

  comment declaration = mdo, (comment, (s | comment)*)?, mdc

A comment [92] is a comment delimiter (--),
followed by any number of SGML characters, followed
by another comment delimiter (--).

  comment = com, SGML character*, com

(Since the subsentence "followed by..." in [91] is optional (?),
an empty comment declaration will be <! immediately followed
by >, i.e. <!> is a comment, too.)

So in the example <!-----> there are 5 hyphens, the first two
of which can be interpreted as a comment delimiter, as can
the second two. But then there is something else following the
second two, namely a '-'. So this piece of text is as invalid
as !z.
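
[Editorial sketch] Rules [91] and [92] as quoted above can be checked mechanically. The Python below is a minimal sketch of that grammar only; the function name strict_comment_declaration_end is invented, Python's str.isspace() stands in for the SGML separator class s, and nothing here is taken from wget or from an SGML parser.

```python
def strict_comment_declaration_end(text, start=0):
    """Return the index just past a comment declaration at text[start],
    or None if the text does not match rules [91]/[92]. Hypothetical helper."""
    # mdo: the declaration must open with "<!".
    if not text.startswith("<!", start):
        return None
    n = len(text)
    i = start + 2
    # Empty comment declaration: "<!" immediately followed by ">".
    if text.startswith(">", i):
        return i + 1
    # Otherwise the first item must be a comment (com ... com).
    if not text.startswith("--", i):
        return None
    while True:
        if text.startswith("--", i):
            # comment = com, SGML character*, com
            close = text.find("--", i + 2)
            if close == -1:
                return None          # comment never closed
            i = close + 2
        elif i < n and text[i].isspace():
            i += 1                   # separator between comments
        elif i < n and text[i] == ">":
            return i + 1             # mdc: declaration close
        else:
            return None              # stray text outside any comment
```

For the five-hyphen case the first four hyphens form an empty comment and the fifth is stray text, so the function returns None, in line with the argument above.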


 Note: I haven't studied the source to confirm how it handles such a string.

Neither have I.

Georg




Re: Comment handling

2003-06-01 Thread George Prekas

- Original Message -
From: Tony Lewis [EMAIL PROTECTED]
To: George Prekas [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, May 31, 2003 8:47 AM
Subject: Re: Comment handling


 George Prekas wrote:


  I have found a bug in Wget version 1.8.2 concerning comment handling (
  <!-- comment --> ). Take a look at the following illegal HTML code:
  <HTML>
  <BODY>
  <a href=test1.html>test1.html</a>
  <!-->
  <a href=test2.html>test2.html</a>
  <!-->
  </BODY>
  </HTML>

  Now, save the above snippet as test.html and try wget -Fi test.html. You
  will notice that it doesn't recognise the second link. I have found a
  solution to the above situation and have properly patched html-parse.c
  and I would like some info on how can I give you the patch.

 The HTML code is legitimate, but it only contains one link. The following
 three lines constitute a single comment:

 <!-->
 <a href=test2.html>test2.html</a>
 <!-->

 A comment begins at <!-- and ends at -->. The trailing > on the first
 of these lines and the leading <! on the third of these lines are part of
 the comment. That is, the comment text is:

 >
 <a href=test2.html>test2.html</a>
 <!

 At any rate, one should not expect predictable behavior for broken HTML.
 What should wget do with the following?

You are probably right. I pointed this out because I have seen pages that
use as a separator <!-- with lots of dashes, and although
Internet Explorer shows the page, wget cannot download it correctly. What
do you think about finishing the comment at the >?


 a href=test1.htmltest1.html
 !--
 /a
 !--

 In one version, it might choose to follow the link to test1.html and in
 another version it might not.

 Tony





Re: Comment handling

2003-06-01 Thread Tony Lewis
George Prekas wrote:

 You are probably right. I have pointed this because I have seen pages that
 use as a separator !-- with lots of dashes and althrough
 Internet Explorer shows the page, wget can not download it correctly. What
 do think about finishing the comment at the ?

After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am
convinced that <!-----> is a valid SGML (and therefore HTML) comment.
Therefore, I believe it is a bug if wget does not recognize such a comment.

Note: I haven't studied the source to confirm how it handles such a string.

Tony



Comment handling

2003-05-31 Thread George Prekas
I have found a bug in Wget version 1.8.2 concerning comment handling ( <!--
comment --> ). Take a look at the following illegal HTML code:
<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You
will notice that it doesn't recognise the second link. I have found a
solution to the above situation and have properly patched html-parse.c and I
would like some info on how can I give you the patch.

Regards,
George Prekas


P.S. Sorry about this message, but it appears that the first one did not show up in 
the list.


Re: Comment handling

2003-05-31 Thread Aaron S. Hawley
On Fri, 30 May 2003, George Prekas wrote:

 I have found a bug in Wget version 1.8.2 concerning comment handling ( <!--
 comment --> ). Take a look at the following illegal HTML code:
 <HTML>
 <BODY>
 <a href=test1.html>test1.html</a>
 <!-->
 <a href=test2.html>test2.html</a>
 <!-->
 </BODY>
 </HTML>

 Now, save the above snippet as test.html and try wget -Fi test.html. You
 will notice that it doesn't recognise the second link.

is it really an invalid comment?  i didn't see the second link when
viewing the file in `lynx` or `links`.

 I have found a solution to the above situation and have properly patched
 html-parse.c and I would like some info on how can I give you the patch.

 Regards,
 George Prekas

 P.S. Sorry about this message, but it appears that the first one did not
 show up in the list.

i think it showed up twice (or maybe i'm getting duplicates).  yeah the
web archives suck.  they should be put on mail.gnu.org


Re: Comment handling

2003-05-31 Thread Tony Lewis
George Prekas wrote:


 I have found a bug in Wget version 1.8.2 concerning comment handling (
 <!-- comment --> ). Take a look at the following illegal HTML code:
 <HTML>
 <BODY>
 <a href=test1.html>test1.html</a>
 <!-->
 <a href=test2.html>test2.html</a>
 <!-->
 </BODY>
 </HTML>

 Now, save the above snippet as test.html and try wget -Fi test.html. You
 will notice that it doesn't recognise the second link. I have found a
 solution to the above situation and have properly patched html-parse.c and
 I would like some info on how can I give you the patch.

The HTML code is legitimate, but it only contains one link. The following
three lines constitute a single comment:

<!-->
<a href=test2.html>test2.html</a>
<!-->

A comment begins at <!-- and ends at -->. The trailing > on the first
of these lines and the leading <! on the third of these lines are part of
the comment. That is, the comment text is:

>
<a href=test2.html>test2.html</a>
<!
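
[Editorial sketch] The reading above can be tried out with a few lines of Python. This assumes the two marker lines in the test file read <!--> (which is what the remark about the trailing > and the leading <! implies) and that the href values are unquoted. The regular expressions and the name links_outside_comments are invented for illustration and have nothing to do with wget's html-parse.c.

```python
import re

# A comment runs from "<!--" to the next "-->", across lines if needed.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Very rough anchor matcher, good enough for unquoted href values.
HREF = re.compile(r"<a\s+href=([^\s>]+)", re.IGNORECASE)


def links_outside_comments(html):
    """Remove every comment span, then collect href values from the rest."""
    return HREF.findall(COMMENT.sub("", html))


TEST_HTML = """<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>
"""

print(links_outside_comments(TEST_HTML))  # prints ['test1.html']
```

The opening <!-- on the first marker line pairs with the --> at the end of the second marker line, so the test2.html anchor sits inside the comment and only test1.html is reported.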

At any rate, one should not expect predictable behavior for broken HTML.
What should wget do with the following?

<a href=test1.html>test1.html
<!-->
</a>
<!-->

In one version, it might choose to follow the link to test1.html and in
another version it might not.

Tony



Comment handling

2003-05-30 Thread George Prekas
I have found a bug in Wget version 1.8.2 concerning comment handling ( <!--
comment --> ). Take a look at the following illegal HTML code:
<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You
will notice that it doesn't recognise the second link. I have found a
solution to the above situation and have properly patched html-parse.c and I
would like some info on how can I give you the patch.

Regards,
George Prekas



