Re: Comment handling
On Wed, 4 Jun 2003, Tony Lewis wrote: Adding this function to wget seems reasonable to me, but I'd suggest that it be off by default and enabled from the command line with something like --quirky_comments.

why not just have the default wget behavior follow comments explicitly (i've lost track whether wget does that or needs to be amended) /and/ have an option that goes /beyond/ quirky comments and is just --ignore-comments ? :)

/a
Re: Comment handling
Aaron S. Hawley wrote: why not just have the default wget behavior follow comments explicitly (i've lost track whether wget does that or needs to be amended) /and/ have an option that goes /beyond/ quirky comments and is just --ignore-comments ? :)

The issue we've been discussing is what to do about things that almost follow the rules for HTML comments, but don't quite get it right. By default, wget ignores legitimate HTML comments.

Tony
Re: Comment handling
Tony Lewis writes: The issue we've been discussing is what to do about things that almost follow the rules for HTML comments, but don't quite get it right. By default, wget ignores legitimate HTML comments. I think the point of the suggested option was to not even try to identify HTML comments and thus treat them as ordinary text. -Larry Jones I kind of resent the manufacturer's implicit assumption that this would amuse me. -- Calvin
Re: Comment handling
Tony Lewis writes: The issue we've been discussing is what to do about things that almost follow the rules for HTML comments, but don't quite get it right. By default, wget ignores legitimate HTML comments. I think the point of the suggested option was to not even try to identify HTML comments and thus treat them as ordinary text. I think that this may be a solution. Any comments (!) ?? A problem would be if the comment comments out HTML code that the page creator _really_ wants to comment :-) Now, Tony Lewis you are right. As far as I have seen each browser handles comments in its own way. I guess that we could use Mozilla's code, or at least the idea behind it. They have probably seen lots of invalid comments. When I started this discussion I thought it would be a piece of cake to handle correctly comments, but now I have changed my mind. [...]
Re: Comment handling
i suppose my proposal should have been called --disobey-comments (comments are already ignored by default).

i'm just saying what's going to happen when someone posts to this list: My Web Pages have [insert obscure comment format] for comments and Wget is considering them to (not) be comments. Can you change the [insert Wget comment mode] comment mode to (not) recognize my comments?

i think the idea of quirky comments modes are cool, but is it the better solution?

/a

On Wed, 4 Jun 2003, Aaron S. Hawley wrote: why not just have the default wget behavior follow comments explicitly (i've lost track whether wget does that or needs to be amended) /and/ have an option that goes /beyond/ quirky comments and is just --ignore-comments ? :) /a
Re: Comment handling
[...] i suppose my proposal should have been called --disobey-comments (comments are already ignored by default). I suppose that this is a good idea, since it won't be enabled by default and someone could enable it if the page he wants to download is very buggy concerning the comments. i'm just saying what's going to happen when someone posts to this list: My Web Pages have [insert obscure comment format] for comments and Wget is considering them to (not) be comments. Can you change the [insert Wget comment mode] comment mode to (not) recognize my comments? i think the idea of quirky comments modes are cool, but is it the better solution? Do you think that the current algorithm shouldn't be improved? Even, a little bit to handle the common mistakes? [...] P.S. Aaron Hawley sorry about the personal email :-(
Re: Comment handling
On Wed, 4 Jun 2003, George Prekas wrote: snip i think the idea of quirky comments modes are cool, but is it the better solution? Do you think that the current algorithm shouldn't be improved? Even, a little bit to handle the common mistakes? i think Wget's default behavior should be improved where reasonable. i know people had profiled Wget's current behavior and profiled proposals for more reasonable behavior, but i can't find a web archive of those posts. /a
Re: Comment handling
Aaron S. Hawley wrote: i'm just saying what's going to happen when someone posts to this list: My Web Pages have [insert obscure comment format] for comments and Wget is considering them to (not) be comments. Can you change the [insert Wget comment mode] comment mode to (not) recognize my comments? One way to implement quirky comments is to allow the user to add their own comment format to the wgetrc file. Tony
Re: Comment handling
[ ... ] I have downloaded Mozilla's source. It was 30MB! I searched for where Mozilla handles comments and found mozilla/htmlparser/src/nsHTMLTokens.cpp. Inside it there are two functions: ConsumeStrictComment and ConsumeQuirksComment. The first one follows the rules; the second one tries to handle even invalid comments, using an algorithm like this:

1. It looks for <!--.
2. If it finds it, it looks for >.
3. If it finds it, it checks whether the characters just before it are either -- or --!.
4. If they are, it quits with an OK; otherwise it goes back to step 2.
5. If it reaches EOF while doing steps 2 and 3, it uses the first > found during the procedure as the close of the comment.

It will recognise the following comments: !-- my comment -- !-- my comment !-- my comment -- !-- my comment --! !--- arbitrary number of dashes

My thoughts on the subject: Before looking at Mozilla, I made my own algorithm. It is based on the following thought: every comment ends at the >, unless this is inside the comment. The hard part is to decide when it is a comment. Well, it has to start with --. But is this enough? I mean, just look at the last comment above. In my opinion, for a comment to be a comment it must start with -- and the next nonblank character should not be - or >.

That's all for now. Please give me some feedback with your thoughts and tell me if you would like the comment handling mechanism of Wget to change. By the way, who wrote the current one? Maybe he can help us with his experience.

Regards, George Prekas.
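The five steps described for the quirks-mode routine can be sketched in a few lines of Python. This is only an illustration of the steps as summarised in the message, not Mozilla's actual code: the function name consume_quirks_comment is hypothetical, and the sketch assumes that the dashes checked in step 3 may overlap the dashes of the opener (so that <!--> counts as a complete comment), which the summary does not state either way.

```python
def consume_quirks_comment(text, start=0):
    """Return the index just past the comment opening at `start`, or None.

    Hypothetical sketch of the five steps summarised above; it is not
    taken from nsHTMLTokens.cpp.
    """
    # Step 1: the comment must open with "<!--".
    if not text.startswith("<!--", start):
        return None

    first_gt = None          # first '>' seen, kept for the EOF fallback
    pos = start + 4
    while True:
        # Step 2: look for the next '>'.
        gt = text.find(">", pos)
        if gt == -1:
            # Step 5: end of input; fall back to the first '>' seen, if any.
            return None if first_gt is None else first_gt + 1
        if first_gt is None:
            first_gt = gt
        # Step 3: are the characters just before '>' "--" or "--!"?
        # Assumption: everything after "<!" counts, so the opener's own
        # dashes may serve as the closing dashes.
        before = text[start + 2:gt]
        if before.endswith("--") or before.endswith("--!"):
            return gt + 1    # Step 4: well-terminated, stop here.
        pos = gt + 1         # Step 4: otherwise keep looking.
```

With this reading, a > inside the comment is skipped as long as a properly dashed close follows later, and a comment that is never closed with dashes ends at its first >.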
Re: Comment handling
Georg Bauhaus wrote: I don't think so. Actually the rules for SGML comments are somewhat different. Georg, I think we're talking about apples and oranges here. I'm talking about what is legitimate in a comment in an SGML document. I think you're talking about what is legitimate as a comment in an SGML declaration. At any rate, I decided to do some more poking around. I wrote a web page (see http://www.exelana.com/comments.html) with the following variations on comments: !-- Comment -- !-- -- -- ! ! The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of them. I also tried the W3C Markup Validation Service at http://validator.w3.org/ It reported that the last one is not valid: Line 22 column 8: comment started here ! http://validator.w3.org/check?uri=http%3A%2F%2Fwww.exelana.com%2Fcomments.htmldoctype=HTML+2.0charset=us-ascii+%28basic+English%29 The moral of the story: one cannot evaluate an HTML document solely on what any browser (or even all of them) do with it. Tony
Re: Comment handling
So in the example <!-----> there are 5 hyphens, the first two of which can be interpreted as a comment delimiter, as can the second two. But then there is something else following the second two, namely a '-'. So this piece of text is as invalid as <!z>.

What's your opinion, then, about comment handling? I mean, should a comment finish at the > or not?

Maybe the following will work sufficiently well? We have a candidate for a declaration, that is, we have seen <!, and it looks like this is not a DOCTYPE, ENTITY, and so on declaration. If it is, apply the usual processing for these declarations, if any. Now, for supposed comment declarations, change the algorithm slightly, as outlined below. (Does someone have empirical data about what comments typically might look like? I have seen --!> at ends of comments, but that could be rare...)

Samples:

<!-- Please don't use <font>, use stylesheets instead --> (valid, 1) (It doesn't stop at the first > here, if I understand your outline of the algorithm correctly, fine.)

<!-- I should have written a b here, but it is too late --> (valid, 2)

<!-- I should have written --a b here, but it is too late --> (invalid, 2a) (Looks tricky to me at first sight, but the presence of a behind --, which is neither white space nor dashes, could trigger resumption of looking for the second pair of dashes. With a bit of luck, there is one, and neither was a a typo and shouldn't be there at all, nor was forgotten before a.)

<!-- Next: From London - Paris --> (valid, 3)

<!-- Next: From London -- Paris --> (valid, 3a, but probably surprising) (I think detecting that would be mind reading magic in the general case?)

<!-- There are apples, and oranges--and no other fruit. --> (invalid, 4) (Just another illustration of 2a)

<!-- hidden URL: http://mumbo.jumbo.jam/see--here/ --> (invalid, 5, but maybe important) (Yet another illustration of 2a, might be useful to get this right, for example, for extracting URLs from commented JavaScripts.)
<!-- some text -- > (valid, 6) space (separators) before >

The code does the following:

1. Looks for the ! immediately after <, otherwise it aborts.
2. Looks for whitespace or dash.
3. If it finds a dash, it looks for one more, otherwise it aborts.
4. When inside the comment, it looks for a dash.
5. If it finds a dash, it looks for one more, otherwise it aborts.
6. Finally, it looks for a > to quit gracefully.

This is what I have tried, leaving out EOF. Basically the algorithm is quite tolerant and, after <!, either looks for '[[:space:]]*>' or for the next '--[[:space:]]*>'. This will include some very invalid comments, but so what? I thought it might blend well with typical wget use. It doesn't handle <!----->.

1'. Looks for the ! immediately after <, otherwise it aborts.
2'. Skips white space.
3'. Looks for >; if it finds one, quits gracefully.
4'. If it finds a dash, it looks for one more. Otherwise, i.e. if it does not find a first dash, the rest of input is an incomplete comment, or else another type of declaration, which was precluded.
4a'. If it doesn't find a second one, the search is restarted at 4' at where it looked for the second one.
5'. (there is a second dash) Move ahead.
5a'. Either there is '>', quit gracefully.
5b'. Or there is white space, go to 5'.
5c'. Or goto 4'.

So this just disregards the 4k requirement, because that isn't known enough to be useful anywhere outside validating parsers anyway?

If I may suggest code reuse, not intending offence, I think Mozilla does a fairly good job at handling malformed comments, from what I see (in the browser); could that be used as a source of inspiration?

-- Georg
Re: Comment handling
This is what I have tried, leaving out EOF. Basically the algorithm is quite tolerant and, after <!, either looks for '[[:space:]]*>' or for the next '--[[:space:]]*>'. This will include some very invalid comments, but so what? I thought it might blend well with typical wget use. It doesn't handle <!----->.

And, darn, I have forgotten to allow any number of dashes in addition to the white space before >.

-- Georg
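Read as a regular expression, the tolerant pattern with the amendment (extra dashes allowed before the closing >) might look like the sketch below. It is one possible reading, not the code Georg tried: the names TOLERANT_COMMENT and strip_tolerant_comments are hypothetical, and the lookahead that skips declarations starting with a letter or [ is an added assumption standing in for "DOCTYPE, ENTITY and so on were precluded". Because a regular expression backtracks, it may accept more inputs than a single-pass scanner following the numbered steps would.

```python
import re

# Hypothetical pattern: "<!" followed either by optional white space and ">",
# or by the next "--" that is followed only by dashes/white space and ">".
# The negative lookahead is an assumption: it leaves <!DOCTYPE ...>,
# <!ENTITY ...> and <![CDATA[ ...]]> to other code.
TOLERANT_COMMENT = re.compile(
    r"<!(?![A-Za-z\[])(?:\s*>|.*?--[-\s]*>)",
    re.DOTALL,
)


def strip_tolerant_comments(html):
    """Remove everything the tolerant pattern treats as a comment."""
    return TOLERANT_COMMENT.sub("", html)
```

A double dash in the middle of a comment that is followed by ordinary text does not end the comment under this pattern; only a double dash followed by dashes or white space and then > does.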
Re: Comment handling
Georg, I think we're talking about apples and oranges here. I'm talking about what is legitimate in a comment in an SGML document. I think you're talking about what is legitimate as a comment in an SGML declaration. Ah, yes, o.K., I was reacting to valid SGML comments, where legitimate is not defined. Should be different for wget, indeed. I hope my other letter explains. (And, to be nitpicky, an SGML declaration is another defined term which refers to the character sets, capacities, markup minimization, etc, of an SGML parser. :-) At any rate, I decided to do some more poking around. I wrote a web page (see http://www.exelana.com/comments.html) with the following variations on comments: !-- Comment -- !-- -- -- ! ! The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of them. I also tried the W3C Markup Validation Service at http://validator.w3.org/ It reported that the last one is not valid: Line 22 column 8: comment started here ! Which, incidentally, is a confusing error message, as this comment is, in itself, correct. (which you can see removing the middle dashes two lines above it. It's the 4k issue that George Prekas has written about.) http://validator.w3.org/check?uri=http%3A%2F%2Fhome.knuut.de%2Fbauhaus%2Fh2.htm lcharset=utf-8+%28Unicode%2C+worldwide%29doctype=%28detect+automatically%29 -- Georg
Re: Comment handling
After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am convinced that <!-----> is a valid SGML (and therefore HTML) comment. Therefore, I believe it is a bug if wget does not recognize such a comment.

I don't think so. Actually the rules for SGML comments are somewhat different. First, a comment need not be part of a comment declaration, but may as well appear in markup declarations, e.g. in the role of parameter separators. Example (from HTML 4 strict):

<!ATTLIST BR %coreattrs; -- id, class, style, title -- >

There is at least one comment here, namely between the first visible comment delimiter (-- before id) and the second -- at the end of the second line. (The coreattrs entity itself has some more comments in its value's text.)

In addition, a declaration may contain only comments, and nothing else. This is what is usually referred to as a comment in web pages' HTML text. Example of a declaration that contains nothing but comments:

<!-- a tree -- -- on mars? -->

This comment declaration has two comments and a few separators in it. The comment declaration rules are numbered 91 and 92 in the SGML standard.

A comment declaration [91] is a markup declaration open (<!), optionally followed by a comment (see below) which might be followed by any number of separator-or-comment; the declaration is terminated by markup declaration close (>).

comment declaration = mdo, (comment, (s | comment)*)?, mdc

A comment [92] is a comment delimiter (--), followed by any number of SGML characters, followed by another comment delimiter (--).

comment = com, SGML character*, com

(Since the subsentence "followed by..." in [91] is optional (?), an empty comment declaration will be <! immediately followed by >, i.e. <!> is a comment, too.)

So in the example <!-----> there are 5 hyphens, the first two of which can be interpreted as a comment delimiter, as can the second two. But then there is something else following the second two, namely a '-'. So this piece of text is as invalid as <!z>.

Note: I haven't studied the source to confirm how it handles such a string.

Neither have I.

Georg
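The two productions quoted from the SGML standard are small enough to check by hand, and the following Python sketch does just that. It is a minimal illustration under stated assumptions: the function name is_comment_declaration is hypothetical, the separator s is taken to be space, tab, carriage return or newline, and the input is assumed to be exactly one candidate declaration from <! to its final >.

```python
def is_comment_declaration(text):
    """Check `text` against: mdo, (comment, (s | comment)*)?, mdc

    Hypothetical helper; `text` must be one whole candidate declaration.
    """
    if not text.startswith("<!") or not text.endswith(">"):
        return False
    body = text[2:-1]          # what lies between mdo "<!" and mdc ">"
    if body == "":
        return True            # "<!>" is the empty comment declaration
    if not body.startswith("--"):
        return False           # the first item, if any, must be a comment

    pos = 0
    while pos < len(body):
        if body.startswith("--", pos):
            # comment = com, SGML character*, com
            close = body.find("--", pos + 2)
            if close == -1:
                return False   # opening "--" without a closing "--"
            pos = close + 2
        elif body[pos] in " \t\r\n":
            pos += 1           # separator between comments
        else:
            return False       # anything else is not allowed here
    return True
```

Under these rules a run of four or eight hyphens between <! and > is accepted while five or six are not, which is the multiple-of-four ("4k") behaviour mentioned elsewhere in the thread.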
Re: Comment handling
- Original Message - From: Tony Lewis [EMAIL PROTECTED] To: George Prekas [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Saturday, May 31, 2003 8:47 AM Subject: Re: Comment handling

George Prekas wrote: I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link. I have found a solution to the above situation and have properly patched html-parse.c and I would like some info on how I can give you the patch.

The HTML code is legitimate, but it only contains one link. The following three lines constitute a single comment:

<!-->
<a href=test2.html>test2.html</a>
<!-->

A comment begins at <!-- and ends at -->. The trailing > on the first of these lines and the leading <! on the third of these lines are part of the comment. That is, the comment text is: > <a href=test2.html>test2.html</a> <!

At any rate, one should not expect predictable behavior for broken HTML. What should wget do with the following?

You are probably right. I pointed this out because I have seen pages that use <!-- with lots of dashes as a separator, and although Internet Explorer shows the page, wget cannot download it correctly. What do you think about finishing the comment at the >?

<a href=test1.html>test1.html
<!-->
</a>
<!-->

In one version, it might choose to follow the link to test1.html and in another version it might not.

Tony
Re: Comment handling
George Prekas wrote: You are probably right. I pointed this out because I have seen pages that use <!-- with lots of dashes as a separator, and although Internet Explorer shows the page, wget cannot download it correctly. What do you think about finishing the comment at the >?

After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am convinced that <!-----> is a valid SGML (and therefore HTML) comment. Therefore, I believe it is a bug if wget does not recognize such a comment.

Note: I haven't studied the source to confirm how it handles such a string.

Tony
Comment handling
I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link. I have found a solution to the above situation and have properly patched html-parse.c and I would like some info on how I can give you the patch.

Regards, George Prekas

P.S. Sorry about this message, but it appears that the first one did not show up in the list.
Re: Comment handling
On Fri, 30 May 2003, George Prekas wrote: I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link.

is it really an invalid comment? i didn't see the second link when viewing the file in `lynx` or `links`.

I have found a solution to the above situation and have properly patched html-parse.c and I would like some info on how can I give you the patch. Regards, George Prekas P.S. Sorry about this message, but it appears that the first one did not show up in the list.

i think it showed up twice (or maybe i'm getting duplicates). yeah the web archives suck. they should be put on mail.gnu.org
Re: Comment handling
George Prekas wrote: I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link. I have found a solution to the above situation and have properly patched html-parse.c and I would like some info on how can I give you the patch.

The HTML code is legitimate, but it only contains one link. The following three lines constitute a single comment:

<!-->
<a href=test2.html>test2.html</a>
<!-->

A comment begins at <!-- and ends at -->. The trailing > on the first of these lines and the leading <! on the third of these lines are part of the comment. That is, the comment text is: > <a href=test2.html>test2.html</a> <!

At any rate, one should not expect predictable behavior for broken HTML. What should wget do with the following?

<a href=test1.html>test1.html
<!-->
</a>
<!-->

In one version, it might choose to follow the link to test1.html and in another version it might not.

Tony
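Tony's reading (a comment runs from <!-- to the next -->) can be checked mechanically. The Python sketch below is only an illustration and has nothing to do with html-parse.c: TEST_HTML is a reconstruction of the test file with its angle brackets put back, and strip_comments and hrefs are hypothetical helper names.

```python
import re

# Reconstruction of test.html from the bug report (an assumption: the
# archive copy of the message has lost its angle brackets).
TEST_HTML = """<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>
"""


def strip_comments(html):
    """Drop every span that starts at "<!--" and ends at the next "-->"."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def hrefs(html):
    """Collect unquoted href values of <a> tags, as used in the test file."""
    return re.findall(r"<a\s+href=([^\s>]+)", html, flags=re.IGNORECASE)


comment_text = re.search(r"<!--(.*?)-->", TEST_HTML, flags=re.DOTALL).group(1)
links_seen = hrefs(strip_comments(TEST_HTML))
```

Here comment_text comes out as the > from the first <!-->, the whole test2.html anchor, and the <! from the second <!-->, so links_seen holds only test1.html. A parser that applies this rule never sees the second link.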
Comment handling
I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href=test1.html>test1.html</a>
<!-->
<a href=test2.html>test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link. I have found a solution to the above situation and have properly patched html-parse.c and I would like some info on how I can give you the patch.

Regards, George Prekas