Re: problem with LF/CR etc.

Hrvoje Niksic Wed, 26 Nov 2003 10:08:03 -0800

Peter GILMAN <[EMAIL PROTECTED]> writes:

> first of all, thanks for taking the time and energy to consider this
> issue.  i was only hoping to pick up a pointer or two; i never
> realized this could turn out to be such a big deal!


Neither did we.  :-)

> 1) Jens' observation that the user will think wget is broken is
> correct.  the immediate reaction is, "it works in my browser; why
> does wget say '404'?"
[...]
> (and, after all, what is the purpose of wget?  is it an html
> verifier, or is it a Web-GET tool?  i submit that evaluation of the
> "correctness" of web code is outside the purview of wget.)

It's true that the point of Wget is not to evaluate correctness of web
pages.  But its purpose is not to handle every piece of badly written
HTML on the web, either!  Just as badly written pages work in some
browsers but not in others, some pages that work in IE will not work
in Wget.  This is nothing new.

As I said, Wget tries to handle badly written code if the mistakes are
either easy to handle or frequent enough to hamper the usefulness of
the program.  Strict comments fall into the second category, and these
embedded newlines fall into the first one.

> conclusion: if it doesn't break anything, and if it makes wget more
> useful, i can think of no reason this capability shouldn't be added.

Agreed.  This patch should fix your case.  It applies to the latest
CVS sources, but it can be easily retrofitted to earlier versions as
well.


2003-11-26  Hrvoje Niksic  <[EMAIL PROTECTED]>

        * html-parse.c (convert_and_copy): Remove embedded newlines when
        AP_TRIM_BLANKS is specified.

Index: src/html-parse.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-parse.c,v
retrieving revision 1.21
diff -u -r1.21 html-parse.c
--- src/html-parse.c    2003/11/02 16:48:40     1.21
+++ src/html-parse.c    2003/11/26 16:28:29
@@ -360,17 +360,16 @@
      the ASCII range when copying the string.
 
    * AP_TRIM_BLANKS -- ignore blanks at the beginning and at the end
-     of text.  */
+     of text, as well as embedded newlines.  */
 
 static void
 convert_and_copy (struct pool *pool, const char *beg, const char *end, int flags)
 {
   int old_tail = pool->tail;
-  int size;
 
-  /* First, skip blanks if required.  We must do this before entities
-     are processed, so that blanks can still be inserted as, for
-     instance, `&#32;'.  */
+  /* Skip blanks if required.  We must do this before entities are
+     processed, so that blanks can still be inserted as, for instance,
+     `&#32;'.  */
   if (flags & AP_TRIM_BLANKS)
     {
       while (beg < end && ISSPACE (*beg))
@@ -378,7 +377,6 @@
       while (end > beg && ISSPACE (end[-1]))
        --end;
     }
-  size = end - beg;
 
   if (flags & AP_DECODE_ENTITIES)
     {
@@ -391,15 +389,14 @@
         never lengthen it.  */
       const char *from = beg;
       char *to;
+      int squash_newlines = flags & AP_TRIM_BLANKS;
 
       POOL_GROW (pool, end - beg);
       to = pool->contents + pool->tail;
 
       while (from < end)
        {
-         if (*from != '&')
-           *to++ = *from++;
-         else
+         if (*from == '&')
            {
              int entity = decode_entity (&from, end);
              if (entity != -1)
@@ -407,6 +404,10 @@
              else
                *to++ = *from++;
            }
+         else if ((*from == '\n' || *from == '\r') && squash_newlines)
+           ++from;
+         else
+           *to++ = *from++;
        }
       /* Verify that we haven't exceeded the original size.  (It
         shouldn't happen, hence the assert.)  */
Index: src/html-url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-url.c,v
retrieving revision 1.40
diff -u -r1.40 html-url.c
--- src/html-url.c      2003/11/09 01:33:33     1.40
+++ src/html-url.c      2003/11/26 16:28:29
@@ -612,9 +612,12 @@
     init_interesting ();
 
   /* Specify MHT_TRIM_VALUES because of buggy HTML generators that
-     generate <a href=" foo"> instead of <a href="foo"> (Netscape
-     ignores spaces as well.)  If you really mean space, use &32; or
-     %20.  */
+     generate <a href=" foo"> instead of <a href="foo"> (browsers
+     ignore spaces as well.)  If you really mean space, use &#32; or
+     %20.  MHT_TRIM_VALUES also causes squashing of embedded newlines,
+     e.g. in <img src="foo.[newline]html">.  Such newlines are also
+     ignored by IE and Mozilla and are presumably introduced by
+     writing HTML with editors that force word wrap.  */
   flags = MHT_TRIM_VALUES;
   if (opt.strict_comments)
     flags |= MHT_STRICT_COMMENTS;
