Re: problem with LF/CR etc.

2003-11-26 Thread Hrvoje Niksic
Peter GILMAN [EMAIL PROTECTED] writes:

> first of all, thanks for taking the time and energy to consider this
> issue.  i was only hoping to pick up a pointer or two; i never
> realized this could turn out to be such a big deal!

Neither did we.  :-)

> 1) Jens' observation that the user will think wget is broken is
> correct.  the immediate reaction is, it works in my browser; why
> does wget say '404'?
[...]
> (and, after all, what is the purpose of wget?  is it an html
> verifier, or is it a Web-GET tool?  i submit that evaluation of the
> correctness of web code is outside the purview of wget.)

It's true that the point of Wget is not to evaluate correctness of web
pages.  But its purpose is not handling every piece of badly written
HTML on the web, either!  Just like badly written pages work in some
browsers, but not in others, some pages that work in IE will not work
in Wget.  This is nothing new.

As I said, Wget tries to handle badly written code if the mistakes are
either easy to handle or frequent enough to hamper the usefulness of
the program.  Strict comments fall into the second category, and these
embedded newlines fall into the first one.

> conclusion: if it doesn't break anything, and if it makes wget more
> useful, i can think of no reason this capability shouldn't be added.

Agreed.  This patch should fix your case.  It applies to the latest
CVS sources, but it can be easily retrofitted to earlier versions as
well.


2003-11-26  Hrvoje Niksic  [EMAIL PROTECTED]

* html-parse.c (convert_and_copy): Remove embedded newlines when
AP_TRIM_BLANKS is specified.

Index: src/html-parse.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-parse.c,v
retrieving revision 1.21
diff -u -r1.21 html-parse.c
--- src/html-parse.c  2003/11/02 16:48:40  1.21
+++ src/html-parse.c  2003/11/26 16:28:29
@@ -360,17 +360,16 @@
  the ASCII range when copying the string.
 
* AP_TRIM_BLANKS -- ignore blanks at the beginning and at the end
- of text.  */
+ of text, as well as embedded newlines.  */
 
 static void
 convert_and_copy (struct pool *pool, const char *beg, const char *end, int flags)
 {
   int old_tail = pool->tail;
-  int size;
 
-  /* First, skip blanks if required.  We must do this before entities
- are processed, so that blanks can still be inserted as, for
- instance, `&#32;'.  */
+  /* Skip blanks if required.  We must do this before entities are
+ processed, so that blanks can still be inserted as, for instance,
+ `&#32;'.  */
   if (flags & AP_TRIM_BLANKS)
 {
   while (beg < end && ISSPACE (*beg))
@@ -378,7 +377,6 @@
   while (end > beg && ISSPACE (end[-1]))
--end;
 }
-  size = end - beg;
 
   if (flags & AP_DECODE_ENTITIES)
 {
@@ -391,15 +389,14 @@
 never lengthen it.  */
   const char *from = beg;
   char *to;
+  int squash_newlines = flags & AP_TRIM_BLANKS;
 
   POOL_GROW (pool, end - beg);
   to = pool->contents + pool->tail;
 
   while (from < end)
{
- if (*from != '&')
-   *to++ = *from++;
- else
+ if (*from == '&')
{
 int entity = decode_entity (&from, end);
  if (entity != -1)
@@ -407,6 +404,10 @@
  else
*to++ = *from++;
}
+ else if ((*from == '\n' || *from == '\r') && squash_newlines)
+   ++from;
+ else
+   *to++ = *from++;
}
   /* Verify that we haven't exceeded the original size.  (It
 shouldn't happen, hence the assert.)  */
Index: src/html-url.c
===================================================================
RCS file: /pack/anoncvs/wget/src/html-url.c,v
retrieving revision 1.40
diff -u -r1.40 html-url.c
--- src/html-url.c  2003/11/09 01:33:33 1.40
+++ src/html-url.c  2003/11/26 16:28:29
@@ -612,9 +612,12 @@
 init_interesting ();
 
   /* Specify MHT_TRIM_VALUES because of buggy HTML generators that
- generate <a href=" foo"> instead of <a href="foo"> (Netscape
- ignores spaces as well.)  If you really mean space, use &32; or
- %20.  */
+ generate <a href=" foo"> instead of <a href="foo"> (browsers
+ ignore spaces as well.)  If you really mean space, use &32; or
+ %20.  MHT_TRIM_VALUES also causes squashing of embedded newlines,
+ e.g. in <img src="foo.[newline]html">.  Such newlines are also
+ ignored by IE and Mozilla and are presumably introduced by
+ writing HTML with editors that force word wrap.  */
   flags = MHT_TRIM_VALUES;
   if (opt.strict_comments)
 flags |= MHT_STRICT_COMMENTS;


Re: problem with LF/CR etc.

2003-11-20 Thread Jens Rösner
Hi!

> Do you propose that squashing newlines would break legitimate uses of
> unescaped newlines in links?
I personally think that this is the main question.
If it doesn't break other things, implement squashing newlines 
as the default behaviour.

> Or are you arguing on principle that
> such practices are too heinous to cater to by default?
Well, if I may speak openly, 
I don't think wget should be a moralist here.
If the fix is easy to implement and doesn't break things, let's do it. 
After all, ignoring these links does not punish the culprit (the HTML coder)
but the innocent user, who expects that wget will download the site.

> IMHO we should either cater to this by default or not at all.
Agreed.
But if (for whatever reason) an option is unavoidable, I would
suggest something like
--relax_html_rules #integer
where integer is a bit code (I hope that's the right term).
For example
0 = off
1 (2^0)= smart comment checking
2 (2^1)= smart line-break checking
4 (2^2)= option to come
8 (2^3)= another option to come
So specifying
wget -m --relax_html_rules 0 URL
would ensure strict HTML compliance, while
wget -m --relax_html_rules 15 URL
would relax all the above-mentioned rules.
By using this bit code, a single integer can represent
any combination of relaxations by summing up
the individual options.
One could even think about 
wget -m --relax_html_rules inf URL
to ensure that _all_ rules are relaxed, 
to be upward compatible with future wget versions.
Whether
--relax_html_rules inf
or
--relax_html_rules 0
or
--relax_html_rules another-combination-that-makes-most-sense
should be the default is up for negotiation.
However, I would vote for complete relaxation.
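
For illustration only, here is a minimal C sketch of how such a bit code
could be decoded.  The --relax_html_rules option and the RELAX_* names are
hypothetical (nothing like them exists in wget); they merely show how one
integer can carry several relaxations at once:

   /* Hypothetical sketch: decoding a --relax_html_rules bit code.
      Neither the option nor these flag names exist in Wget; they only
      illustrate how one integer can carry several relaxations.  */
   #include <stdio.h>
   #include <stdlib.h>

   #define RELAX_SMART_COMMENTS   1   /* 2^0: smart comment checking    */
   #define RELAX_SMART_LINEBREAKS 2   /* 2^1: smart line-break checking */

   int
   main (int argc, char **argv)
   {
     /* e.g. "wget -m --relax_html_rules 3 URL" would pass "3" here.  */
     unsigned long relax = argc > 1 ? strtoul (argv[1], NULL, 10) : 0;

     if (relax & RELAX_SMART_COMMENTS)
       printf ("comment checking: relaxed\n");
     if (relax & RELAX_SMART_LINEBREAKS)
       printf ("line-break checking: relaxed\n");
     if (relax == 0)
       printf ("strict HTML parsing\n");
     return 0;
   }

Passing 3 (= 1 + 2) would enable both relaxations, which is exactly the
summing behaviour described above.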

I hope that made a bit of sense.
Jens








Re: problem with LF/CR etc.

2003-11-20 Thread Peter GILMAN

greetings, wget people:

first of all, thanks for taking the time and energy to consider this
issue.  i was only hoping to pick up a pointer or two; i never realized
this could turn out to be such a big deal!

now then:

Hrvoje Niksic [EMAIL PROTECTED] rubbed two wires together, resulting
in the following:

...

| Peter, how frequent is this kind of breakage?

for me, it's been infrequent, although it seems to be becoming slightly
less uncommon, perhaps as people are picking up on this trick.


| Did you see it only on one site, or many times?

each time i run into it, it's a different site.

it's been my impression that this is something done with the specific
purpose of defeating wget and similar tools.  i've generally encountered
issues like this on high-volume websites with (presumably) huge
bandwidth bills, and my sense has been that these are overt measures
implemented to reduce bandwidth costs by making it more difficult to
access these sites.


though i'm a mere user, and would never presume to tell anybody how to
develop software, i would like to offer a few thoughts, from a user's
perspective:

1) Jens' observation that the user will think wget is broken is correct.
the immediate reaction is, it works in my browser; why does wget say
'404'?

2) browsers handle these cases.  that means that other people (at
mozilla and microsoft, for example) have run into these situations
before, which in turn means that there are enough such cases in the
wild so as not to be considered corner cases.

3) browsers handle these cases without the need of command-line options.
this seems to imply that doing so breaks nothing, and that no
strict-parsing option is necessary.  it's not an either-or
situation.  (and, after all, what is the purpose of wget?  is it an html
verifier, or is it a Web-GET tool?  i submit that evaluation of the
correctness of web code is outside the purview of wget.)

4) if it's wget's mission to do anything a web browser can do (or, if
you prefer, to be able to emulate a web browser), then it ought to
handle these cases as well.

conclusion: if it doesn't break anything, and if it makes wget more
useful, i can think of no reason this capability shouldn't be added.

just my CHF0.02!   8-)

thanks again for all your time and effort.

regards,

- pete gilman





problem with LF/CR etc.

2003-11-19 Thread Peter GILMAN


hello.

i have run into a problem while using wget: when viewing a web page with
html like this:

   <a href="images/IMG_01
   .jpg"><img src="images/tnIMG_01
   .jpg"></a>

browsers (i tested with mozilla and IE) can handle the line breaks in
the urls (presumably stripping them out), but wget chokes on the
linefeeds and carriage returns; it inserts them into the urls, and then
(naturally) fails with a 404:

   --17:17:31--  http://www.someurl.tld/images/IMG_01%0A.jpg
  => `www.someurl.tld/images/IMG_01
   .jpg'
   Connecting to www.someurl.tld[10.0.0.40]:80... connected.
   HTTP request sent, awaiting response... 404 Not Found
   17:17:31 ERROR 404: Not Found.

i've run into variants of this problem in several different places; is
there a way to handle situations like this with wget?

technical details: i am using wget 1.8.2, and my command-line invocation
is typically a very simple:

   wget -m -A jpg http://www.someurl.tld

any tips/clues would be much appreciated.

NOTE: please cc me in any replies, as i am not currently subscribed to
the list.  thanks!

- pete gilman


RE: problem with LF/CR etc.

2003-11-19 Thread Post, Mark K
That is _really_ ugly, and perhaps immoral.  Make it an option, if you must.
Certainly don't make it the default behavior.

<Shudder>


Mark Post

-----Original Message-----
From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 19, 2003 4:59 PM
To: Peter GILMAN
Cc: [EMAIL PROTECTED]
Subject: Re: problem with LF/CR etc.


Peter GILMAN [EMAIL PROTECTED] writes:

> i have run into a problem while using wget: when viewing a web page with
> html like this:
>
>    <a href="images/IMG_01
>    .jpg"><img src="images/tnIMG_01
>    .jpg"></a>

Eek!  Are people really doing that?  This is news to me.

> browsers (i tested with mozilla and IE) can handle the line breaks
> in the urls (presumably stripping them out), but wget chokes on the
> linefeeds and carriage returns; it inserts them into the urls, and
> then (naturally) fails with a 404:
[...]

So, Wget should squash all newlines?  It's not hard to implement, but
it feels kind of ... unclean.
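
For what it's worth, here is a minimal standalone sketch of the kind of
squashing being discussed.  This is not the Wget code itself; the actual
change belongs in convert_and_copy() in src/html-parse.c, as in the
2003-11-26 patch at the top of this thread:

   /* Standalone illustration only: strip embedded CR/LF from an
      attribute value before using it as a URL.  */
   #include <stdio.h>

   static void
   squash_newlines (char *s)
   {
     const char *from;
     char *to = s;
     for (from = s; *from; from++)
       if (*from != '\n' && *from != '\r')
         *to++ = *from;
     *to = '\0';
   }

   int
   main (void)
   {
     char url[] = "images/IMG_01\n.jpg";
     squash_newlines (url);
     printf ("%s\n", url);   /* prints "images/IMG_01.jpg" */
     return 0;
   }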