Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
Theoretically, an HTTP/1.0 server should accept an unknown content-length
if the connection is closed after the request.
Unfortunately, the 411 (Length Required) response is only defined in
HTTP/1.1.

//Stefan

On Tuesday, 07.10.03, at 01:12 (Europe/Berlin), Hrvoje Niksic wrote:

As I was writing the manual for `--post', I decided that I wasn't
happy with this part:
Please be aware that Wget needs to know the size of the POST data
in advance.  Therefore the argument to @code{--post-file} must be
a regular file; specifying a FIFO or something like
@file{/dev/stdin} won't work.
My first impulse was to bemoan Wget's antiquated HTTP code which
doesn't understand chunked transfer.  But, coming to think of it,
even if Wget used HTTP/1.1, I don't see how a client can send chunked
requests and interoperate with HTTP/1.0 servers.
The thing is, to be certain that you can use chunked transfer, you
have to know you're dealing with an HTTP/1.1 server.  But you can't
know that until you receive a response.  And you don't get a response
until you've finished sending the request.  A chicken-and-egg problem!
Of course, once a response is received, we could remember that we're
dealing with an HTTP/1.1 server, but that information is all but
useless, since Wget's `--post' is typically used to POST information
to one URL and exit.
Is there a sane way to stream data to HTTP/1.0 servers that expect
POST?




Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Daniel Stenberg
On Tue, 7 Oct 2003, Hrvoje Niksic wrote:

 My first impulse was to bemoan Wget's antiquated HTTP code which doesn't
 understand chunked transfer.  But, coming to think of it, even if Wget
 used HTTP/1.1, I don't see how a client can send chunked requests and
 interoperate with HTTP/1.0 servers.

 The thing is, to be certain that you can use chunked transfer, you
 have to know you're dealing with an HTTP/1.1 server.  But you can't
 know that until you receive a response.  And you don't get a response
 until you've finished sending the request.  A chicken-and-egg problem!

The only way to deal with this automatically that I can think of is to use
an Expect: 100-continue request header; based on the 100 response you can
decide whether the server is 1.1 or not.

Other than that, I think a command line option is the only choice.
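
As a rough illustration of the probe described above (a Python sketch with a
made-up host and path; this is not Wget code): send the headers with
Expect: 100-continue and Transfer-Encoding: chunked, and only stream the
chunked body if a 100 Continue actually comes back.

# Sketch of the "Expect: 100-continue" probe; host/path are hypothetical.
import socket

HOST, PORT, PATH = "upload.example.com", 80, "/cgi-bin/accept-post"

sock = socket.create_connection((HOST, PORT))
sock.settimeout(5)                      # don't wait forever for the 100
sock.sendall(("POST %s HTTP/1.1\r\n"
              "Host: %s\r\n"
              "Transfer-Encoding: chunked\r\n"
              "Expect: 100-continue\r\n"
              "\r\n" % (PATH, HOST)).encode("ascii"))

# An HTTP/1.1 server willing to take the request should answer
# "HTTP/1.1 100 Continue"; an HTTP/1.0 server will not.
reply = sock.recv(4096).decode("latin-1", "replace")
if reply.startswith("HTTP/1.1 100"):
    for piece in (b"first chunk", b"second chunk"):
        # Chunked framing: hex size, CRLF, data, CRLF ...
        sock.sendall(b"%x\r\n%s\r\n" % (len(piece), piece))
    sock.sendall(b"0\r\n\r\n")          # ... terminated by a zero-sized chunk
    print(sock.recv(4096).decode("latin-1", "replace"))
else:
    print("No 100 Continue; fall back to Content-Length or a command-line option")
sock.close()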

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


Re: wget 1.9 - behaviour change in recursive downloads

2003-10-07 Thread Jochen Roderburg
Quoting Hrvoje Niksic [EMAIL PROTECTED]:

 Jochen Roderburg [EMAIL PROTECTED] writes:
 
  Quoting Hrvoje Niksic [EMAIL PROTECTED]:
 
  It's a feature.  `-A zip' means `-A zip', not `-A zip,html'.  Wget
  downloads the HTML files only because it absolutely has to, in order
  to recurse through them.  After it finds the links in them, it deletes
  them.
 
  Hmm, so it has really been an undetected error over all the years
  ;-) ?
 
 s/undetected/unfixed/
 
 At least I've always considered it an error.  I didn't know people
 depended on it.

Well, *depend* is a rather strong expression for that ;-)
It has always worked that way, I got used to it, and I never really thought about
whether it was correct or not, because I had a use for it. So I was astonished when
these files suddenly disappeared.

As I already wrote, I will now mention them explicitly. I think the worst that
will happen is that I get a few more of them than before.

Perhaps the whole thing could be mentioned in the documentation of the
accept/reject options. Currently there is only this sentence there:

 Note that these two options do not affect the downloading of HTML
 files; Wget must load all the HTMLs to know where to go at
 all--recursive retrieval would make no sense otherwise.

J. Roderburg





Re: some wget patches against beta3

2003-10-07 Thread Karl Eichwalder
Hrvoje Niksic [EMAIL PROTECTED] writes:

 As for the Polish translation, translations are normally handled
 through the Translation Project.  The TP robot is currently down, but
 I assume it will be back up soon, and then we'll submit the POT file
 and update the translations /en masse/.

It took a little bit longer than expected but now, the robot is up and
running again.  This morning (CET) I installed b3 for translation.



Re: -q and -S are incompatible

2003-10-07 Thread Hrvoje Niksic
Dan Jacobson [EMAIL PROTECTED] writes:

 -q and -S are incompatible and should perhaps produce errors and be
 noted thus in the docs.

They seem to work as I'd expect -- `-q' tells Wget to print *nothing*,
and that's what happens.  The output Wget would have generated does
contain HTTP headers, among other things, but it never gets printed.

 BTW, there seems to be no way to get the -S output without the progress
 indicator.  -nv and -q kill them both.

It's a bug that `-nv' kills `-S' output, I think.

 P.S. one shouldn't have to confirm each bug submission. Once should
 be enough.

You're right.  :-(  I'll ask the sunsite people if there's a way to
establish some form of white lists...



Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 As for the Polish translation, translations are normally handled
 through the Translation Project.  The TP robot is currently down, but
 I assume it will be back up soon, and then we'll submit the POT file
 and update the translations /en masse/.

 It took a little bit longer than expected but now, the robot is up and
 running again.  This morning (CET) I installed b3 for translation.

However, http://www2.iro.umontreal.ca/~gnutra/registry.cgi?domain=wget
still shows `wget-1.8.2.pot' to be the current template for [the]
domain.  Also, my Croatian translation of 1.9 doesn't seem to have
made it in.  Is that expected?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Also, my Croatian translation of 1.9 doesn't seem to have made it
 in.  Is that expected?

 Unfortunately, yes.  Will you please resubmit it with the subject line
 updated (IIRC, it's now):

 TP-Robot wget-1.9-b3.hr.po

I'm not sure what b3 is, but the version in the POT file was
supposed to be beta3.  Was there a misunderstanding somewhere along
the line?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 I'm not sure what b3 is, but the version in the POT file was
 supposed to be beta3.  Was there a misunderstanding somewhere along
 the line?

 Yes, the robot does not like beta3 as part of the version
 string. b3 or pre3 are okay.

Ouch.  Why does the robot care about version names at all?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 Hrvoje Niksic [EMAIL PROTECTED] writes:

 Ouch.  Why does the robot care about version names at all?

 It must know about the sequences; this is important for merging
 issues.  IIRC, we have at least these sequences supported by the
 robot:

 1.2 -> 1.2.1 -> 1.2.2 -> 1.3 etc.

 1.2 -> 1.2a -> 1.2b -> 1.3

 1.2 -> 1.3-pre1 -> 1.3-pre2 -> 1.3

 1.2 -> 1.3-b1 -> 1.3-b2 -> 1.3

Thanks for the clarification, Karl.  But as a maintainer of a project
that tries to use the robot, I must say that I'm not happy about this.

If the robot absolutely must be able to collate versions, then it
should be smarter about it and support a larger array of formats in
use out there.  See `dpkg' for an example of how it can be done,
although the TP robot certainly doesn't need to do all that `dpkg'
does.

This way, unless I'm missing something, the robot seems to be in the
position to dictate its very narrow-minded versioning scheme to the
projects that would only like to use it (the robot).  That's really
bad.  But what's even worse is that something or someone silently
changed "beta3" to "b3" in the POT, and then failed to perform the
same change for my translation, which caused it to get dropped without
notice.  Returning an error that says "your version number is
unparsable to this piece of software, you must use one of ..."
instead would be more correct in the long run.

Is the robot written in Python?  Would you consider it for inclusion
if I donated a function that performed the comparison more fully
(provided, of course, that the code meets your standards of quality)?
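
Purely as an illustration of the kind of comparison function being offered
here (this is not the TP robot's code, and real dpkg ordering is more
involved), a permissive version comparison might look like this in Python:

# Rough sketch of a more permissive version comparison -- illustrative only.
import re

def _parts(version):
    # Split "1.9-beta3" into comparable pieces: numbers compare numerically,
    # known pre-release words sort below a final release.
    order = {"alpha": -4, "a": -4, "beta": -3, "b": -3, "pre": -2, "rc": -1}
    parts = []
    for token in re.findall(r"\d+|[A-Za-z]+", version):
        if token.isdigit():
            parts.append((1, int(token)))
        else:
            parts.append((0, order.get(token.lower(), 0)))
    return parts

def compare_versions(a, b):
    """Return -1, 0 or 1, so that 1.9-b3 == 1.9-beta3 < 1.9 < 1.9.1."""
    pa, pb = _parts(a), _parts(b)
    # Pad the shorter list with "release" markers so that "1.9" > "1.9-beta3".
    length = max(len(pa), len(pb))
    pa += [(1, 0)] * (length - len(pa))
    pb += [(1, 0)] * (length - len(pb))
    return (pa > pb) - (pa < pb)

assert compare_versions("1.9-beta3", "1.9-b3") == 0
assert compare_versions("1.9-beta3", "1.9") == -1
assert compare_versions("1.9", "1.9.1") == -1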


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 Please be aware that Wget needs to know the size of the POST data
 in advance.  Therefore the argument to @code{--post-file} must be
 a regular file; specifying a FIFO or something like
 @file{/dev/stdin} won't work.

There's nothing that says you have to read the data after you've started
sending the POST. Why not just read the --post-file before constructing the
request so that you know how big it is?

 My first impulse was to bemoan Wget's antiquated HTTP code which
 doesn't understand chunked transfer.  But, coming to think of it,
 even if Wget used HTTP/1.1, I don't see how a client can send chunked
 requests and interoperate with HTTP/1.0 servers.

How do browsers figure out whether they can do a chunked transfer or not?

Tony



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 Please be aware that Wget needs to know the size of the POST
 data in advance.  Therefore the argument to @code{--post-file}
 must be a regular file; specifying a FIFO or something like
 @file{/dev/stdin} won't work.

 There's nothing that says you have to read the data after you've
 started sending the POST. Why not just read the --post-file before
 constructing the request so that you know how big it is?

I don't understand what you're proposing.  Reading the whole file in
memory is too memory-intensive for large files (one could presumably
POST really huge files, CD images or whatever).

What the current code does is: determine the file size, send
Content-Length, read the file in chunks (up to the promised size) and
send those chunks to the server.  But that works only with regular
files.  It would be really nice to be able to say something like:

mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin
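
A minimal illustration of the current logic described above (a Python sketch
with a hypothetical function name; not Wget's actual code): the promised
length comes from stat(), which is exactly what a FIFO like /dev/stdin in
the example cannot provide up front.

# Illustration only -- not Wget source.  --post-file works because the
# Content-Length is taken from stat() before anything is sent; the body is
# then streamed in fixed-size blocks up to that promised size.
import os
import socket
import stat

def post_regular_file(host, path, filename, port=80, blocksize=8192):
    st = os.stat(filename)
    if not stat.S_ISREG(st.st_mode):
        raise ValueError("%s is not a regular file; its size cannot be known "
                         "in advance" % filename)
    sock = socket.create_connection((host, port))
    sock.sendall(("POST %s HTTP/1.0\r\n"
                  "Host: %s\r\n"
                  "Content-Length: %d\r\n"
                  "\r\n" % (path, host, st.st_size)).encode("ascii"))
    with open(filename, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            sock.sendall(block)
    response = sock.recv(4096)          # read (the start of) the reply
    sock.close()
    return response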

 My first impulse was to bemoan Wget's antiquated HTTP code which
 doesn't understand chunked transfer.  But, coming to think of it,
 even if Wget used HTTP/1.1, I don't see how a client can send
 chunked requests and interoperate with HTTP/1.0 servers.

 How do browsers figure out whether they can do a chunked transfer or
 not?

I haven't checked, but I'm 99% convinced that browsers simply don't
give a shit about non-regular files.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote:

 What the current code does is: determine the file size, send
 Content-Length, read the file in chunks (up to the promised size) and
 send those chunks to the server.  But that works only with regular
 files.  It would be really nice to be able to say something like:

 mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin

That would indeed be nice. Since I'm coming from the WebDAV side
of life: does wget allow the use of PUT?

   My first impulse was to bemoan Wget's antiquated HTTP code which
   doesn't understand chunked transfer.  But, coming to think of it,
   even if Wget used HTTP/1.1, I don't see how a client can send
   chunked requests and interoperate with HTTP/1.0 servers.

  How do browsers figure out whether they can do a chunked transfer or
  not?

 I haven't checked, but I'm 99% convinced that browsers simply don't
 give a shit about non-regular files.

That's probably true. But have you tried sending without Content-Length
and Connection: close and closing the output side of the socket before
starting to read the reply from the server?
//Stefan
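
A minimal sketch of the half-close trick Stefan describes, assuming a
hypothetical host and CGI path (Python; not something Wget currently does):
no Content-Length, a Connection: close header, then shutdown() on the write
side of the socket to mark the end of the body before reading the reply.

# Sketch of the half-close idea discussed above; host/path are invented.
import socket
import sys

HOST, PATH = "burner.example.com", "/localburn.cgi"

sock = socket.create_connection((HOST, 80))
sock.sendall(("POST %s HTTP/1.0\r\n"
              "Host: %s\r\n"
              "Connection: close\r\n"
              "\r\n" % (PATH, HOST)).encode("ascii"))

# Stream stdin (e.g. the output of mkisofs) straight to the server,
# without ever knowing the total size.
while True:
    block = sys.stdin.buffer.read(8192)
    if not block:
        break
    sock.sendall(block)

sock.shutdown(socket.SHUT_WR)   # "close the output side of the socket"

# Read whatever the server answers.  Whether a given server accepts a
# POST without Content-Length at all is exactly the open question here.
response = b""
while True:
    data = sock.recv(4096)
    if not data:
        break
    response += data
print(response.decode("latin-1", "replace"))
sock.close()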




Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder [EMAIL PROTECTED] writes:

 I guess you, as the wget maintainer, switched from something
 supported to the unsupported betaX scheme, and now we have
 something to talk about ;)

I had no idea that something as common as betaX was unsupported.  In
fact, I believe that bX was added when Francois saw me using it in
Wget.  :-)

 Using something different than exactly wget-1.9-b3.de.po will
 confuse the robot

<sigh>

 Returning an error that says "your version number is unparsable to
 this piece of software, you must use one of ..." instead would be
 more correct in the long run.

 Sure.  You should have received a message like this, didn't you?

I didn't.  Maybe it was an artifact of the robot not having worked at the
time, though.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Stefan Eissing [EMAIL PROTECTED] writes:

 On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote:
 What the current code does is: determine the file size, send
 Content-Length, read the file in chunks (up to the promised size) and
 send those chunks to the server.  But that works only with regular
 files.  It would be really nice to be able to say something like:

 mkisofs blabla | wget http://burner/localburn.cgi --post-file
 /dev/stdin

 That would indeed be nice. Since I'm coming from the WebDAV side
 of life: does wget allow the use of PUT?

No.

 I haven't checked, but I'm 99% convinced that browsers simply don't
 give a shit about non-regular files.

 That's probably true. But have you tried sending without
 Content-Length and Connection: close and closing the output side of
 the socket before starting to read the reply from the server?

That might work, but it sounds too dangerous to do by default, and too
obscure to devote a command-line option to.  Besides, HTTP/1.1
*requires* requests with a request-body to provide Content-Length:

   For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
   containing a message-body MUST include a valid Content-Length
   header field unless the server is known to be HTTP/1.1 compliant.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
On Tuesday, 07.10.03, at 17:02 (Europe/Berlin), Hrvoje Niksic wrote:

  That's probably true. But have you tried sending without
  Content-Length and Connection: close and closing the output side of
  the socket before starting to read the reply from the server?

 That might work, but it sounds too dangerous to do by default, and too
 obscure to devote a command-line option to.  Besides, HTTP/1.1
 *requires* requests with a request-body to provide Content-Length:

    For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
    containing a message-body MUST include a valid Content-Length
    header field unless the server is known to be HTTP/1.1 compliant.

I just checked with RFC 1945 and it explicitly says that POSTs must
carry a valid Content-Length header.
That leaves the option of first sending an OPTIONS request to the
server (either url or *) to check the HTTP version.
//Stefan
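
A sketch of that OPTIONS probe (Python, invented host; again not something
Wget does): the HTTP-Version in the status line of the reply says whether
the server speaks 1.1 before you commit to a chunked POST.  The obvious cost
is an extra round trip for what is typically a single --post invocation.

# Probe the server's HTTP version with OPTIONS before posting.
import socket

HOST = "burner.example.com"

sock = socket.create_connection((HOST, 80))
sock.sendall(("OPTIONS * HTTP/1.1\r\n"
              "Host: %s\r\n"
              "Connection: close\r\n"
              "\r\n" % HOST).encode("ascii"))
status_line = sock.makefile("rb").readline().decode("latin-1").strip()
sock.close()

# e.g. "HTTP/1.1 200 OK" -> the server speaks 1.1, so a chunked request
# (or one without Content-Length) has a chance of being accepted.
print(status_line)
print("HTTP/1.1 server" if status_line.startswith("HTTP/1.1") else "assume HTTP/1.0")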




[PATCH] wget-1.8.2: Portability, plus EBCDIC patch

2003-10-07 Thread Martin Kraemer
Hello Hrvoje and Dan,

I have been using wget for many years now, and finally got to applying
a patch I made long ago (EBCDIC patch against wget-1.5.3) to the
current wget-1.8.2. This patch makes wget compile and run on a
mainframe computer using the EBCDIC character set.

Also, when compiling wget on Solaris (using the SUNWspro Forte
compiler), I stumbled over a portability problem (C++ comments in a 
C source) to which I add a patch as well.

About the EBCDIC patch:
* The goal was to create a patch which worked for our EBCDIC system
  (Fujitsu-Siemens' mainframe OS is called BS2000, it runs on /390
  hardware, but is not compatible with OS/390 per se) but would be
  easily adaptable to OS/390 (to which I have no access, but whose
  behaviour I know from similar ports). The code to actually make
  it work for OS/390 is not in place, but I added a tool (called
  safe-ctype-mk.c -- delete if you don't like it) to create the
  additions to safe-ctype.c which are necessary because IBM's
  EBCDIC differs from our EBCDIC.

* Because code conversion is necessary for text files, a distinction
  between text and binary download was added (based on the
  downloaded MIME type; see the routines http_set_convert_flag() and
  http_get_convert_flag(). A future patch may add a new
  --conversion=text/binary/auto switch which is not implemented
  yet.)  Currently, the same heuristics are used as in the Apache
  HTTP server to determine whether conversion is required (for
  several kinds of text files) or not required (for images,
  compressed files etc.)

* Because EBCDIC alphabetic characters live in the range between
  '\xA1' and '\xE9', the getopt_long() numbers have been shifted up
  by 200, beyond the 0xFF boundary, to avoid conflicts between
  single-character options and numeric long-option values. That does
  not change the behaviour on ASCII machines, but allows the source
  to compile on EBCDIC machines (otherwise: error: multiple case in
  switch).

* wget-1.8.2 has been compiled on our BS2000, with the patch applied,
  and with SSL enabled (against openssl-0.9.6k), and has been tested
  to work correctly.

If you would add the patch to future versions of wget, then all
users of our BS2000 as well as users of IBM's OS/390 could take
advantage of the availability of wget for EBCDIC-based machines, and
hopefully someone would also contribute the missing IBM-EBCDIC
counterparts to our BS2000-EBCDIC patch.

  Martin
-- 
[EMAIL PROTECTED] | Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-47655 | 81730  Munich,  Germany
diff -bur wget-1.8.2/src/ftp.c work/wget-1.8.2/src/ftp.c
--- wget-1.8.2/src/ftp.c.orig   2003-10-06 17:20:58.710178000 +0200
+++ wget-1.8.2/src/ftp.c2003-10-06 17:17:00.399371000 +0200
@@ -474,7 +474,7 @@
}
 
   err = ftp_size(&con->rbuf, u->file, len);
-//  printf("\ndebug: %lld\n", *len);
+/*  printf("\ndebug: %lld\n", *len); */
   /* FTPRERR */
   switch (err)
{
diff -bur wget-1.8.2/src/http.c work/wget-1.8.2/src/http.c
--- wget-1.8.2/src/http.c.orig  2003-10-06 17:20:58.900182000 +0200
+++ wget-1.8.2/src/http.c   2003-10-06 17:19:16.829836000 +0200
@@ -1777,7 +1777,7 @@
  FREE_MAYBE (dummy);
  return RETROK;
}
-//  fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime);
+/*  fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime); */
   tmrate = retr_rate (hstat.len - hstat.restval, hstat.dltime, 0);
 
   if (hstat.len == hstat.contlen)
diff -bur wget-1.8.2.orig/src/connect.c wget-1.8.2/src/connect.c
--- wget-1.8.2.orig/src/connect.c   Mon Oct  6 17:13:11 2003
+++ wget-1.8.2/src/connect.cMon Oct  6 17:10:28 2003
@@ -47,6 +47,10 @@
 #endif
 #endif /* WINDOWS */
 
+#if #system(bs2000)
+#include "ascii_ebcdic.h"
+#endif
+
 #include <errno.h>
 #ifdef HAVE_STRING_H
 # include <string.h>
@@ -73,6 +77,26 @@
to connect_to_one.  */
 static const char *connection_host_name;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+/* Start off with convert=1 (headers are always converted) */
+static int convert_flag_last_reply = 1;
+
+void
+http_set_convert_flag(const char *type)
+{
+convert_flag_last_reply = 
+   (strncasecmp(type, "text/", 5) == 0
+   || strncasecmp(type, "message/", 8) == 0
+   || strcasecmp(type, "application/postscript") == 0);
+}
+
+int
+http_get_convert_flag()
+{
+return convert_flag_last_reply;
+}
+#endif
+ 
 void
 set_connection_host_name (const char *host)
 {
@@ -459,6 +483,11 @@
 }
   while (res == -1  errno == EINTR);
 
+#if 'A' == '\xC1'
+  if (res > 0 && http_get_convert_flag())
+_a2e_n(buf,res);
+#endif
+
   return res;
 }
 
@@ -472,6 +501,25 @@
 {
   int res = 0;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+  static char *cbuf = NULL;
+  static int csize = 0;
+
+  if (len > csize) {
+if (cbuf != NULL)
+  free(cbuf);
+cbuf = malloc(csize = len+8192); /* add arbitrary amount of skew */
+if 

Re: [PATCH] wget-1.8.2: Portability, plus EBCDIC patch

2003-10-07 Thread Hrvoje Niksic
Martin, thanks for the patch and the detailed report.  Note that it
might have made more sense to apply the patch to the latest CVS
version, which is somewhat different from 1.8.2.

I'm really not sure whether to add this patch.  On the one hand, it's
nice to support as many architectures as possible.  But on the other
hand, most systems are ASCII.  All the systems I've ever seen or
worked on have been ASCII.  I am fairly certain that I would not be
able to support EBCDIC in the long run and that, unless someone were
to continually support EBCDIC, the existing support would bitrot away.

Is anyone on the Wget list using an EBCDIC system?


Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Hello,

I have noticed very unpredictable behavior from wget 1.8.2 - specifically
I have noticed two things:

a) sometimes it does not follow all of the links it should

b) sometimes wget will follow links to other sites and URLs - when the
command line used should not allow it to do that.


Here are the details.


First, sometimes when you attempt to download a site with -k -m
(--convert-links and --mirror) wget will not follow all of the links and
will skip some of the files!

I have no idea why it does this with some sites and doesn't do it with
other sites.  Here is an example that I have reproduced on several systems
- all with 1.8.2:

# wget -k -m http://www.zorg.org/vsound/
--17:09:32--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Resolving www.zorg.org... done.
Connecting to www.zorg.org[213.232.100.31]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                  ] 12,235        53.82K/s

Last-modified header missing -- time-stamps turned off.
17:09:32 (53.82 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]


FINISHED --17:09:32--
Downloaded: 12,235 bytes in 1 files
Converting www.zorg.org/vsound/index.html... 2-6
Converted 1 files in 0.03 seconds.


What is the problem here?  When I run the exact same command line with
wget 1.6, I get this:


# wget -k -m http://www.zorg.org/vsound/
--11:10:06--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K -> .. .

Last-modified header missing -- time-stamps turned off.
11:10:07 (71.12 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

Loading robots.txt; please ignore errors.
--11:10:07--  http://www.zorg.org/robots.txt
   => `www.zorg.org/robots.txt'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 404 Not Found
11:10:07 ERROR 404: Not Found.

--11:10:07--  http://www.zorg.org/vsound/vsound.jpg
   => `www.zorg.org/vsound/vsound.jpg'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 27,629 [image/jpeg]

0K -> .. .. ..   [100%]

11:10:08 (51.49 KB/s) - `www.zorg.org/vsound/vsound.jpg' saved
[27629/27629]

--11:10:09--  http://www.zorg.org/vsound/vsound-0.2.tar.gz
   => `www.zorg.org/vsound/vsound-0.2.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 108,987 [application/x-tar]

0K -> .. .. .. .. .. [ 46%]
   50K -> .. .. .. .. .. [ 93%]
  100K -> .. [100%]

11:10:12 (46.60 KB/s) - `www.zorg.org/vsound/vsound-0.2.tar.gz' saved
[108987/108987]

--11:10:12--  http://www.zorg.org/vsound/vsound-0.5.tar.gz
   => `www.zorg.org/vsound/vsound-0.5.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 116,904 [application/x-tar]

0K -> .. .. .. .. .. [ 43%]
   50K -> .. .. .. .. .. [ 87%]
  100K -> .. [100%]

11:10:14 (60.44 KB/s) - `www.zorg.org/vsound/vsound-0.5.tar.gz' saved
[116904/116904]

--11:10:14--  http://www.zorg.org/vsound/vsound
   => `www.zorg.org/vsound/vsound'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 3,365 [text/plain]

0K -> ...[100%]

11:10:14 (3.21 MB/s) - `www.zorg.org/vsound/vsound' saved [3365/3365]

Converting www.zorg.org/vsound/index.html... done.

FINISHED --11:10:14--
Downloaded: 269,120 bytes in 5 files
Converting www.zorg.org/vsound/index.html... done.


See?  It gets the links inside of index.html, and mirrors those links,
and converts them - just like it should.  Why does 1.8.2 have a problem
with this site?  Other sites are handled just fine by 1.8.2 with the same
command line ... it makes no sense that wget 1.8.2 has problems with
particular web sites.

This is incorrect behavior - and if you try the same URL with 1.8.2 you
can reproduce the same results.




The second problem, and I cannot currently give you an example to try
yourself but _it does happen_, is if you use this command line:

wget --tries=inf -nH --no-parent
--directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
--convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it will act normally, just going over the site in question, but
sometimes, you will come back to the terminal and see it grabbing all
sorts of pages from totally different sites (!)  I have seen this happen

Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't understand what you're proposing.  Reading the whole file in
 memory is too memory-intensive for large files (one could presumably
 POST really huge files, CD images or whatever).

I was proposing that you read the file to determine the length, but that was
on the assumption that you could read the input twice, which won't work with
the example you proposed.

 It would be really nice to be able to say something like:

 mkisofs blabla | wget http://burner/localburn.cgi --post-file
 /dev/stdin

Stefan Eissing wrote:

 I just checked with RFC 1945 and it explicitly says that POSTs must
 carry a valid Content-Length header.

In that case, Hrvoje will need to get creative. :-)

Can you determine if --post-file is a regular file? If so, I still think you
should just read (or otherwise examine) the file to determine the length.

 For other types of input, perhaps you want to write the input to a temporary
 file.

Tony
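
A small sketch of what Tony is proposing (Python, illustrative only): stat
the argument, and when it is not a regular file, spool it to a temporary
file first so that the length is known before the request is built.

# Illustration of the proposal above: learn the length of arbitrary input
# by spooling non-regular files to a temporary file first.  Not Wget code.
import os
import shutil
import stat
import tempfile

def open_with_known_length(path):
    """Return (file_object, length) for --post-file-style input."""
    st = os.stat(path)
    if stat.S_ISREG(st.st_mode):
        return open(path, "rb"), st.st_size
    # FIFO, /dev/stdin, etc.: copy to a temp file to discover the size.
    # (Fine for short input; for a mkisofs-sized stream this doubles the
    # I/O, which is the objection raised later in the thread.)
    tmp = tempfile.TemporaryFile()
    with open(path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    length = tmp.tell()
    tmp.seek(0)
    return tmp, length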



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 I don't understand what you're proposing.  Reading the whole file in
 memory is too memory-intensive for large files (one could presumably
 POST really huge files, CD images or whatever).

 I was proposing that you read the file to determine the length, but
 that was on the assumption that you could read the input twice,
 which won't work with the example you proposed.

In fact, it won't work with anything except regular files and links to
them.

 Can you determine if --post-file is a regular file?

Yes.

 If so, I still think you should just read (or otherwise examine) the
 file to determine the length.

That's how --post-file works now.  The problem is that it doesn't work
for non-regular files.  My first message explains it, or at least
tries to.

 For other types of input, perhaps you want write the input to a
 temporary file.

That would work for short streaming, but would be pretty bad in the
mkisofs example.  One would expect Wget to be able to stream the data
to the server, and that's just not possible if the size needs to be
known in advance, which HTTP/1.0 requires.


Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Hrvoje Niksic
Josh Brooks [EMAIL PROTECTED] writes:

 I have noticed very unpredictable behavior from wget 1.8.2 -
 specifically I have noticed two things:

 a) sometimes it does not follow all of the links it should

 b) sometimes wget will follow links to other sites and URLs - when the
 command line used should not allow it to do that.

Thanks for the report.  A more detailed response follows below:

 First, sometimes when you attempt to download a site with -k -m
 (--convert-links and --mirror) wget will not follow all of the links and
 will skip some of the files!

 I have no idea why it does this with some sites and doesn't do it with
 other sites.  Here is an example that I have reproduced on several systems
 - all with 1.8.2:

Links are missed on some sites because of the use of incorrect
comments.  This has been fixed for Wget 1.9, where a more relaxed
comment parsing code is the default.  But that's not the case for
www.zorg.org/vsound/.

www.zorg.org/vsound/ contains this markup:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

That explicitly tells robots, such as Wget, not to follow the links in
the page.  Wget respects this and does not follow the links.  You can
tell Wget to ignore the robot directives.  For me, this works as
expected:

wget -km -e robots=off http://www.zorg.org/vsound/

You can put `robots=off' in your .wgetrc and this problem will not
bother you again.

 The second problem, and I cannot currently give you an example to try
 yourself but _it does happen_, is if you use this command line:

 wget --tries=inf -nH --no-parent
 --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
 --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
 MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

 At first it will act normally, just going over the site in question, but
 sometimes, you will come back to the terminal and see it grabbing all
 sorts of pages from totally different sites (!)

The only way I've seen it happen is when it follows a redirection to a
different site.  The redirection is followed because it's considered
to be part of the same download.  However, further links on the
redirected site are not (supposed to be) followed.

If you have a repeatable example, please mail it here so we can
examine it in more detail.


Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Thank you for the great response.  It is much appreciated - see below...

On Tue, 7 Oct 2003, Hrvoje Niksic wrote:

 www.zorg.org/vsound/ contains this markup:

 <META NAME="ROBOTS" CONTENT="NOFOLLOW">

 That explicitly tells robots, such as Wget, not to follow the links in
 the page.  Wget respects this and does not follow the links.  You can
 tell Wget to ignore the robot directives.  For me, this works as
 expected:

 wget -km -e robots=off http://www.zorg.org/vsound/

Perfect - thank you.


  At first it will act normally, just going over the site in question, but
  sometimes, you will come back to the terminal and see it grabbing all
  sorts of pages from totally different sites (!)

 The only way I've seen it happen is when it follows a redirection to a
 different site.  The redirection is followed because it's considered
 to be part of the same download.  However, further links on the
 redirected site are not (supposed to be) followed.

Ok, is there a way to tell wget not to follow redirects, so it will not
ever do that at all?  Basically I am looking for a way to tell wget
"don't ever get anything with a different FQDN than what I started you
with".

thanks.



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 That would work for short streaming, but would be pretty bad in the
 mkisofs example.  One would expect Wget to be able to stream the data
 to the server, and that's just not possible if the size needs to be
 known in advance, which HTTP/1.0 requires.

One might expect it, but if it's not possible using the HTTP protocol, what
can you do? :-)



Re: Web page source using wget?

2003-10-07 Thread Suhas Tembe
Thanks everyone for the replies so far..

The problem I am having is that the customer is using ASP & Java script. The URL stays
the same as I click through the links. So, using "wget URL" for the page I want may
not work (I may be wrong). Any suggestions on how I can tackle this?

Thanks,
Suhas

- Original Message - 
From: Hrvoje Niksic [EMAIL PROTECTED]
To: Suhas Tembe [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, October 06, 2003 5:19 PM
Subject: Re: Web page source using wget?


 Suhas Tembe [EMAIL PROTECTED] writes:
 
  Hello Everyone,
 
  I am new to this wget utility, so pardon my ignorance.. Here is a
  brief explanation of what I am currently doing:
 
  1). I go to our customer's website every day & log in using a User Name & Password.
  2). I click on 3 links before I get to the page I want.
  3). I right-click on the page & choose "view source". It opens it up in Notepad.
  4). I save the source to a file & subsequently perform various tasks on that
  file.
 
  As you can see, it is a manual process. What I would like to do is
  automate this process of obtaining the source of a page using
  wget. Is this possible? Maybe you can give me some suggestions.
 
 It's possible, in fact it's what Wget does in its most basic form.
 Disregarding authentication, the recipe would be:
 
 1) Write down the URL.
 
 2) Type `wget URL' and you get the source of the page in file named
SOMETHING.html, where SOMETHING is the file name that the URL ends
with.
 
 Of course, you will also have to specify the credentials to the page,
 and Tony explained how to do that.
 



Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 Thanks everyone for the replies so far..

 The problem I am having is that the customer is using ASP & Java
 script. The URL stays the same as I click through the links.

URL staying the same is usually a sign of the use of frames, not of ASP
and JavaScript.  Instead of looking at the URL entry field, try using
"copy link to clipboard" instead of clicking on the last link.  Then
use Wget on that.



Re: Web page source using wget?

2003-10-07 Thread Suhas Tembe
Got it! Thanks! So far so good. After logging in, I was able to get to the page I am
interested in. There was one thing that I forgot to mention in my earlier posts (I
apologize)... this page contains a drop-down list of our customer's locations. At
present, I choose one location from the drop-down list & click "submit" to get the
data, which is displayed in a report format. I right-click & then choose "view
source" & save source to a file. I then choose the next location from the
drop-down list, click "submit" again. I again do a "view source" & save the source to
another file and so on for all their locations.

I am not quite sure how to automate this process! How can I do this non-interactively,
especially the "submit" portion of the page? Is this possible using wget?

Thanks,
Suhas

- Original Message - 
From: Hrvoje Niksic [EMAIL PROTECTED]
To: Suhas Tembe [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, October 07, 2003 5:02 PM
Subject: Re: Web page source using wget?


 Suhas Tembe [EMAIL PROTECTED] writes:
 
  Thanks everyone for the replies so far..
 
  The problem I am having is that the customer is using ASP & Java
  script. The URL stays the same as I click through the links.
 
 URL staying the same is usually a sign of the use of frames, not of ASP
 and JavaScript.  Instead of looking at the URL entry field, try using
 "copy link to clipboard" instead of clicking on the last link.  Then
 use Wget on that.
 



Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 this page contains a drop-down list of our customer's locations.
 At present, I choose one location from the drop-down list & click
 "submit" to get the data, which is displayed in a report format. I
 right-click & then choose "view source" & save source to a file.
 I then choose the next location from the drop-down list, click
 "submit" again. I again do a "view source" & save the source to
 another file and so on for all their locations.

It's possible to automate this, but it requires some knowledge of
HTML.  Basically, you need to look at the <form>...</form> part of the
page and find the <select> tag that defines the drop-down.  Assuming
that the form looks like this:

<form action="http://foo.com/customer" method="GET">
  <select name="location">
    <option value="ca">California</option>
    <option value="ma">Massachussetts</option>
    ...
  </select>
</form>

you'd automate getting the locations by doing something like:

for loc in ca ma ...
do
  wget "http://foo.com/customer?location=$loc"
done

Wget will save the respective sources in files named
customer?location=ca, customer?location=ma, etc.

But this was only an example.  The actual process depends on what's in
the form, and it might be considerably more complex than this.



Re: Web page source using wget?

2003-10-07 Thread Suhas Tembe
It does look a little complicated. This is how it looks:

<form action="InventoryStatus.asp" method="post" name="select" onsubmit="return select_validate();" style="margin:0">
<div style="margin-top:10px">
<table border="1" bordercolor="#d9d9d9" bordercolordark="#ff" bordercolorlight="#d9d9d9" cellpadding="3" cellspacing="0" width="100%">
<tr>
<td style="font-weight:bold;color:black;background-color:#CC;text-align:right" width="20%"><nobr>Supplier&nbsp;</nobr></td>
<td style="color:black;background-color:#F0;text-align:left" colspan="2"><nobr><select name="cboSupplier"><option value="4541-134289">454A</option>
<option value="4542-134289" selected>454B</option></select> <img id="cboSupplier_icon" name="cboSupplier_icon" src="../images/required.gif" alt="*"></nobr></td>
</tr>
<tr>
<td style="font-weight:bold;color:black;background-color:#CC;text-align:right" width="20%"><nobr>Quantity Status&nbsp;</nobr></td>
<td style="color:black;background-color:#F0;text-align:left" colspan="2">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
<table border="0">
<tr>
<td width="1"><input id="choice_IDAMCB3B" name="status" type="radio" value="over"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDAMCB3B.checked=true;">Over</span></td>
<td width="1"><input id="choice_IDARCB3B" name="status" type="radio" value="under"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDARCB3B.checked=true;">Under</span></td>
<td width="1"><input id="choice_IDAWCB3B" name="status" type="radio" value="both"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDAWCB3B.checked=true;">Both</span></td>
<td width="1"><input id="choice_IDA1CB3B" name="status" type="radio" value="all" checked></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDA1CB3B.checked=true;">All</span></td>
</tr>
</table>
</td>
<td><img id="status_icon" name="status_icon" src="../images/blank.gif" alt=""></td>
</tr>
</table>
</td>
</tr>
<tr>
<td style="font-weight:bold;color:black;background-color:#CC">&nbsp;</td>
<td colspan="2" style="font-weight:bold;color:black;background-color:#CC;text-align:left"><input type="submit" name="action-select" value="Query" onclick="doValidate = true;"></td>
</tr>
</table>
</div>
</form>


I don't see any specific URL that would get the relevant data after I hit submit. 
Maybe I am missing something...

Thanks,
Suhas


- Original Message - 
From: Hrvoje Niksic [EMAIL PROTECTED]
To: Suhas Tembe [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, October 07, 2003 5:24 PM
Subject: Re: Web page source using wget?


 Suhas Tembe [EMAIL PROTECTED] writes:
 
  this page contains a drop-down list of our customer's locations.
  At present, I choose one location from the drop-down list & click
  "submit" to get the data, which is displayed in a report format. I
  right-click & then choose "view source" & save source to a file.
  I then choose the next location from the drop-down list, click
  "submit" again. I again do a "view source" & save the source to
  another file and so on for all their locations.
 
 It's possible to automate this, but it requires some knowledge of
 HTML.  Basically, you need to look at the <form>...</form> part of the
 page and find the <select> tag that defines the drop-down.  Assuming
 that the form looks like this:
 
 <form action="http://foo.com/customer" method="GET">
   <select name="location">
     <option value="ca">California</option>
     <option value="ma">Massachussetts</option>
     ...
   </select>
 </form>
 
 you'd automate getting the locations by doing something like:
 
 for loc in ca ma ...
 do
   wget "http://foo.com/customer?location=$loc"
 done
 
 Wget will save the respective sources in files named
 customer?location=ca, customer?location=ma, etc.
 
 But this was only an example.  The actual process depends on what's in
 the form, and it might be considerably more complex than this.
 



Re: Web page source using wget?

2003-10-07 Thread Hrvoje Niksic
Suhas Tembe [EMAIL PROTECTED] writes:

 It does look a little complicated. This is how it looks:

 <form action="InventoryStatus.asp" method="post" [...]
[...]
 <select name="cboSupplier">
 <option value="4541-134289">454A</option>
 <option value="4542-134289" selected>454B</option>
 </select>

Those are the important parts.  It's not hard to submit this form.
With Wget 1.9, you can even use the POST method, e.g.:

wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4541-134289&status=all&action-select=Query' \
 -O InventoryStatus1.asp
wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4542-134289&status=all&action-select=Query' \
 -O InventoryStatus2.asp

It might even work to simply use GET, and retrieve
http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query
without the need for `--post-data' or `-O', but that depends on the
ASP script that does the processing.

The harder part is to automate this process for *any* values in the
drop-down list.  You might need to use an intermediary Perl script
that extracts all the <option value=...> from the HTML source of the
page with the drop-down.  Then, from the output of the Perl script,
you call Wget as shown above.
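
The message above suggests Perl; purely as an illustration of the same idea,
here is a small Python sketch.  The file name, URL, and form field values are
taken from the example commands earlier in the thread and are assumptions,
not a tested recipe.

# Illustration of the "intermediary script" idea above, in Python rather
# than Perl.  The saved page name and the URL are hypothetical.
import re

html = open("InventoryStatus.asp.html").read()     # the saved page source
suppliers = re.findall(r'<option value="([^"]+)"', html)

for value in suppliers:
    print('wget "http://customer.example.com/InventoryStatus.asp" '
          "--post-data 'cboSupplier=%s&status=all&action-select=Query' "
          "-O InventoryStatus-%s.html" % (value, value))

Its output could then be piped to a shell, or the loop could be adapted to
invoke Wget directly.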

It's doable, but it takes some work.  Unfortunately, I don't know of a
(command-line) tool that would make this easier.