Re: Web page "source" using wget?

2003-10-07 Thread Hrvoje Niksic
"Suhas Tembe" <[EMAIL PROTECTED]> writes:

> It does look a little complicated. This is how it looks:
>
>   <select name="cboSupplier">
>     <option value="4541-134289">454A</option>
>     <option value="4542-134289">454B</option>
>   </select>

Those are the important parts.  It's not hard to submit this form.
With Wget 1.9, you can even use the POST method, e.g.:

wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4541-134289&status=all&action-select=Query' \
 -O InventoryStatus1.asp
wget http://.../InventoryStatus.asp --post-data \
 'cboSupplier=4542-134289&status=all&action-select=Query' \
 -O InventoryStatus2.asp

It might even work to simply use GET, and retrieve
http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query
without the need for `--post-data' or `-O', but that depends on the
ASP script that does the processing.
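
If you try the GET variant from a shell, quote the whole URL so the
shell doesn't interpret the `&' characters, e.g.:

wget 'http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query' \
 -O InventoryStatus1.asp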

The harder part is to automate this process for *any* values in the
drop-down list.  You might need to use an intermediary Perl script
that extracts all the <option> values from the HTML source of the
page with the drop-down.  Then, from the output of the Perl script,
you call Wget as shown above.
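
For example, a rough and untested sketch of that, assuming the page
with the drop-down has been saved as page.html and its options carry
value="..." attributes like the ones you posted (one per line):

for supplier in $(sed -n 's/.*<option value="\([^"]*\)".*/\1/p' page.html)
do
  wget http://.../InventoryStatus.asp --post-data \
    "cboSupplier=$supplier&status=all&action-select=Query" \
    -O "InventoryStatus-$supplier.asp"
done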

It's doable, but it takes some work.  Unfortunately, I don't know of a
(command-line) tool that would make this easier.



Re: Web page "source" using wget?

2003-10-07 Thread Suhas Tembe
It does look a little complicated. This is how it looks:

<form method="POST" action="InventoryStatus.asp">

  Supplier
  <select name="cboSupplier">
    <option value="4541-134289">454A</option>
    <option value="4542-134289">454B</option>
  </select>

  Quantity Status
  <select name="status">
    <option>Over</option>
    <option>Under</option>
    <option>Both</option>
    <option value="all">All</option>
  </select>

  <input type="submit" name="action-select" value="Query">

</form>


I don't see any specific URL that would get the relevant data after I hit submit. 
Maybe I am missing something...

Thanks,
Suhas


- Original Message - 
From: "Hrvoje Niksic" <[EMAIL PROTECTED]>
To: "Suhas Tembe" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, October 07, 2003 5:24 PM
Subject: Re: Web page "source" using wget?


> "Suhas Tembe" <[EMAIL PROTECTED]> writes:
> 
> > this page contains a "drop-down" list of our customer's locations.
> > At present, I choose one location from the "drop-down" list & click
> > submit to get the data, which is displayed in a report format. I
> > "right-click" & then choose "view source" & save "source" to a file.
> > I then choose the next location from the "drop-down" list, click
> > submit again. I again do a "view source" & save the source to
> > another file and so on for all their locations.
> 
> It's possible to automate this, but it requires some knowledge of
> HTML.  Basically, you need to look at the <form>...</form> part of the
> page and find the <select> tag that defines the drop-down.  Assuming
> that the form looks like this:
> 
> <form action="http://foo.com/customer" method=GET>
>   <select name="location">
>     <option value="ca">California</option>
>     <option value="ma">Massachusetts</option>
>     ...
>   </select>
> </form>
> 
> you'd automate getting the locations by doing something like:
> 
> for loc in ca ma ...
> do
>   wget "http://foo.com/customer?location=$loc"
> done
> 
> Wget will save the respective sources in files named
> "customer?location=ca", "customer?location=ma", etc.
> 
> But this was only an example.  The actual process depends on what's in
> the form, and it might be considerably more complex than this.
> 



Re: Web page "source" using wget?

2003-10-07 Thread Hrvoje Niksic
"Suhas Tembe" <[EMAIL PROTECTED]> writes:

> this page contains a "drop-down" list of our customer's locations.
> At present, I choose one location from the "drop-down" list & click
> submit to get the data, which is displayed in a report format. I
> "right-click" & then choose "view source" & save "source" to a file.
> I then choose the next location from the "drop-down" list, click
> submit again. I again do a "view source" & save the source to
> another file and so on for all their locations.

It's possible to automate this, but it requires some knowledge of
HTML.  Basically, you need to look at the <form>...</form> part of the
page and find the <select> tag that defines the drop-down.  Assuming
that the form looks like this:

<form action="http://foo.com/customer" method=GET>
  <select name="location">
    <option value="ca">California</option>
    <option value="ma">Massachusetts</option>
    ...
  </select>
</form>

you'd automate getting the locations by doing something like:

for loc in ca ma ...
do
  wget "http://foo.com/customer?location=$loc";
done

Wget will save the respective sources in files named
"customer?location=ca", "customer?location=ma", etc.

But this was only an example.  The actual process depends on what's in
the form, and it might be considerably more complex than this.



Re: Web page "source" using wget?

2003-10-07 Thread Suhas Tembe
Got it! Thanks! So far so good. After logging-in, I was able to get to the page I am 
interested in. There was one thing that I forgot to mention in my earlier posts (I 
apologize)... this page contains a "drop-down" list of our customer's locations. At 
present, I choose one location from the "drop-down" list & click submit to get the 
data, which is displayed in a report format. I "right-click" & then choose "view 
source" & save "source" to a file. I then choose the next location from the 
"drop-down" list, click submit again. I again do a "view source" & save the source to 
another file and so on for all their locations.

I am not quite sure how to automate this process! How can I do this non-interactively, 
especially the "submit" portion of the page? Is this possible using wget?

Thanks,
Suhas

- Original Message - 
From: "Hrvoje Niksic" <[EMAIL PROTECTED]>
To: "Suhas Tembe" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, October 07, 2003 5:02 PM
Subject: Re: Web page "source" using wget?


> "Suhas Tembe" <[EMAIL PROTECTED]> writes:
> 
> > Thanks everyone for the replies so far..
> >
> > The problem I am having is that the customer is using ASP & Java
> > script. The URL stays the same as I click through the links.
> 
> URL staying the same is usually a sign of the use of frames, not of ASP
> and JavaScript.  Instead of looking at the URL entry field, try using
> "copy link to clipboard" instead of clicking on the last link.  Then
> use Wget on that.
> 



Re: Web page "source" using wget?

2003-10-07 Thread Hrvoje Niksic
"Suhas Tembe" <[EMAIL PROTECTED]> writes:

> Thanks everyone for the replies so far..
>
> The problem I am having is that the customer is using ASP & Java
> script. The URL stays the same as I click through the links.

URL staying the same is usually a sign of the use of frames, not of ASP
and JavaScript.  Instead of looking at the URL entry field, try using
"copy link to clipboard" instead of clicking on the last link.  Then
use Wget on that.



Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Hrvoje Niksic
Josh Brooks <[EMAIL PROTECTED]> writes:

>> > At first it will act normally, just going over the site in question, but
>> > sometimes, you will come back to the terminal and see it grabbing all
>> > sorts of pages from totally different sites (!)
>>
>> The only way I've seen it happen is when it follows a redirection to a
>> different site.  The redirection is followed because it's considered
>> to be part of the same download.  However, further links on the
>> redirected site are not (supposed to be) followed.
>
> Ok, is there a way to tell wget not to follow redirects, so it will
> not ever do that at all ?

Not yet, sorry.  But people have asked for it a lot, so it'll probably
make it in after 1.9.



Re: Web page "source" using wget?

2003-10-07 Thread Suhas Tembe
Thanks everyone for the replies so far..

The problem I am having is that the customer is using ASP & JavaScript. The URL stays 
the same as I click through the links. So, using "wget URL" for the page I want may 
not work (I may be wrong). Any suggestions on how I can tackle this?

Thanks,
Suhas

- Original Message - 
From: "Hrvoje Niksic" <[EMAIL PROTECTED]>
To: "Suhas Tembe" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, October 06, 2003 5:19 PM
Subject: Re: Web page "source" using wget?


> "Suhas Tembe" <[EMAIL PROTECTED]> writes:
> 
> > Hello Everyone,
> >
> > I am new to this wget utility, so pardon my ignorance.. Here is a
> > brief explanation of what I am currently doing:
> >
> > 1). I go to our customer's website every day & log in using a User Name & Password.
> > 2). I click on 3 links before I get to the page I want.
> > 3). I right-click on the page & choose "view source". It opens it up in Notepad.
> > 4). I save the "source" to a file & subsequently perform various tasks on that 
> > file.
> >
> > As you can see, it is a manual process. What I would like to do is
> > automate this process of obtaining the "source" of a page using
> > wget. Is this possible? Maybe you can give me some suggestions.
> 
> It's possible, in fact it's what Wget does in its most basic form.
> Disregarding authentication, the recipe would be:
> 
> 1) Write down the URL.
> 
> 2) Type `wget URL' and you get the source of the page in file named
>SOMETHING.html, where SOMETHING is the file name that the URL ends
>with.
> 
> Of course, you will also have to specify the credentials to the page,
> and Tony explained how to do that.
> 



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

> That would work for short streaming, but would be pretty bad in the
> mkisofs example.  One would expect Wget to be able to stream the data
> to the server, and that's just not possible if the size needs to be
> known in advance, which HTTP/1.0 requires.

One might expect it, but if it's not possible using the HTTP protocol, what
can you do? :-)



Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Thank you for the great response.  It is much appreciated - see below...

On Tue, 7 Oct 2003, Hrvoje Niksic wrote:

> www.zorg.org/vsound/ contains this markup:
>
> <meta name="robots" content="nofollow">
>
> That explicitly tells robots, such as Wget, not to follow the links in
> the page.  Wget respects this and does not follow the links.  You can
> tell Wget to ignore the robot directives.  For me, this works as
> expected:
>
> wget -km -e robots=off http://www.zorg.org/vsound/

Perfect - thank you.


> > At first it will act normally, just going over the site in question, but
> > sometimes, you will come back to the terminal and see it grabbing all
> > sorts of pages from totally different sites (!)
>
> The only way I've seen it happen is when it follows a redirection to a
> different site.  The redirection is followed because it's considered
> to be part of the same download.  However, further links on the
> redirected site are not (supposed to be) followed.

Ok, is there a way to tell wget not to follow redirects, so it will not
ever do that at all ?  Basically I am looking for a way to tell wget
"don't ever get anything with a different FQDN than what I started you
with"

thanks.



Re: Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Hrvoje Niksic
Josh Brooks <[EMAIL PROTECTED]> writes:

> I have noticed very unpredictable behavior from wget 1.8.2 -
> specifically I have noticed two things:
>
> a) sometimes it does not follow all of the links it should
>
> b) sometimes wget will follow links to other sites and URLs - when the
> command line used should not allow it to do that.

Thanks for the report.  A more detailed response follows below:

> First, sometimes when you attempt to download a site with -k -m
> (--convert-links and --mirror) wget will not follow all of the links and
> will skip some of the files!
>
> I have no idea why it does this with some sites and doesn't do it with
> other sites.  Here is an example that I have reproduced on several systems
> - all with 1.8.2:

Links are missed on some sites because of the use of incorrect
comments.  This has been fixed for Wget 1.9, where a more relaxed
comment parsing code is the default.  But that's not the case for
www.zorg.org/vsound/.

www.zorg.org/vsound/ contains this markup:

<meta name="robots" content="nofollow">

That explicitly tells robots, such as Wget, not to follow the links in
the page.  Wget respects this and does not follow the links.  You can
tell Wget to ignore the robot directives.  For me, this works as
expected:

wget -km -e robots=off http://www.zorg.org/vsound/

You can put `robots=off' in your .wgetrc and this problem will not
bother you again.
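
For reference, the .wgetrc line is simply:

robots = off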

> The second problem, and I cannot currently give you an example to try
> yourself but _it does happen_, is if you use this command line:
>
> wget --tries=inf -nH --no-parent
> --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
> --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
> MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com
>
> At first it will act normally, just going over the site in question, but
> sometimes, you will come back to the terminal and see it grabbing all
> sorts of pages from totally different sites (!)

The only way I've seen it happen is when it follows a redirection to a
different site.  The redirection is followed because it's considered
to be part of the same download.  However, further links on the
redirected site are not (supposed to be) followed.

If you have a repeatable example, please mail it here so we can
examine it in more detail.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>
>> I don't understand what you're proposing.  Reading the whole file in
>> memory is too memory-intensive for large files (one could presumably
>> POST really huge files, CD images or whatever).
>
> I was proposing that you read the file to determine the length, but
> that was on the assumption that you could read the input twice,
> which won't work with the example you proposed.

In fact, it won't work with anything except regular files and links to
them.

> Can you determine if --post-file is a regular file?

Yes.

> If so, I still think you should just read (or otherwise examine) the
> file to determine the length.

That's how --post-file works now.  The problem is that it doesn't work
for non-regular files.  My first message explains it, or at least
tries to.

> For other types of input, perhaps you want to write the input to a
> temporary file.

That would work for short streaming, but would be pretty bad in the
mkisofs example.  One would expect Wget to be able to stream the data
to the server, and that's just not possible if the size needs to be
known in advance, which HTTP/1.0 requires.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

> I don't understand what you're proposing.  Reading the whole file in
> memory is too memory-intensive for large files (one could presumably
> POST really huge files, CD images or whatever).

I was proposing that you read the file to determine the length, but that was
on the assumption that you could read the input twice, which won't work with
the example you proposed.

> It would be really nice to be able to say something like:
>
> mkisofs blabla | wget http://burner/localburn.cgi --post-file
> /dev/stdin

Stefan Eissing wrote:

> I just checked with RFC 1945 and it explicitly says that POSTs must
> carry a valid Content-Length header.

In that case, Hrvoje will need to get creative. :-)

Can you determine if --post-file is a regular file? If so, I still think you
should just read (or otherwise examine) the file to determine the length.

For other types of input, perhaps you want to write the input to a temporary
file.
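
In other words, something along these lines already works today, at the
cost of the disk space and of giving up the streaming (the temporary
path is of course arbitrary):

mkisofs blabla > /tmp/image.iso
wget --post-file=/tmp/image.iso http://burner/localburn.cgi
rm /tmp/image.iso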

Tony



Major, and seemingly random problems with wget 1.8.2

2003-10-07 Thread Josh Brooks

Hello,

I have noticed very unpredictable behavior from wget 1.8.2 - specifically
I have noticed two things:

a) sometimes it does not follow all of the links it should

b) sometimes wget will follow links to other sites and URLs - when the
command line used should not allow it to do that.


Here are the details.


First, sometimes when you attempt to download a site with -k -m
(--convert-links and --mirror) wget will not follow all of the links and
will skip some of the files!

I have no idea why it does this with some sites and doesn't do it with
other sites.  Here is an example that I have reproduced on several systems
- all with 1.8.2:

# wget -k -m http://www.zorg.org/vsound/
--17:09:32--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Resolving www.zorg.org... done.
Connecting to www.zorg.org[213.232.100.31]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                 ] 12,235        53.82K/s

Last-modified header missing -- time-stamps turned off.
17:09:32 (53.82 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]


FINISHED --17:09:32--
Downloaded: 12,235 bytes in 1 files
Converting www.zorg.org/vsound/index.html... 2-6
Converted 1 files in 0.03 seconds.


What is the problem here ?  When I run the exact same command line with
wget 1.6, I get this:


# wget -k -m http://www.zorg.org/vsound/
--11:10:06--  http://www.zorg.org/vsound/
   => `www.zorg.org/vsound/index.html'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

0K -> .. .

Last-modified header missing -- time-stamps turned off.
11:10:07 (71.12 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

Loading robots.txt; please ignore errors.
--11:10:07--  http://www.zorg.org/robots.txt
   => `www.zorg.org/robots.txt'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 404 Not Found
11:10:07 ERROR 404: Not Found.

--11:10:07--  http://www.zorg.org/vsound/vsound.jpg
   => `www.zorg.org/vsound/vsound.jpg'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 27,629 [image/jpeg]

0K -> .. .. ..   [100%]

11:10:08 (51.49 KB/s) - `www.zorg.org/vsound/vsound.jpg' saved
[27629/27629]

--11:10:09--  http://www.zorg.org/vsound/vsound-0.2.tar.gz
   => `www.zorg.org/vsound/vsound-0.2.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 108,987 [application/x-tar]

0K -> .. .. .. .. .. [ 46%]
   50K -> .. .. .. .. .. [ 93%]
  100K -> .. [100%]

11:10:12 (46.60 KB/s) - `www.zorg.org/vsound/vsound-0.2.tar.gz' saved
[108987/108987]

--11:10:12--  http://www.zorg.org/vsound/vsound-0.5.tar.gz
   => `www.zorg.org/vsound/vsound-0.5.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 116,904 [application/x-tar]

0K -> .. .. .. .. .. [ 43%]
   50K -> .. .. .. .. .. [ 87%]
  100K -> .. [100%]

11:10:14 (60.44 KB/s) - `www.zorg.org/vsound/vsound-0.5.tar.gz' saved
[116904/116904]

--11:10:14--  http://www.zorg.org/vsound/vsound
   => `www.zorg.org/vsound/vsound'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 3,365 [text/plain]

0K -> ...[100%]

11:10:14 (3.21 MB/s) - `www.zorg.org/vsound/vsound' saved [3365/3365]

Converting www.zorg.org/vsound/index.html... done.

FINISHED --11:10:14--
Downloaded: 269,120 bytes in 5 files
Converting www.zorg.org/vsound/index.html... done.


See ?  It gets the links inside of index.html, and mirrors those links,
and converts them - just like it should.  Why does 1.8.2 have a problem
with this site ?  Other sites are handled just fine by 1.8.2 with the same
command line ... it makes no sense that wget 1.8.2 has problems with
particular web sites.

This is incorrect behavior - and if you try the same URL with 1.8.2 you
can reproduce the same results.




The second problem, and I cannot currently give you an example to try
yourself but _it does happen_, is if you use this command line:

wget --tries=inf -nH --no-parent
--directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf
--convert-links --html-extension --user-agent="Mozilla/4.0 (compatible;
MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it will act normally, just going over the site in question, but
sometimes, you will come back to the terminal and see it grabbing all
sorts of pages from totally different sites (!)  I have seen

Re: [PATCH] wget-1.8.2: Portability, plus EBCDIC patch

2003-10-07 Thread Hrvoje Niksic
Martin, thanks for the patch and the detailed report.  Note that it
might have made more sense to apply the patch to the latest CVS
version, which is somewhat different from 1.8.2.

I'm really not sure whether to add this patch.  On the one hand, it's
nice to support as many architectures as possible.  But on the other
hand, most systems are ASCII.  All the systems I've ever seen or
worked on have been ASCII.  I am fairly certain that I would not be
able to support EBCDIC in the long run and that, unless someone were
to continually support EBCDIC, the existing support would bitrot away.

Is anyone on the Wget list using an EBCDIC system?


[PATCH] wget-1.8.2: Portability, plus EBCDIC patch

2003-10-07 Thread Martin Kraemer
Hello Hrvoje and Dan,

I have been using wget for many years now, and finally got to applying
a patch I made long ago (EBCDIC patch against wget-1.5.3) to the
current wget-1.8.2. This patch makes wget compile and run on a
mainframe computer using the EBCDIC character set.

Also, when compiling wget on Solaris (using the SUNWspro "Forte"
compiler), I stumbled over a portability problem (C++ comments in a 
C source) to which I add a patch as well.

About the EBCDIC patch:
* The goal was to create a patch which worked for our EBCDIC system
  (Fujitsu-Siemens' mainframe OS is called BS2000, it runs on /390
  hardware, but is not compatible with OS/390 per se) but would be
  easily adaptable to OS/390 (to which I have no access, but whose
  behaviour I know from similar ports). The code to actually make
  it work for OS/390 is not in place, but I added a tool (called
  safe-ctype-mk.c -- delete if you don't like it) to create the
  additions to safe-ctype.c which are necessary because IBM's
  EBCDIC differs from "our" EBCDIC.

* Because code conversion is necessary for text files, a distinction
  between "text" and "binary" download was added (based on the
  downloaded MIME type; see the routines http_set_convert_flag() and
  http_get_convert_flag(). A future patch may add a new
  --conversion=text/binary/auto switch which is not implemented
  yet.)  Currently, the same heuristics are used as in the Apache
  HTTP server to determine whether conversion is required (for
  several kinds of text files) or not required (for images,
  compressed files etc.)

* Because EBCDIC alphabetic characters live in the range between
  '\xA1' and '\xE9', the getopt_long() numbers have been shifted up
  by 200, beyond the 0xFF boundary, to avoid conflicts between
  single-character options and numeric long-option values. That does
  not change the behaviour on ASCII machines, but allows the source
  to compile on EBCDIC machines (otherwise: error: multiple case in
  switch).

* wget-1.8.2 has been compiled on our BS2000, with the patch applied,
  and with SSL enabled (against openssl-0.9.6k), and has been tested
  to work correctly.

If you would add the patch to future versions of wget, then all
users of our BS2000 as well as users of IBM's OS/390 could take
advantage of the availability of wget for EBCDIC-based machines, and
hopefully someone would also contribute the missing IBM-EBCDIC
counterparts to our BS2000-EBCDIC patch.

  Martin
-- 
<[EMAIL PROTECTED]> | Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-47655 | 81730  Munich,  Germany
diff -bur wget-1.8.2/src/ftp.c work/wget-1.8.2/src/ftp.c
--- wget-1.8.2/src/ftp.c.orig   2003-10-06 17:20:58.710178000 +0200
+++ wget-1.8.2/src/ftp.c    2003-10-06 17:17:00.399371000 +0200
@@ -474,7 +474,7 @@
}
 
   err = ftp_size(&con->rbuf, u->file, len);
-//  printf("\ndebug: %lld\n", *len);
+/*  printf("\ndebug: %lld\n", *len); */
   /* FTPRERR */
   switch (err)
{
diff -bur wget-1.8.2/src/http.c work/wget-1.8.2/src/http.c
--- wget-1.8.2/src/http.c.orig  2003-10-06 17:20:58.900182000 +0200
+++ wget-1.8.2/src/http.c   2003-10-06 17:19:16.829836000 +0200
@@ -1777,7 +1777,7 @@
  FREE_MAYBE (dummy);
  return RETROK;
}
-//  fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime);
+/*  fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime); */
   tmrate = retr_rate (hstat.len - hstat.restval, hstat.dltime, 0);
 
   if (hstat.len == hstat.contlen)
diff -bur wget-1.8.2.orig/src/connect.c wget-1.8.2/src/connect.c
--- wget-1.8.2.orig/src/connect.c   Mon Oct  6 17:13:11 2003
+++ wget-1.8.2/src/connect.c    Mon Oct  6 17:10:28 2003
@@ -47,6 +47,10 @@
 #endif
 #endif /* WINDOWS */
 
+#if #system(bs2000)
+#include 
+#endif
+
 #include 
 #ifdef HAVE_STRING_H
# include <string.h>
@@ -73,6 +77,26 @@
to connect_to_one.  */
 static const char *connection_host_name;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+/* Start off with convert=1 (headers are always converted) */
+static int convert_flag_last_reply = 1;
+
+void
+http_set_convert_flag(const char *type)
+{
+convert_flag_last_reply = 
+   (strncasecmp(type, "text/", 5) == 0 
+   || strncasecmp(type, "message/", 8) == 0 
+   || strcasecmp(type, "application/postscript") == 0);
+}
+
+int
+http_get_convert_flag()
+{
+return convert_flag_last_reply;
+}
+#endif
+ 
 void
 set_connection_host_name (const char *host)
 {
@@ -459,6 +483,11 @@
 }
   while (res == -1 && errno == EINTR);
 
+#if 'A' == '\xC1'
+  if (res > 0 && http_get_convert_flag())
+_a2e_n(buf,res);
+#endif
+
   return res;
 }
 
@@ -472,6 +501,25 @@
 {
   int res = 0;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+  static char *cbuf = NULL;
+  static int csize = 0;
+
+  if (len > csize) {
+if (cbuf != NULL)
+  free(cbuf);
+cbuf = malloc(csize = len+8192); /* add arbitrary amount of skew */
+   

Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
On Tuesday, 07.10.03, at 17:02 (Europe/Berlin), Hrvoje Niksic wrote:

>> That's probably true. But have you tried sending without
>> Content-Length and Connection: close and closing the output side of
>> the socket before starting to read the reply from the server?
>
> That might work, but it sounds too dangerous to do by default, and too
> obscure to devote a command-line option to.  Besides, HTTP/1.1
> *requires* requests with a request-body to provide Content-Length:
>
>    For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
>    containing a message-body MUST include a valid Content-Length
>    header field unless the server is known to be HTTP/1.1 compliant.

I just checked with RFC 1945 and it explicitly says that POSTs must
carry a valid Content-Length header.

That leaves the option of first sending an OPTIONS request to the
server (either the URL or *) to check the HTTP version.

//Stefan




Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
Stefan Eissing <[EMAIL PROTECTED]> writes:

> On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote:
>> What the current code does is: determine the file size, send
>> Content-Length, read the file in chunks (up to the promised size) and
>> send those chunks to the server.  But that works only with regular
>> files.  It would be really nice to be able to say something like:
>>
>> mkisofs blabla | wget http://burner/localburn.cgi --post-file
>> /dev/stdin
>
> That would indeed be nice. Since I'm coming from the WebDAV side
> of life: does wget allow the use of PUT?

No.

>> I haven't checked, but I'm 99% convinced that browsers simply don't
>> give a shit about non-regular files.
>
> That's probably true. But have you tried sending without
> Content-Length and Connection: close and closing the output side of
> the socket before starting to read the reply from the server?

That might work, but it sounds too dangerous to do by default, and too
obscure to devote a command-line option to.  Besides, HTTP/1.1
*requires* requests with a request-body to provide Content-Length:

   For compatibility with HTTP/1.0 applications, HTTP/1.1 requests
   containing a message-body MUST include a valid Content-Length
   header field unless the server is known to be HTTP/1.1 compliant.


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder <[EMAIL PROTECTED]> writes:

> I guess, you as the wget maintainer switched from something
> supported to the unsupported "betaX" scheme and now we have
> something to talk about ;)

I had no idea that something as common as "betaX" was unsupported.  In
fact, I believe that "bX" was added when Francois saw me using it in
Wget.  :-)

> Using something different then exactly "wget-1.9-b3.de.po" will
> confuse the robot



>> Returning an error that says "your version number is unparsable to
>> this piece of software, you must use one of <...> instead" would be
>> more correct in the long run.
>
> Sure.  You should have receive a message like this, didn't you?

I didn't.  Maybe it was an artifact of the robot not having worked at the
time, though.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote:

> What the current code does is: determine the file size, send
> Content-Length, read the file in chunks (up to the promised size) and
> send those chunks to the server.  But that works only with regular
> files.  It would be really nice to be able to say something like:
>
> mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin

That would indeed be nice. Since I'm coming from the WebDAV side
of life: does wget allow the use of PUT?

>>> My first impulse was to bemoan Wget's antiquated HTTP code which
>>> doesn't understand "chunked" transfer.  But, coming to think of it,
>>> even if Wget used HTTP/1.1, I don't see how a client can send
>>> chunked requests and interoperate with HTTP/1.0 servers.
>>
>> How do browsers figure out whether they can do a chunked transfer or
>> not?
>
> I haven't checked, but I'm 99% convinced that browsers simply don't
> give a shit about non-regular files.

That's probably true. But have you tried sending without Content-Length
and Connection: close and closing the output side of the socket before
starting to read the reply from the server?

//Stefan




Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>
>> Please be aware that Wget needs to know the size of the POST
>> data in advance.  Therefore the argument to @code{--post-file}
>> must be a regular file; specifying a FIFO or something like
>> @file{/dev/stdin} won't work.
>
> There's nothing that says you have to read the data after you've
> started sending the POST. Why not just read the --post-file before
> constructing the request so that you know how big it is?

I don't understand what you're proposing.  Reading the whole file in
memory is too memory-intensive for large files (one could presumably
POST really huge files, CD images or whatever).

What the current code does is: determine the file size, send
Content-Length, read the file in chunks (up to the promised size) and
send those chunks to the server.  But that works only with regular
files.  It would be really nice to be able to say something like:

mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin

>> My first impulse was to bemoan Wget's antiquated HTTP code which
>> doesn't understand "chunked" transfer.  But, coming to think of it,
>> even if Wget used HTTP/1.1, I don't see how a client can send
>> chunked requests and interoperate with HTTP/1.0 servers.
>
> How do browsers figure out whether they can do a chunked transfer or
> not?

I haven't checked, but I'm 99% convinced that browsers simply don't
give a shit about non-regular files.


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

> Please be aware that Wget needs to know the size of the POST data
> in advance.  Therefore the argument to @code{--post-file} must be
> a regular file; specifying a FIFO or something like
> @file{/dev/stdin} won't work.

There's nothing that says you have to read the data after you've started
sending the POST. Why not just read the --post-file before constructing the
request so that you know how big it is?

> My first impulse was to bemoan Wget's antiquated HTTP code which
> doesn't understand "chunked" transfer.  But, coming to think of it,
> even if Wget used HTTP/1.1, I don't see how a client can send chunked
> requests and interoperate with HTTP/1.0 servers.

How do browsers figure out whether they can do a chunked transfer or not?

Tony



Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic <[EMAIL PROTECTED]> writes:
>
>> Ouch.  Why does the robot care about version names at all?
>
> It must know about the sequences; this is important for merging
> issues.  IIRC, we have at least these sequences supported by the
> robot:
>
> 1.2 -> 1.2.1 -> 1.2.2 -> 1.3 etc.
>
> 1.2 -> 1.2a -> 1.2b -> 1.3
>
> 1.2 -> 1.3-pre1 -> 1.3-pre2 -> 1.3
>
> 1.2 -> 1.3-b1 -> 1.3-b2 -> 1.3

Thanks for the clarification, Karl.  But as a maintainer of a project
that tries to use the robot, I must say that I'm not happy about this.

If the robot absolutely must be able to collate versions, then it
should be smarter about it and support a larger array of formats in
use out there.  See `dpkg' for an example of how it can be done,
although the TP robot certainly doesn't need to do all that `dpkg'
does.

This way, unless I'm missing something, the robot seems to be in the
position to dictate its very narrow-minded versioning scheme to the
projects that would only like to use it (the robot).  That's really
bad.  But what's even worse is that something or someone silently
changed "beta3" to "b3" in the POT, and then failed to perform the
same change for my translation, which caused it to get dropped without
notice.  Returning an error that says "your version number is
unparsable to this piece of software, you must use one of <...>
instead" would be more correct in the long run.

Is the robot written in Python?  Would you consider it for inclusion
if I donated a function that performed the comparison more fully
(provided, of course, that the code meets your standards of quality)?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic <[EMAIL PROTECTED]> writes:
>
>> I'm not sure what "b3" is, but the version in the POT file was
>> supposed to be "beta3".  Was there a misunderstanding somewhere along
>> the line?
>
> Yes, the robot does not like beta3 as part of the version
> string. "b3" or "pre3" are okay.

Ouch.  Why does the robot care about version names at all?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder <[EMAIL PROTECTED]> writes:

>> Also, my Croatian translation of 1.9 doesn't seem to have made it
>> in.  Is that expected?
>
> Unfortunately, yes.  Will you please resubmit it with the subject line
> updated (IIRC, it's now):
>
> TP-Robot wget-1.9-b3.hr.po

I'm not sure what "b3" is, but the version in the POT file was
supposed to be "beta3".  Was there a misunderstanding somewhere along
the line?


Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Karl Eichwalder <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic <[EMAIL PROTECTED]> writes:
>
>> As for the Polish translation, translations are normally handled
>> through the Translation Project.  The TP robot is currently down, but
>> I assume it will be back up soon, and then we'll submit the POT file
>> and update the translations /en masse/.
>
> It took a little bit longer than expected but now, the robot is up and
> running again.  This morning (CET) I installed b3 for translation.

However, http://www2.iro.umontreal.ca/~gnutra/registry.cgi?domain=wget
still shows `wget-1.8.2.pot' to be the "current template for [the]
domain".  Also, my Croatian translation of 1.9 doesn't seem to have
made it in.  Is that expected?


Re: -q and -S are incompatible

2003-10-07 Thread Hrvoje Niksic
Dan Jacobson <[EMAIL PROTECTED]> writes:

> -q and -S are incompatible and should perhaps produce errors and be
> noted thus in the docs.

They seem to work as I'd expect -- `-q' tells Wget to print *nothing*,
and that's what happens.  The output Wget would have generated does
contain HTTP headers, among other things, but it never gets printed.

> BTW, there seems no way to get the -S output, but no progress
> indicator.  -nv, -q kill them both.

It's a bug that `-nv' kills `-S' output, I think.

> P.S. one shouldn't have to confirm each bug submission. Once should
> be enough.

You're right.  :-(  I'll ask the sunsite people if there's a way to
establish some form of white lists...



Re: some wget patches against beta3

2003-10-07 Thread Hrvoje Niksic
Thanks!



Re: some wget patches against beta3

2003-10-07 Thread Karl Eichwalder
Hrvoje Niksic <[EMAIL PROTECTED]> writes:

> As for the Polish translation, translations are normally handled
> through the Translation Project.  The TP robot is currently down, but
> I assume it will be back up soon, and then we'll submit the POT file
> and update the translations /en masse/.

It took a little bit longer than expected but now, the robot is up and
running again.  This morning (CET) I installed b3 for translation.



Re: wget 1.9 - behaviour change in recursive downloads

2003-10-07 Thread Jochen Roderburg
Quoting Hrvoje Niksic <[EMAIL PROTECTED]>:

> Jochen Roderburg <[EMAIL PROTECTED]> writes:
> 
> > Quoting Hrvoje Niksic <[EMAIL PROTECTED]>:
> >
> >> It's a feature.  `-A zip' means `-A zip', not `-A zip,html'.  Wget
> >> downloads the HTML files only because it absolutely has to, in order
> >> to recurse through them.  After it finds the links in them, it deletes
> >> them.
> >
> > Hmm, so it has really been an undetected error over all the years
> > ;-) ?
> 
> s/undetected/unfixed/
> 
> At least I've always considered it an error.  I didn't know people
> depended on it.

Well, *depend* is a rather strong expression for that ;-)
It always worked that way, I got used to it, and I never really thought about
whether it was correct or not, because I had a use for it. So I was astonished
when these files suddenly disappeared.

As I wrote already, I will mention them explicitly now. I think the worst that
will happen is that I get a few more of them than before.
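
In other words, something like this instead of plain `-A zip' (URL made
up, of course):

wget -r -A 'zip,html' http://www.example.com/archive/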

Perhaps the whole thing could be mentioned in the documentation of the
accept/reject options. Currently there is only this sentence there:

>> Note that these two options do not affect the downloading of HTML
>> files; Wget must load all the HTMLs to know where to go at
>> all--recursive retrieval would make no sense otherwise.

J. Roderburg





Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Daniel Stenberg
On Tue, 7 Oct 2003, Hrvoje Niksic wrote:

> My first impulse was to bemoan Wget's antiquated HTTP code which doesn't
> understand "chunked" transfer.  But, coming to think of it, even if Wget
> used HTTP/1.1, I don't see how a client can send chunked requests and
> interoperate with HTTP/1.0 servers.
>
> The thing is, to be certain that you can use chunked transfer, you
> have to know you're dealing with an HTTP/1.1 server.  But you can't
> know that until you receive a response.  And you don't get a response
> until you've finished sending the request.  A chicken-and-egg problem!

The only way to deal with this automatically, that I can think of, is to use a
"Expect: 100-continue" request-header and based on the 100-response you can
decide if the server is 1.1 or not.
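
Roughly, the request would start out like this (illustrative headers
only, not anything Wget actually sends today):

POST /localburn.cgi HTTP/1.1
Host: burner
Expect: 100-continue
Transfer-Encoding: chunked

If the server answers with "HTTP/1.1 100 Continue", the client knows it
can stream a chunked body; if no 100 arrives (or an HTTP/1.0 status
comes back), it falls back to buffering the data and sending
Content-Length.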

Other than that, I think a command line option is the only choice.

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Stefan Eissing
Theoretically, an HTTP/1.0 server should accept an unknown content-length
if the connection is closed after the request.
Unfortunately, the response 411 Length Required is only defined in
HTTP/1.1.

//Stefan

On Tuesday, 07.10.03, at 01:12 (Europe/Berlin), Hrvoje Niksic wrote:

> As I was writing the manual for `--post', I decided that I wasn't
> happy with this part:
>
>     Please be aware that Wget needs to know the size of the POST data
>     in advance.  Therefore the argument to @code{--post-file} must be
>     a regular file; specifying a FIFO or something like
>     @file{/dev/stdin} won't work.
>
> My first impulse was to bemoan Wget's antiquated HTTP code which
> doesn't understand "chunked" transfer.  But, coming to think of it,
> even if Wget used HTTP/1.1, I don't see how a client can send chunked
> requests and interoperate with HTTP/1.0 servers.
>
> The thing is, to be certain that you can use chunked transfer, you
> have to know you're dealing with an HTTP/1.1 server.  But you can't
> know that until you receive a response.  And you don't get a response
> until you've finished sending the request.  A chicken-and-egg problem!
>
> Of course, once a response is received, we could remember that we're
> dealing with an HTTP/1.1 server, but that information is all but
> useless, since Wget's `--post' is typically used to POST information
> to one URL and exit.
>
> Is there a sane way to stream data to HTTP/1.0 servers that expect
> POST?