POST followed by GET

2003-10-14 Thread Tony Lewis
I'm trying to figure out how to do a POST followed by a GET.

If I do something like:

wget http://www.somesite.com/post.cgi --post-data 'a=1&b=2' 
http://www.somesite.com/getme.html -d

I get the following behavior:

POST /post.cgi HTTP/1.0
snip
[POST data: a=1&b=2]
snip
POST /getme.html HTTP/1.0
snip
[POST data: a=1&b=2]

Is this what is expected? Is there a way I can coax wget to POST to post.cgi and GET 
getme.html?

Tony

Wget 1.8.2 bug

2003-10-14 Thread Sergey Vasilevsky
I use wget 1.8.2.
When I try a recursive download of site.com, where the first page
site.com/ redirects to site.com/xxx.html, whose first link points back to
site.com/, then Wget downloads only xxx.html and stops.
Other links from xxx.html are not followed!



Question about url convert

2003-10-14 Thread Sergey Vasilevsky
Does wget have any rules to convert a retrieved URL to a stored URL?
Or maybe in the future?

For example:
Get -> site.com/index.php?PHPSESSID=123124324
Filter -> /PHPSESSID=[a-z0-9]+//i
Save as -> site.com/index.php


Re: POST followed by GET

2003-10-14 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 I'm trying to figure out how to do a POST followed by a GET.

 If I do something like:

 wget http://www.somesite.com/post.cgi --post-data 'a=1&b=2' 
 http://www.somesite.com/getme.html -d

Well... `--post-data' currently affects all the URLs in the Wget run.
I'm not sure if that makes sense... perhaps it should only apply to
the first one.  But I'm not sure that makes sense either -- what if I
*want* to POST the same data to two URLs, much like you want to POST
to one and GET to the other?

Maybe the right thing would be for `--post-data' to only apply to the
URL it precedes, as in:

wget --post-data=foo URL1 --post-data=bar URL2 URL3

In that case, URL1 would be POSTed with foo, URL2 with bar, and URL3
would be fetched with GET.

But I'm not at all sure that it's even possible to do this and keep
using getopt!

What do the others think?


Re: Wget 1.8.2 bug

2003-10-14 Thread Hrvoje Niksic
Sergey Vasilevsky [EMAIL PROTECTED] writes:

 I use wget 1.8.2.  When I try a recursive download of site.com, where
 the first page site.com/ redirects to site.com/xxx.html, whose first
 link points back to site.com/, then Wget downloads only xxx.html and
 stops.  Other links from xxx.html are not followed!

I've seen pages that do that kind of redirections, but Wget seems to
follow them, for me.  Do you have an example I could try?


Re: Question about url convert

2003-10-14 Thread Hrvoje Niksic
Sergey Vasilevsky [EMAIL PROTECTED] writes:

 Does wget have any rules to convert a retrieved URL to a stored URL?
 Or maybe in the future?

 For example:
 Get -> site.com/index.php?PHPSESSID=123124324
 Filter -> /PHPSESSID=[a-z0-9]+//i
 Save as -> site.com/index.php

The problem with this is that it would require the use of a regexp
library, which I'm trying to avoid for Wget.  There are many different
regexp libraries, many with incompatible syntaxes and interfaces, and
a full-blown regexp library is just too large to carry around with a
program like Wget.

If you can think of a way to get that kind of functionality with
something that is not based on regexps, it will have a much better
chance of getting in.


Re: POST followed by GET

2003-10-14 Thread Tony Lewis
Hrvoje Niksic wrote:

 Maybe the right thing would be for `--post-data' to only apply to the
 URL it precedes, as in:

 wget --post-data=foo URL1 --post-data=bar URL2 URL3

snip
 But I'm not at all sure that it's even possible to do this and keep
 using getopt!

I'll start by saying that I don't know enough about getopt to comment on
whether Hrvoje's suggestion will work.

It's hard to imagine a situation where wget's current behavior makes sense
over multiple URLs. I'm sure someone can come up with an example, but it's
likely to be an unusual case. I see the ability to POST a form as being most
useful when a site requires some kind of form-based authentication to
proceed with looking at other pages within the site.

Some alternatives that occur to me follow.

Alternative #1. Only apply --post-data to the first URL on the command line.
(A simple solution that probably covers the majority of cases.)


Alternative #2. Allow POST and GET as keywords in the URL list so that:

wget POST http://www.somesite.com/post.cgi --post-data 'a=1&b=2' GET
http://www.somesite.com/getme.html

would explicitly specify which URL uses POST and which uses GET. If more
than one POST is specified, all use the same --post-data.


Alternative #3. Look for <form> tags and have --post-file specify the data
to be submitted to various forms:

--form-action=URL1 'a=1&b=2'
--form-action=URL2 'foo=bar'


Alternative #4. Allow complex sessions to be defined using a session file
such as:

wget --session=somefile --user-agent='my robot'

Options specified on the command line apply to every URL. If somefile
contained:

--post-data 'data=foo' POST URL1
--post-data 'data=bar' POST URL2
--referer=URL3 GET URL4

It would be logically equivalent to the following three commands:

wget --user-agent='my robot' --post-data 'data=foo' POST URL1
wget --user-agent='my robot' --post-data 'data=bar' POST URL2
wget --user-agent='my robot' --referer=URL3 GET URL4

with wget's state maintained across the session.

Tony



Re: POST followed by GET

2003-10-14 Thread Daniel Stenberg
On Tue, 14 Oct 2003, Tony Lewis wrote:

 It would be logically equivalent to the following three commands:

 wget --user-agent='my robot' --post-data 'data=foo' POST URL1
 wget --user-agent='my robot' --post-data 'data=bar' POST URL2
 wget --user-agent='my robot' --referer=URL3 GET URL4

Just as a comparison, this approach is basically what we went with in curl
(curl has supported this kind of operation for years, including support for
multipart form posts, which I guess is next up for adding to wget! ;-P).  There
are just too many options and specifics you can set; making them all
changeable between several URLs specified on the command line makes
the command-line parser complicated and the command lines even more complex.

The main thing this described approach requires (that I can think of) is that
wget would need to store session cookies as well in the cookie file (I believe
I read that it doesn't atm).

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


wget and ipv6 (1.6 beta5) serious bugs

2003-10-14 Thread Arkadiusz Miskiewicz
Hi,

Right now wget code looks like this:

#ifdef ENABLE_IPV6
int ip_default_family = AF_INET6;
#else
int ip_default_family = AF_INET;
#endif

and then
./connect.c:  sock = socket (ip_default_family, SOCK_STREAM, 0);


This assumes that a binary compiled with IPv6 support is always used on an 
IPv6-capable host, which is not true in many, many cases. Such a binary on an 
IPv4-only host will cause:

[EMAIL PROTECTED] src]$ LC_ALL=C ./wget wp.pl
--21:48:37--  http://wp.pl/
   => `index.html'
Resolving wp.pl... 212.77.100.101
Connecting to wp.pl[212.77.100.101]:80... failed: Address family not supported 
by protocol.
Retrying.

--21:48:38--  http://wp.pl/
  (try: 2) => `index.html'
Connecting to wp.pl[212.77.100.101]:80... failed: Address family not supported 
by protocol.
Retrying.

--21:48:40--  http://wp.pl/
  (try: 3) => `index.html'
Connecting to wp.pl[212.77.100.101]:80... failed: Address family not supported 
by protocol.
Retrying.


Applications that use getaddrinfo() shouldn't even bother to know which family 
they use. They should just do:

getaddrinfo(host, ..., res0);
for (res = res0; res; res=res-ai_next) {
  s = socket(res-ai_family, res-ai_socktype, res-ai_protocol)
  if (s0)
continue
  if ((connect(s, res-ai_addr, res-ai_addrlen) 0 ) {
 close(s)
 continue)
  }
  break
}

This pseudo-code should show the idea. 


The best thing IMO is to use getaddrinfo for resolving + struct addrinfo 
(linked list) for storing data about host.x.y.com. For systems without 
getaddrinfo, IPv4-only replacements should be provided; see OpenSSH portable 
for how it's done there.

The whole idea of getaddrinfo/getnameinfo is to get family independent 
functions. They even work for AF_UNIX on some systems (like on linux+glibc).
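
Filling in the pseudo-code above, a self-contained sketch of such a family-independent connect loop might look as follows (illustrative only; `connect_to_host` and its interface are assumptions, not actual wget code):

```c
/* Sketch of a family-independent lookup/connect loop.  With AF_UNSPEC,
   getaddrinfo returns only the families the host actually supports, so
   an IPv6-capable binary still works on an IPv4-only host. */
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
connect_to_host (const char *host, const char *port)
{
  struct addrinfo hints, *res0, *res;
  int s = -1;

  memset (&hints, 0, sizeof (hints));
  hints.ai_family = AF_UNSPEC;       /* try AF_INET6 and AF_INET alike */
  hints.ai_socktype = SOCK_STREAM;

  if (getaddrinfo (host, port, &hints, &res0) != 0)
    return -1;

  /* Walk the returned address list; first successful connect wins. */
  for (res = res0; res; res = res->ai_next)
    {
      s = socket (res->ai_family, res->ai_socktype, res->ai_protocol);
      if (s < 0)
        continue;
      if (connect (s, res->ai_addr, res->ai_addrlen) < 0)
        {
          close (s);
          s = -1;
          continue;
        }
      break;                         /* connected */
    }

  freeaddrinfo (res0);
  return s;
}

int
main (void)
{
  int fd = connect_to_host ("localhost", "80");
  if (fd >= 0)
    close (fd);                      /* connect may fail; that's fine here */
  puts ("ok");
  return 0;
}
```

Note that the caller never names an address family; the kernel's answer to getaddrinfo drives the whole loop.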

Anyway, for now a workaround is something like this in main():

#ifdef ENABLE_IPV6
s = socket(AF_INET6, SOCK_STREAM, 0);
if (s < 0 && (errno == EAFNOSUPPORT))
   ip_default_family = AF_INET;
close(s);
#endif

-- 
Arkadiusz Miskiewicz    CS at FoE, Wroclaw University of Technology
arekm.pld-linux.org AM2-6BONE, 1024/3DB19BBD, arekm(at)ircnet, PLD/Linux



bug in 1.8.2 with

2003-10-14 Thread Noèl Köthe
Hello,

With this download you will get a segfault:

wget --passive-ftp --limit-rate 32k -r -nc -l 50 \
-X */binary-alpha,*/binary-powerpc,*/source,*/incoming \
-R alpha.deb,powerpc.deb,diff.gz,.dsc,.orig.tar.gz \
ftp://ftp.gwdg.de/pub/x11/kde/stable/3.1.4/Debian

Philip Stadermann [EMAIL PROTECTED] discovered this problem
and submitted the attached patch.
It's a problem with the linked list.

-- 
Noèl Köthe noel debian.org
Debian GNU/Linux, www.debian.org
--- ftp.c.orig  2003-10-14 15:37:15.0 +0200
+++ ftp.c   2003-10-14 15:39:28.0 +0200
@@ -1670,22 +1670,21 @@
 static uerr_t
 ftp_retrieve_glob (struct url *u, ccon *con, int action)
 {
-  struct fileinfo *orig, *start;
+  struct fileinfo *start;
   uerr_t res;
   struct fileinfo *f;
  
 
   con->cmd |= LEAVE_PENDING;
 
-  res = ftp_get_listing (u, con, &orig);
+  res = ftp_get_listing (u, con, &start);
   if (res != RETROK)
 return res;
-  start = orig;
   /* First: weed out that do not conform the global rules given in
  opt.accepts and opt.rejects.  */
   if (opt.accepts || opt.rejects)
 {
-   f = orig;
+   f = start;
   while (f)
{
  if (f->type != FT_DIRECTORY && !acceptable (f->name))
@@ -1698,7 +1697,7 @@
}
 }
   /* Remove all files with possible harmful names */
-  f = orig;
+  f = start;
   while (f)
   {
  if (has_invalid_name(f->name))


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: POST followed by GET

2003-10-14 Thread Hrvoje Niksic
I like these suggestions.  How about the following: for 1.9, document
that `--post-data' expects one URL and that its behavior for multiple
specified URLs might change in a future version.

Then, for 1.10 we can implement one of the alternative behaviors.



Re: bug in 1.8.2 with

2003-10-14 Thread Hrvoje Niksic
You're right -- that code was broken.  Thanks for the patch; I've now
applied it to CVS with the following ChangeLog entry:

2003-10-15  Philip Stadermann  [EMAIL PROTECTED]

* ftp.c (ftp_retrieve_glob): Correctly loop through the list whose
elements might have been deleted.




Re: POST followed by GET

2003-10-14 Thread Tony Lewis
Hrvoje Niksic wrote:

 I like these suggestions.  How about the following: for 1.9, document
 that `--post-data' expects one URL and that its behavior for multiple
 specified URLs might change in a future version.

 Then, for 1.10 we can implement one of the alternative behaviors.

That works for me... I can hardly wait for 1.9 to get wrapped up so we can
start working on 1.10.

Hrvoje, has anyone mentioned how glad we are that you've come back?

Tony



Re: wget and ipv6 (1.6 beta5) serious bugs

2003-10-14 Thread Hrvoje Niksic
Thanks for the report.  I agree that the current code does not work
for many uses -- that's why IPv6 is still experimental.  Mauro
Tortonesi is working on contributing IPv6 support that works better.

For the impending release, I think the workaround you posted makes
sense.  Mauro, what do you think?