RE: wget re-download fully downloaded files

2008-10-27 Thread Tony Lewis
Micah Cowan wrote:

 Actually, I'll have to confirm this, but I think that current Wget will
 re-download it, but not overwrite the current content, until it arrives
 at some content corresponding to bytes beyond the current content.

 I need to investigate further to see if this change was somehow
 intentional (though I can't imagine what the reasoning would be); if I
 don't find a good reason not to, I'll revert this behavior.

One reason to keep the current behavior is to retain all of the existing
content in the event of another partial download that is shorter than the
previous one. However, I think that only makes sense if wget is comparing
the new content with what is already on disk.


RE: [PATCH] Enable wget to download from given offset and just a given amount of bytes

2008-10-23 Thread Tony Lewis
Juan Manuel wrote:


 OK, you are right, I'll try to make it better in my free time. I
 supposed that it would have been more polite with one option, but
 thought it was easier with two (and since this is my first
 approach to C I took the easy way) because one option would have
 to deal with two parameters.


It's clearly easier to deal with options that wget is already programmed to
support. For a primer on wget options, take a look at this page on the wiki:


I suspect you will need to add support for a new action (perhaps cmd_range).





RE: A/R matching against query strings

2008-10-22 Thread Tony Lewis
Micah Cowan wrote:

 Would hash really be useful, ever?

Probably not as long as we strip off the hash before we do the comparison.


RE: A/R matching against query strings

2008-10-21 Thread Tony Lewis
Micah Cowan wrote:

 On expanding current URI acc/rej matches to allow matching against query
 strings, I've been considering how we might enable/disable this
 functionality, with an eye toward backwards compatibility.

What about something like --match-type=TYPE (with accepted values of all,
hash, path, search)?

For a URL such as http://example.com/path/to/name.html?a=true#content

all would match against the entire string
hash would match against content
path would match against path/to/name.html
search would match against a=true

For backward compatibility the default should be --match-type=path.

I thought about having host as an option, but that duplicates another


RE: Big files

2008-09-16 Thread Tony Lewis
Cristián Serpell wrote:

 I would like to know if there is a reason for using a signed int for  
 the length of the files to download.

I would like to know why people still complain about bugs that were fixed
three years ago. (More accurately, it was a design flaw that originated from
a time when no computer OS supported files that big, but regardless of what
you call it, the change to wget was made to version 1.10 in 2005.)


RE: Big files

2008-09-16 Thread Tony Lewis
Cristián Serpell wrote:

 Maybe I should have started by this (I had to change the name of the  
 file shown):
 ---response begin---
 HTTP/1.1 200 OK
 Date: Tue, 16 Sep 2008 19:37:46 GMT
 Server: Apache
 Last-Modified: Tue, 08 Apr 2008 20:17:51 GMT
 ETag: 7f710a-8a8e1bf7-47fbd2ef
 Accept-Ranges: bytes
 Content-Length: -1970398217

The problem is not with wget. It's with the Apache server, which told wget
that the file had a negative length.


RE: Wget and Yahoo login?

2008-08-21 Thread Tony Lewis
Micah Cowan wrote:

 The easiest way to do what you want may be to log in using your browser,
 and then tell Wget to use the cookies from your browser, using

Given the frequency of the "login and then download a file" use case, it
should probably be documented on the wiki. (Perhaps it already is. :-)

Also, it would probably be helpful to have a shell script to automate this.


RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-24 Thread Tony Lewis
Coombe, Allan David (DPS) wrote:

 However, the case of the files on disk is still mixed - so I assume that
 wget is not using the URL it originally requested (harvested from the
 HTML?) to create directories and files on disk.  So what is it using? A
 http header (if so, which one??).

I think wget uses the case from the HTML page(s) for the file name; your
proxy would need to change the URLs in the HTML pages to lower case too.


RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread Tony Lewis
mm w wrote:

 a simple url-rewriting conf should fix the problem, without touching the file;
 everything can be done server side

Why do you assume the user of wget has any control over the server from which 
content is being downloaded?

RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-14 Thread Tony Lewis
mm w wrote:

 Hi, after all, after all it's only my point of view :D
 dir/File, non-standard
 Dir/file, non-standard
 and /Dir/File non-standard

According to RFC 2396: "The path component contains data, specific to the 
authority (or the scheme if there is no authority component), identifying the 
resource within the scope of that scheme and authority."

In other words, those names are well within the standard when the server 
understands them. As far as I know, there is nothing in Internet standards 
restricting mixed case paths.

 that's it, if the server manages non-standard URL, it's not my
 concern, for me it doesn't exist

Oh. I see. You're writing to say that wget should only implement features that 
are meaningful to you. Thanks for your narcissistic input.


RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could
 file a feature request at, for an option
 asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.


RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
mm w wrote:

 standard: the URL are case-insensitive

 you can adapt your software because some people don't respect standard,
 we are not anymore in 90's, let people doing crapy things deal with
 their crapy world

You obviously missed the point of the original posting: how can one 
conveniently mirror a site whose server uses case insensitive names onto a 
server that uses case sensitive names.

If the original site is asked for the URI strings /dir/file, /dir/File, /Dir/file, 
and /Dir/File, the same file will be returned for each. However, wget will treat 
those as unique directories and files and you wind up with four copies.

Allan asked if there is a way to have wget just create one copy and proposed 
one way that might accomplish that goal.


RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
Steven M. Schweda wrote:

 From Tony Lewis:
  To have the effect that Allan seeks, I think the option would have to
  convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
 details like name hashing may also need to be adjusted, but this
 shouldn't be too hard.

OK. How would you normalize the names?


RE: retrieval of data from a database

2008-06-10 Thread Tony Lewis
Saint Xavier wrote:

 Well, you'd better escape the '&' in your shell (\&)

It's probably easier to just put quotes around the entire URL than to try to
find all the special characters and put backslashes in front of them.


RE: Not all files downloaded for a web site

2008-01-27 Thread Tony Lewis
Matthias Vill wrote:

 Alexandru Tudor Constantinescu wrote:
  I have the feeling wget is not really able to figure out which files
  to download from some web sites, when css files are used.

 That's right. Up until wget 1.11 (released yesterday) there is no
 support for CSS files in the matter of parsing links out of them. Therefore
 wget will download the CSS file, but not any file referenced only

According to Micah's "Future of Wget" email, CSS support is planned for
1.12. He wrote:

  Support for parsing links from CSS.
 The really big deal here, to me, is CSS. I want to have CSS support for
 Wget ASAP. It's an essential part of the Web, and users definitely
 suffer for the lack of support for it.


RE: Skip certain includes

2008-01-24 Thread Tony Lewis
Wayne Connolly wrote:


 Thanks mate- i know we chatted on IRC but just thought someone

 else may be able to provide some insight.


OK. Here's some insight: wget is essentially a web browser. If the URL
starts with http, then wget sees the exact same content as Internet
Explorer, Firefox, and Opera (except in cases where the server customizes
its content to the user agent - in those cases you may have to tweak the
user agent to see the same content).


If the files are visible to FTP, then try using wget with a URL starting
with ftp instead.  Otherwise, if you want to mirror the files as they
appear on the server, you will have to use something like scp to transfer
the files directly from Server A to Server B.




RE: .1, .2 before suffix rather than after

2007-11-16 Thread Tony Lewis
Hrvoje Niksic wrote:

  And how is .tar.gz renamed?  .tar-1.gz?


OK. I'm responding to the chain and not Hrvoje's expression of pain. :-)

What if we changed the semantics of --no-clobber so the user could specify
the behavior? I'm thinking it could accept the following strings:
- after: append a number after the file name (current behavior)
- before: insert a number before the suffix
- new: change name of new file (current behavior)
- old: change name of old file

With this scheme --no-clobber becomes equivalent to --no-clobber=after,new.
If I want to change where the number appears in the file name or have the
old file renamed then I can specify the behavior I want on the command line
(or in .wgetrc). I think I would change my default to

I think it would be useful to have semantics in .wgetrc where I specify what
I want my --no-clobber default to be without that meaning I want
--no-clobber processing on each invocation. It would be nice if I could say
that I want my default to be before,old, but to only have that apply when
I specify --no-clobber on the command line.

Back to the painful point at the start of this note, I think we treat
.tar.gz as a suffix and if --no-clobber=before is specified, the file name
becomes .1.tar.gz.


RE: Thoughts on Wget 1.x, 2.0 (*LONG!*)

2007-11-02 Thread Tony Lewis
Micah Cowan wrote:

 Keeping a single Wget and using runtime libraries (which we were terming
 plugins) was actually the original concept (there's mention of this in
 the first post of this thread, actually); the issue is that there are
 core bits of functionality (such as the multi-stream support) that are
 too intrinsic to separate into loadable modules, and that, to be done
 properly (and with a minimum of maintenance commitment) would also
 depend on other libraries (that is, doing asynchronous I/O wouldn't
 technically require the use of other libraries, but it can be a lot of
 work to do efficiently and portably across OSses, and there are already
 Free libraries to do that for us).

Perhaps both versions can include multi-threaded support in their core version, 
but the lite version would never invoke multi-threading.


RE: Recursive downloading and post

2007-10-22 Thread Tony Lewis
Micah Cowan wrote

 Stuart Moore wrote:
  Is there any way to get wget to only use the post data for the first
  file downloaded?

 Unfortunately, I'm not sure I can offer much help. AFAICT, --post-file
 and --post-data weren't really designed for use with recursive

Perhaps not, but I can't imagine that there is any scenario where the POST
data should legitimately be sent for anything other than the URL(s) on the
command line.

I'd vote for this being flagged as a bug.


RE: working on patch to limit to percent of bandwidth

2007-10-10 Thread Tony Lewis
Hrvoje Niksic wrote:

 Measuring initial bandwidth is simply insufficient to decide what
 bandwidth is really appropriate for Wget; only the user can know
 that, and that's what --limit-rate does.

The user might be able to make a reasonable guess as to the download rate if
wget reported its average rate at the end of a session. That way the user
can collect rates over time and try to give --limit-rate a reasonable value.


RE: wget + dowbloading AV signature files

2007-09-22 Thread Tony Lewis
Gerard Seibert wrote:

 Is it possible for wget to compare the file named 'AV.hdb'
 located in one directory, and if it is older than the AV.hdb.gz file
 located on the remote server, to download the AV.hdb.gz file to the
 temporary directory?

No, you can only get wget to compare a file of the same name between your
local system and the remote server.

 The only option I have come up with is to keep a copy of the gz file
 in the temporary directory and run wget from there.

You will need to keep the original gz file with a timestamp matching the
server in order for wget to know that the file you have is the same as the
one on the server.

 Unfortunately, at least as far as I can tell, wget does not issue an
 exit code if it has downloaded a newer file.

Better exit codes are on the wish list.

 It would really be nice though if wget simply issued an exit code if
 an updated file were downloaded.

Yes, it would.

 Therefore, I am unable to craft a script that will unpack the file,
 test and install it if a newer version has been downloaded.

Keep one directory that matches the server and another one (or perhaps two)
where you process new files. Before and after wget runs, you can check the
dates on the directory that matches the server. You only need to process
files that changed.

Hope that helps.


RE: wget url with hash # issue

2007-09-06 Thread Tony Lewis
Micah Cowan wrote:

 If you mean that you want Wget to find any file that matches that
 wildcard, well no: Wget can do that for FTP, which supports directory
 listings; it can't do that for HTTP, which has no means for listing
 files in a directory (unless it has been extended, for example with
 WebDAV, to do so).

Seems to me that is a big "unless" because we've all seen lots of websites
that have HTTP directory listings. Apache will do it out of the box (and by
default) if there is no index.htm[l] file in the directory.

Perhaps we could have a feature to grab all or some of the files in a HTTP
directory listing. Maybe something like this could be made to work:


Perhaps we would need an option such as --http-directory (the first thing
that came to mind, but not necessarily the most intuitive name for the
option) to explicitly tell wget how it is expected to behave. Or perhaps it
can just try stripping the filename when doing an http request and wildcards
are specified.

At any rate (with or without the command line option), wget would retrieve the
directory listing and then retrieve any links where the target
matches mc*.gif.

If wget is going to explicitly support HTTP directory listings, it probably
needs to be intelligent enough to ignore the sorting options. In the case of
Apache, that would be links like <A HREF="?N=D">Name</A>.

Anyone have any idea how many different http directory listing formats are
out there?


RE: Overview of the wget source code (command line options)

2007-07-24 Thread Tony Lewis
Himanshu Gupta wrote:


 Thanks Josh and Micah for your inputs.


In addition to whatever Josh and Micah told you, let me add the information
that follows. More than once I have had to relearn how wget deals with
command line options. The last time I did so, I created the HOWTO that
appears below (comments about this information from those in the know on
this list are welcome). I'm happy to collect any other topics that people
want to submit and add them to the file. Perhaps Micah will even be willing
to add it to the repository. :-)


By the way, if your mail reader throws away line breaks, you will want to
restore them. --Tony


To find out what a command line option does:

  Look in src/main.c in the option_data array for the string that corresponds
  to the command line option; the entries are of the form:

    { "option", 'O', TYPE, "data", argtype },

  where "option" is the string you're searching for.

  If the TYPE is OPT_VALUE or OPT_BOOLEAN:

    Note the value of "data". Then look in src/init.c at the commands array
    for an entry that starts with the same "data". These lines are of the
    form:

      { "data", &opt.variable, cmd_TYPE },

    The corresponding line will tell you what variable gets set when that
    option is selected. Now use grep or some other search tool to find out
    where that variable is referenced.

    For example, the --accept option sets the value of opt.accepts, which is
    referenced in ftp.c and utils.c.

  If the TYPE is anything else:

    Look to see how main.c handles that TYPE.

    For example, OPT__APPEND_OUTPUT sets the option named "logfile" and then
    sets the variable append_to_log to true. Searching for append_to_log
    shows that it is only used in main.c. Checking init.c (as described
    above) for the option "logfile" shows that it sets the value of
    opt.lfilename, which is referenced in mswindows.c, progress.c, and
    utils.c.

To add a new command line option:

  The simplest approach is to find an existing option that is close to what
  you want to accomplish and mirror it. You will need to edit the following
  files as described.

  In src/main.c:

    Add a line to the option_data array in the following format:

      { "option", 'O', TYPE, "data", argtype },

    where:

      option   is the long name to be accepted from the command line
      O        is the short name (one character) to be accepted from the
               command line, or '' if there is no short name; a short name
               must only be assigned to one option. Also, there are very
               few short names available and the maintainers are not
               inclined to give them out unless the option is likely to
               be used frequently.
      TYPE     is one of the following standard options:
                 OPT_VALUE    on the command line, the option must be
                              followed by a value that will be stored
                 OPT_BOOLEAN  the option is a boolean value that may appear
                              on the command line as --option for true
                              or --no-option for false
                 OPT_FUNCALL  an internal function will be invoked if the
                              option is selected on the command line
               Note: If one of these choices won't work for your option,
               you can add a new OPT__XXX value to the enum list
               and add special code to handle it in src/main.c.
      data     For OPT_VALUE and OPT_BOOLEAN, the name assigned to the
               option in the commands array defined in src/init.c (see
               below). For OPT_FUNCALL, a pointer to the function to be
               invoked.
      argtype  For OPT_VALUE and OPT_BOOLEAN, use -1. For OPT_FUNCALL use

    NOTE: The options *must* appear in alphabetical order because a binary
    search is used on the list.

    Add the help string to the function print_help as follows:

      N_("  -O,  --option            does something nifty.\n"),

    If there is no short name, put spaces in place of "-O,".

    Select a reasonable place to add the text into the help output in one
    of the existing groups of options: Startup, Logging and input file,
    Download, Directories, HTTP options, HTTPS (SSL/TLS) options,
    FTP options, Recursive download, or Recursive accept/reject.

  In src/options.h:

    Define the variable to receive the value of the option in the options
    structure.

  In src/init.c:

    Add a line to the commands array in the following format:

      { "data", &opt.variable, cmd_TYPE },

    where:

      data      matches the "data" string you entered above in the
                option_data array in src/main.c
      variable  is the variable you defined in the options structure

RE: Problem with combinations of the -O , -p, and -k parameters in wget

2007-07-23 Thread Tony Lewis
Michiel de Boer wrote:

 Is there another way though to achieve the same thing?

You can always run wget and then rename the file afterward. If this happens
often, you might want to write a shell script to handle it. Of course, if you
want all the references to the file to be converted, the script will be a
little more complicated. :-)


RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

 The manpage doesn't need to give as detailed explanations as the info
 manual (though, as it's auto-generated from the info manual, this could
 be hard to avoid); but it should fully describe essential features.

I can't see any good reason for one set of documentation to be different than 
another. Let the user choose whatever is comfortable. Some users may not even 
know they have a choice between man and info.

 While we're on the subject: should we explicitly warn about using such
 features as robots=off, and --user-agent? And what should those warnings
 be? Something like, "Use of this feature may help you download files
 from which wget would otherwise be blocked, but it's kind of sneaky, and
 web site administrators may get upset and block your IP address if they
 discover you using it"?

No, I don't think we should nor do I think use of those features is sneaky.

With regard to robots.txt, people use it when they don't want *automated* 
spiders crawling through their sites. A well-crafted wget command that 
downloads selected information from a site without regard to the robots.txt 
restrictions is a very different situation. It's true that someone could 
--mirror the site while ignoring robots.txt, but even that is legitimate in 
many cases.

With regard to user agent, many websites customize their output based on the 
browser that is displaying the page. If one does not set user agent to match 
their browser, the retrieved content may be very different than what was 
displayed in the browser.

All that being said, it wouldn't hurt to have a section in the documentation on 
wget etiquette: think carefully about ignoring robots.txt, use --wait to 
throttle the download if it will be lengthy, etc.

Perhaps we can even add a --be-nice option similar to --mirror that adjusts 
options to match the etiquette suggestions.


RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

 Don't we already follow typical etiquette by default? Or do you mean
 that to override non-default settings in the rcfile or whatnot?

We don't automatically use a --wait time between requests. I'm not sure what 
other nice options we'd want to make easily available, but there are probably 


RE: Maximum 20 Redirections HELP!!!

2007-07-16 Thread Tony Lewis
Josh Williams wrote:

 Hmm. .org, maybe?

LOL. Do you know how many kewl domain names I had to go through before I
found one that didn't actually exist? Close to a dozen.


RE: 1.11 Release Date: 15 Sept

2007-07-12 Thread Tony Lewis
Noèl Köthe wrote:

 A switch to the new GPL v3 is a not so small change and like samba
 (3.0.x - 3.2) would imho be a good reason for wget 1.2 so everybody
 sees something bigger changed.

There already was a version 1.2 (although the program was called geturl at that
time).

The number scheme could probably use a facelift. Perhaps when we transition to 
2.0, we can add a third digit.


wget on error on Development page

2007-07-07 Thread Tony Lewis
On the Development page, step 1 of the summary is:

1.  Change to the topmost GNU Wget directory:
%  cd wget 

But you need to cd to either wget/trunk or the appropriate version
subdirectory of wget/branches.

RE: wget on Report a Bug

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

 This information is currently in the bug submitting form at Savannah:

That looks good.

 I think perhaps such things as the wget version and operating system
 ought to be emitted by default anyway (except when -q is given).

I'm not convinced that wget should ordinarily emit the operating system. It's 
really only useful to someone other than the person running the command.

 Other than that, what kinds of things would --bug provide above and
 beyond --debug?

It should echo the command line and the contents of .wgetrc to the bug output, 
which even the --debug option does not do. Perhaps we will think of other 
things to include in the output if this option gets added.

However, the big difference would be where the output was directed. When 
invoked as:
wget ... --bug bug_report

all interesting (but sanitized) information would be written to the file 
bug_report whether or not the command included --debug, which would also direct 
the debugging output to STDOUT.

The main reason I had for suggesting this option is that it would be easy to 
tell newbies with problems to run the exact same command with --bug 
bug_report and send the file bug_report to the list (or to whomever is working 
on the problem). The user wouldn't see the command behave any differently, but 
we'd have the information we need to investigate the report.

It might even be that most of us would choose to run with --bug most of the 
time relying on the normal wget output except when something appears to have 
gone wrong and then checking the file when it does.


RE: wget on error on Development page

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

 Done. Lemme know if that works for you.

Looks good

RE: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Tony Lewis
There is a buffer overflow in the following line of the proposed code:

  sprintf(filecopy, "\"%.2047s\"", file);

It should be:

  sprintf(filecopy, "\"%.2045s\"", file);

in order to leave room for the two quotes.

-Original Message-
From: Rich Cook [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 04, 2007 10:18 AM
Subject: bug and patch: blank spaces in filenames causes looping

On OS X, if a filename on the FTP server contains spaces, and the  
remote copy of the file is newer than the local, then wget gets  
thrown into a loop of No such file or directory endlessly.   I have  
changed the following in ftp-simple.c, and this fixes the error.
Sorry, I don't know how to use the proper patch formatting, but it  
should be clear.

the beginning of ftp_retr:

/* Sends RETR command to the FTP server.  */
ftp_retr (int csock, const char *file)
{
  char *request, *respline;
  int nwritten;
  uerr_t err;

  /* Send RETR request.  */
  request = ftp_request ("RETR", file);

becomes:

/* Sends RETR command to the FTP server.  */
ftp_retr (int csock, const char *file)
{
  char *request, *respline;
  int nwritten;
  uerr_t err;
  char filecopy[2048];

  if (file[0] != '"') {
    sprintf(filecopy, "\"%.2047s\"", file);
  } else {
    strncpy(filecopy, file, 2047);
  }

  /* Send RETR request.  */
  request = ftp_request ("RETR", filecopy);

Rich wealthychef Cook
  it takes many small steps to climb a mountain, but the view gets  
better all the time.

RE: Suppressing DNS lookups when using wget, forcing specific IP address

2007-06-18 Thread Tony Lewis
Try: wget --header=Host: --mirror

For example: wget --header=Host: --mirror

-Original Message-
From: Kelly Jones [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 17, 2007 6:10 PM
Subject: Suppressing DNS lookups when using wget, forcing specific IP

I'm moving a site from one server to another, and want to use wget
-m combined w/ diff -auwr to help make sure the site looks the same
on both servers.

My problem: wget -m always downloads the site at its
*current* IP address. Can I tell wget: download, but
pretend the IP address of is
instead of ip.address.of.old.server. In other words, suppress the DNS
lookup for and force it to use a given IP address.

I've considered kludges like using vs, editing /etc/hosts, using a proxy server, etc,
but I'm wondering if there's a clean solution here?

We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.

RE: Question on wget upload/dload usage

2007-06-18 Thread Tony Lewis
Joe Kopra wrote:


 The wget statement looks like:


 wget --post-file=serverdata.mup -o postlog -O survey.html


--post-file does not work the way you want it to; it expects a text file
that contains something like this:



and it sends that raw text to the server in a POST request using a
Content-Type of application/x-www-form-urlencoded. If you run it with -d,
you will see something like this:


POST /someurl HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Connection: Keep-Alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 7

---request end---
[writing POST file data ... done]


To post a file as an argument, you need a Content-Type of
multipart/form-data, which wget does not currently support.



RE: wget bug

2007-05-24 Thread Tony Lewis
Highlord Ares wrote:


 it tries to download web pages named similar to


Since '&' is a reserved character in many command shells, you need to quote
the URL on the command line:






RE: sending Post Data and files

2007-05-09 Thread Tony Lewis
Lara Röpnack wrote:


 1.) How can I send Post Data with Line Breaks? I can not press enter
 and \n or \r or \r\n don't work...

You don't need a line break because parameters are separated by ampersands.

 2.) I don't understand the post file. I can send one file - but I can't give
 the name. Normally I have a form with a form element <input type="file"
 name="xy">. Is it possible to send a file with a name? Is it possible to send
 two files?


On the command line you can use --post-data="a=1&b=2" or you can put the
data into a file. For example, if the file "foo" contains the following


you would use --post-file=foo.


Currently, it is not possible to send files with wget. It does not support
multipart/form-data.



RE: FW: think you have a bug in CSS processing

2007-04-13 Thread Tony Lewis
J.F.Groff wrote:

 Amazingly I found this feature request in a 2003 message to this very
 list. Are there only a few lunatics like me who think this should be

Wget is written and maintained by volunteers. What you need to find is a
lunatic willing to volunteer to write the code to support this feature.


RE: Suggesting Feature: Download anything newer than...

2007-04-07 Thread Tony Lewis
I don't think there is such a feature, but if you're going to add
--not-before, you might as well add --not-after too.

-Original Message-
Sent: Saturday, April 07, 2007 6:27 PM
Subject: Suggesting Feature: Download anything newer than...

I'm a very frequent user of wget, but must admit I haven't
dived too deep into various options - but as far as I can
tell, what I'm about to suggest is not a current feature.
If it is, can somebody tell me how to access it?  0:-)

What I'm suggesting is something similar to -N (check
timestamp and download newer) and may perhaps be used more
as a modifier to -N than a separate option.

I occasionally make a mirror of a certain site with wget, and
then throw it into an archive.  Unfortunately, a few months
(a year) later when I want to catch up with any updates, I either
have to mirror the whole thing again or locate the old archive
and unpack it (and I haven't necessarily preserved the whole
directory structure).

What I would love was the ability to specify (through an option)
an arbitrary timestamp (a date... and perhaps time), and for
only files created/modified after this time to be downloaded (e.g.
the approximate time for the creation of my latest archive).

I envision it as based on the -N option; except that rather
than looking at the time-stamp - or the size or even the
existence - of a local file, it would only compare the remote file's
timestamp to the supplied timestamp - and download if the remote
file was newer.  Of course, it would probably be h*** of a lot worse
to program than just rewriting the -N option.  :-)

It would have to parse links in HTML-files (HTML) or traverse
directories (FTP).

Usually it would be used when no local mirror existed, and then
creating a mirror of just files made after a certain time (it would
of course have to create a dir-structure containing directories
also older than the specified time, but no older files).  However
being able to use it (a specified time) together with the -N or
--mirror option, may also be useful when updating a local mirror
(though I can't actually see when); so perhaps it should be an option
to be used in *companion* with -N (rather than instead of -N)... or
at least let it be *possible* to use it together with -N and --mirror
as well as by itself.


RE: Cannot write to auto-generated file name

2007-04-03 Thread Tony Lewis
Vitaly Lomov wrote:

 It's a file system issue on windows: file path length is limited to
 259 chars.

In which case, wget should do something reasonable (generate an error
message, truncate the file name, etc.). It shouldn't be left as an exercise for
the user to figure out that the automatically generated name cannot be used
by the OS. (My vote is to truncate the name, but it's a lot easier to
generate an error message.)



2007-03-23 Thread Tony Lewis
Bruce [EMAIL PROTECTED] wrote:

 the hostname '' resolves back to an NX domain 

NXDOMAIN is shorthand for non-existent domain. It means the domain name
system doesn't know the IP address of the domain. (It would be like me
having a non-published telephone number; if you know my number, you can call
me, but it won't do you any good to call directory assistance because they
can't tell you my number.)

If your web browser is able to find the site then it should be possible for
wget to find it too. But, since it's not a straightforward DNS lookup,
you'll have to figure out how your browser is pulling off the magic.

One way to do that is to run with a local proxy (such as Achilles) and study
what happens between your browser and the server. If you compare that with
the debug output of wget, you'll have an idea of where the flow is different
and what wget might do to make it work.

I'm sure someone can point out open-source options for the proxy. :-)

Have fun exploring.


RE: wget help on file download

2007-03-01 Thread Tony Lewis
The server told wget that it was going to return 6K:
Content-Length: 6720

From: Smith, Dewayne R. [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 01, 2007 8:05 AM
Subject: wget help on file download

Trying to download a 4MB file; it only retrieves 6K of it.
I've tried without the added --options and it doesn't work.
Can you see any issues below?
C:\Backup_CD\WGET>wget -dv -S --no-http-keep-alive  --ignore-length
--secure-protocol=auto --no-check-certificate  https://
Setting --verbose (verbose) to 1
Setting --server-response (serverresponse) to 1
Setting --http-keep-alive (httpkeepalive) to 0
Setting --ignore-length (ignorelength) to 1
Setting --secure-protocol (secureprotocol) to auto
Setting --check-certificate (checkcertificate) to 0
DEBUG output created by Wget 1.10.2 on Windows.
   = `TR 2004.018 AEGIS TEST PLAN..pdf.4'
Resolving seconds 0.00,
Caching =
Connecting to||:443... seconds 0.00,
Created socket 1932.
Releasing 0x00395228 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 1932 to SSL handle 0x009318c8
  subject: /C=US/O=U.S.
  issuer:  /C=US/O=U.S. Government/OU=ECA/OU=Certification
Authorities/CN=ORC ECA
WARNING: Certificate verification error for self signed
certificate in certificate chain
---request begin---
AN..pdf HTTP/1.0
User-Agent: Wget/1.10.2
Accept: */*
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: Lotus-Domino
Date: Thu, 01 Mar 2007 15:57:55 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 6720
Pragma: no-cache
---response end---
  HTTP/1.1 200 OK
  Server: Lotus-Domino
  Date: Thu, 01 Mar 2007 15:57:55 GMT
  Connection: close
  Expires: Tue, 01 Jan 1980 06:00:00 GMT
  Content-Type: text/html; charset=UTF-8
  Content-Length: 6720
  Pragma: no-cache
Length: ignored [text/html]
[ =
] 6,720 --.--K/s
Closed 1932/SSL 0x9318c8
11:01:08 (309.48 KB/s) - `TR 2004.018 AEGIS TEST PLAN..pdf.4' saved [6720]

Dewayne R. Smith 

SPAWAR Systems Center Charleston 

Code 613, Special Projects Branch 

Office (843) 218-4393

Mobile (843) 696-9472


RE: how to get images into a new directory/filename heirarchy? [GishPuppy]

2007-02-23 Thread Tony Lewis
If it were me, I'd grab all the files to my local drive and then write
scripts to do the moving and renaming.

-Original Message-
Sent: Friday, February 23, 2007 1:33 AM
Subject: how to get images into a new directory/filename heirarchy?


I'm trying to use wget to download 100s of JPGs into a cache server with a
different directory/filename heirarchy. What I tried to do was to create a
text or html file with 1 line for each download (e.g. URL -nd -P [new-path]
-O [new-filename]) and use the --input-file= switch, However, I discovered
that I cannot rename the path/filename of the file inside the input file.

Also, the JPGs will not all come from the same domain but they need to be
placed in a flattened directory tree with different filenames.

Can anyone offer me advice on how to best accomplish this? I'm using the
windows platform.


Gishpuppy | To reply to this email, click here:[EMAIL PROTECTED]

RE: php form

2007-02-22 Thread Tony Lewis
The table stuff just affects what's shown on the user's screen. It's the
input field that affects what goes to the server; in this case, that's
input ... name=country ... so you want to post country=US. If there were
multiple fields, you would separate them with ampersands such as


From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 5:27 PM
To: Tony Lewis;
Subject: Re: php form

Thanks.  I have to log in with username/password, and I think I
know how to do that with wget using POST.  For the actual search page, the
HTML source says it's: 
form action=full_search.php method=POST
However, I'm not clear on how to convey the data for the search.  
The search form has defined a table.  One of the entries, for example, is:
  tdbfont face=ArialSearch by Country:/font/b/td
  tdinput type=text name=country size=50 maxlength=100/td
If I want to use wget to search for entries in the U.S. (US), then how do
I convey this when I post to the php?
Thanks, Alan 

- Original Message - 
From: Tony Lewis mailto:[EMAIL PROTECTED]  
To: 'Alan Thomas' mailto:[EMAIL PROTECTED]  ; 
Sent: Thursday, February 22, 2007 12:53 AM
Subject: RE: php form

Look for form action=some-web-page method=XXX ...
action tells you where the form fields are sent.
method tells you if the server is expecting the data to be sent using a GET
or POST command; GET is the default. In the case of GET, the arguments go
into the URL. If method is POST, follow the instructions in the manual.
Hope that helps.


From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 21, 2007 4:39 PM
Subject: php form

There is a database on a web server (to which I have access) that is
accessible via username/password.  The only way for users to access the
database is to use a form with search criteria and then press a button that
starts a php script that produces a web page with the results of the search.
I have a couple of questions:
1.  Is there any easy way to know exactly what commands are behind the
button, to duplicate them?
2.  If so, then do I just use the POST command as described in the manual,
after logging in (per the manual), to get the data it provides.  
I have used wget just a little, but I am completely new to php.  
Thanks, Alan

RE: SI units

2007-01-15 Thread Tony Lewis
Lars Hamren wrote: 

 Download speeds are reported as K/s, where, I assume, K is short for

 The correct SI prefix for thousand is k, not K:

SI units are for decimal-based numbers (that is powers of 10) whereas
computer programs typically use binary-based numbers (powers of 2). It's
convenient for humans to equate 10^3 (1,000) with 2^10 (1,024) but with
large numbers, these values quickly diverge: 999k or 999 * 10^3 = 999,000,
but 999K or 999 * 2^10 = 1,022,976.

For what it's worth, according to Wikipedia either k or K is acceptable for

RE: SI units

2007-01-15 Thread Tony Lewis
Christoph Anton Mitterer wrote: 

 I don't agree with that,.. SI units like K/M/G etc. are specified by
 international standards and those specify them as 10^x.

 The IEC defined in IEC 60027 symbols for the use with base 2 (e.g. Ki, Mi, Gi).

All of this is described in the Wikipedia article I referenced.

It's true that International Electrotechnical Commission prefers the term
kibibytes and the prefix Ki for 1,024, but it's still not a term commonly
used in computer standards.

Searching there are 1,880 matches for kilobytes and only 2 for
kibibytes and those are both feedback from one individual arguing for the
use of kibibytes instead of kilobytes.

Searching there are 452 matches for kilobytes and only 5 for
kibibytes and even then, the following appears:  `KiB' kibibyte: 2^10 =
1024. `K' is special: the SI prefix is `k' and the IEC 60027-2 prefix is
`Ki', but tradition and POSIX use `k' to mean `KiB'.

It seems odd to me that one would suggest that wget is the place to start
changing the long-established trend of using 'k' for 1,024.

RE: wget question (connect multiple times)

2006-10-17 Thread Tony Lewis
A) This is the list for reporting bugs. Questions should go to

B) wget does not support downloading a file multiple times simultaneously

C) The decreased per-file download time you're seeing is (probably) because
wget is reusing its connection to the server to download the second file. It
takes some time to set up a connection to the server regardless of whether
you're downloading one byte or one gigabyte of data. For small files, the
set up time can be a significant part of the overall download time.

Hope that helps!

-Original Message-
From: t u [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 17, 2006 3:50 PM
Subject: wget question (connect multiple times)


I hope it is okay to drop a question here.

I recently found that if wget downloads one file, my download speed will be
Y, but if wget downloads two separate files (from the same server, doesn't
matter), the download speed for each of the files will be Y (so my network
speed will go up to 2 x Y).

So my question is, can I make wget download the same file multiple times
simultaneously? In a way, it would run as multiple processes and download
parts of the file at the same time, speeding up the download.

Hope I could explain my question, sorry about the bad english.


PS. Please consider this as an enhancement request if wget cannot get a file
by downloading parts of it simultaneously.


RE: I got one bug on Mac OS X

2006-07-16 Thread Tony Lewis
Hrvoje Niksic wrote:

 HTML has been maintained by W3C for many years 

I knew that (but forgot) -- just went to out of habit looking for
Internet specifications.


RE: I got one bug on Mac OS X

2006-07-15 Thread Tony Lewis

I don't think that's valid HTML. According to RFC 1866: "An HTML user
agent should treat end of line in any of its variations as a word space
in all contexts except preformatted text." I don't see any provision for
end of line within the HREF attribute of an A tag.


Sent: Tuesday, July 11, 2006 7:48 AM
To: [EMAIL PROTECTED]
Subject: I got one bug on Mac OS X

Dear Sir/Madam,
while I was trying to download 
using the command:

wget -k -np -r 
-l inf -E

I got most of the files, but lost some of them.

I think I know where the problem is:

if the link is broken into two lines in the index.html:

PLecture 1 (Jan 17): Exploring Conformational Space 
for Biomolecules
A HREF=""">

I will get the following error 

Connecting to[]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character '%0A' in the URL. Maybe the
Windows new line has one more character which is not recognized by Mac OS.

I am using Mac OS X, Tiger (Darwin).


RE: wget - Returning URL/Links

2006-07-10 Thread Tony Lewis
Mauro Tortonesi wrote:

 perhaps we should modify wget in order to print the list of touched
 URLs as well? maybe only in case -v is given? what do you think?

On June 28, 2005, I submitted a patch to write unfollowed links to a file.
It would be pretty simple to have a similar --followed-links option.



2006-07-03 Thread Tony Lewis
RE: BUG

Run the command with -d and post the output here.



From:  Junior + Suporte [mailto:[EMAIL PROTECTED]] 

Sent: Monday, July 03, 2006 2:00 PM


Subject: BUG


I am using wget to send a login request to a site; when wget is saving the cookies, the following error message appears:

Error in Set-Cookie, field `Path'Syntax error in Set-Cookie: tu=661541|802400391

@TERRA.COM.BR; Expires=Thu, 14-Oct-2055 20:52:46 GMT; Path= at position 78.


uv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp [following]

I am trying to access URL[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1

In Internet Explorer, this URL works correctly and the cookie is saved on the local machine, but in WGET, this cookie returns an error. 


Luiz Carlos Zancanella Junior

RE: wget - tracking urls/web crawling

2006-06-23 Thread Tony Lewis
Bruce wrote: 

 any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started working on the feature or not.


RE: Batch files in DOS

2006-06-05 Thread Tony Lewis
I think there is a limit to the number of characters that DOS will accept on
the command line (perhaps around 256). Try putting echo in front of the
command in your batch file and see how much of it gets echoed back to you.
As Tobias suggested, you can try moving some of your command line options
into the .wgetrc file.

-Original Message-
Sent: Saturday, June 03, 2006 2:46 PM
Subject: Batch files in DOS

I'm trying to mirror about 100 servers (small fanfic sites) using wget
--recursive --level=inf,, some_address However,
when I run the batch file, it stops reading after a while; apparently my
command has too many characters.  Is there some other way I should be doing
this, or a workaround?

GNU Wget 1.10.1 running on Windows 98


RE: I cannot get the images

2006-05-15 Thread Tony Lewis
The problem is your accept list; -A*.* says to accept any file that contains
at least one dot in the file name and
GetFile?id=DBJOHNUNZIOCSBMOMKRUconvert=image%2Fgifscale=3 doesn't contain
any dots.

I think you want to accept all files so just delete -A*.* from your argument
list because the default behavior is to accept everything.

-Original Message-
From: matis [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 15, 2006 6:09 AM
Subject: I cannot get the images

I'm trying to get a whole directory, but images from the database are ignored.
If you paste the address below this post into the browser (or even FlashGet) it
will download the image and open it with a default extension .gif . But wget
decides the file should be removed and then removes it :/ . As a result, when
there's a picture on every page (with the address as below) only empty htmls
are downloaded. Does anybody know what to do?

The address (with the wget command used by me):
wget --cache=off -p -m -erobots=off -t10 -v -A*.*
whole html address (broken):


RE: wget post-data/cookie problem

2006-05-04 Thread Tony Lewis
Erich Steinboeck wrote:

 Is there a way to trace the browser traffic and compare
 that to the wget traffic, to see where they differ.

You can use a web proxy. I like Achilles: 


RE: Defining url in .wgetrc

2006-04-20 Thread Tony Lewis
ks wrote: 

 Just one more question.
 Something like this inside somefile.txt
 -r -o gnulog

Why not use a batch file or command script (depending on what OS you're
using) containing something like:

wget -r -o gnulog
wget -S


RE: Windows Title Bar

2006-04-18 Thread Tony Lewis
Hrvoje Niksic wrote:

 Anyway, adding further customizations to an already questionnable feature
 is IMHO not a very good idea. 

Perhaps Derek would be happy if there were a way to turn off this
questionable feature.


RE: dose wget auto-convert the downloaded text file?

2006-04-16 Thread Tony Lewis
18 mao [EMAIL PROTECTED] wrote:

 then  save the page as 2.html with the FireFox browser

You should not assume that the file saved by any browser is the same as the
file delivered to the browser by the server. The browser is probably
manipulating line endings to match the conventions on your operating system
when it saves files so that CR, CR-LF, or LF, all become CR-LF (or whatever
your OS uses for line endings).


RE: download of images linkes in css does not work

2006-04-13 Thread Tony Lewis
It's not a bug; it's a (missing) feature. 
-Original Message-
From: Detlef Girke [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 13, 2006 3:17 AM
Subject: download of images linkes in css does not work

I tried everything, but images, built in via CSS are neither downloaded nor
related with wget.
Example (inline style):
CSS-Terms like

div style=background-image : 
url(/files/inc/image/pjpeg/hintergrund_startseite.jpg); id=pfad 

do not have any effect on the downloaded web-page.

The same thing happens, when you write

{background-image : url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);}

into a css-file.

Maybe other references in CSS do not work either. Perhaps you can prove

If you could fix this problem, wget would be the best tool for me.
Thank you and best regards

Detlef Girke, BIK Hamburg, Beratung, Tests und Workshops c/o DIAS GmbH,
Neuer Pferdemarkt 1, 20359 Hamburg [EMAIL PROTECTED],, 040 43187513,Fax 040 43187519

RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

 no. i was talking about regexps. they are more expressive
 and powerful than simple globs. i don't see what's the
 point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter=-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter=-domain:www-* :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:


Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ,r means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who know what they're doing are mostly subscribed to this
mailing list; the rest of them send us mail saying please CC me as I'm not
on the list. :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such


becoming the same as:


or even:


Should the glob match path/to/sub/dir? (I suspect it shouldn't.)


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

 But that misses the point, which is that we *want* to make the
 more expressive language, already used elsewhere on Unix, the

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ,r complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ,r is not requested.
That's a straightforward transformation.


RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't see a clear line that connects --filter to glob patterns as used
 by the shell.

I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf

Note that *.pdf is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.


RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)


--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar

I think there are good arguments for and against matching against the file
name in path:


RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote:

 Also any way to add modifiers to the regexs? 

Perhaps --filter=path,i:/path/to/krs would work.


RE: Bug in ETA code on x64

2006-03-28 Thread Tony Lewis
Hrvoje Niksic wrote:

 The cast to int looks like someone was trying to remove a warning and
 botched operator precedence in the process.

I can't see any good reason to use the comma operator here. Why not write the line as:
  eta_hrs = eta / 3600; eta %= 3600;

This makes it much less likely that someone will make a coding error while
editing that section of code.


RE: wget option (idea for recursive ftp/globbing)

2006-03-02 Thread Tony Lewis
Mauro Tortonesi wrote: 

 i would like to read other users' opinion before deciding which
 course of action to take, though.

Other users have suggested adding a command line option for -a two or
three times in the past:

- 2002-11-24: Steve Friedl [EMAIL PROTECTED] submitted a patch
- 2002-12-24: Maaged Mazyek [EMAIL PROTECTED] submitted a patch
- 2005-05-09: B Wooster [EMAIL PROTECTED] asked if the fix was ever
going to be implemented
- 2005-08-19: Carl G. Ponder [EMAIL PROTECTED] asked if the patches
were going to be applied
- 2005-08-20: Hrvoje responded by posting his own patch for --list-options

(and that's just what I can find in my local archive searching for list

There is clearly a need among the user community for a feature like this and
lots of ideas about how to implement it. I'd say you should pick one and
implement it.

If you need copies of any of the patches mentioned in the list above, let me know.


RE: wget 1.10.x fixed recursive ftp download over proxy

2006-01-10 Thread Tony Lewis
Here's what your version of the code said:

if ((opt.recursive || opt.page_requisites)
     && ((url_scheme (*t) != SCHEME_FTP) ||
         (opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))

which means (for the bit after the &&):

              FTP    not FTP
  proxy        T        T
  no proxy     F        T

Regardless of whether it is FTP, the condition will always succeed if
use_proxy is true. Therefore, a simpler way of writing the expression is:
(url_scheme (*t) != SCHEME_FTP) || opt.use_proxy

You're right that I shouldn't have moved opt.use_proxy with the other
command line options. My revised suggestion is:

if ((opt.recursive || opt.page_requisites)
     && ((url_scheme (*t) != SCHEME_FTP) || opt.use_proxy))
  status = retrieve_tree (*t);
else
  status = retrieve_url
    (*t, filename, redirected_URL, NULL, dt);

-Original Message-
Sent: Tuesday, January 10, 2006 5:06 PM
To: Tony Lewis
Subject: Re: wget 1.10.x fixed recursive ftp download over proxy

Your simplified code may not work. The intention of patching is to make wget
invoke retrieve_tree funtion when it IS FTP and uses proxy, while your
code works when it is NOT FTP and uses proxy.

On 1/10/06, Tony Lewis [EMAIL PROTECTED] wrote:

 I believe the following simplified code would have the same effect:

 if ((opt.recursive || opt.page_requisites || opt.use_proxy)
     && url_scheme (*t) != SCHEME_FTP)
   status = retrieve_tree (*t);
 else
   status = retrieve_url
     (*t, filename, redirected_URL, NULL, dt);



RE: wget 1.10.x fixed recursive ftp download over proxy

2006-01-09 Thread Tony Lewis

I believe the following simplified code would have the same effect:

if ((opt.recursive || opt.page_requisites || opt.use_proxy)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url
    (*t, filename, redirected_URL, NULL, dt);

From: [mailto:[EMAIL PROTECTED]] On Behalf Of CHEN Peng
Sent: Monday, January 09, 2006 12:38 AM
To: [EMAIL PROTECTED]
Subject: wget 1.10.x fixed recursive ftp download over proxy
We once encountered an annoying problem recursively downloading FTP data
using wget through an ftp-over-http proxy. Previously it was the proxy firmware
that did not support recursive downloads, but even after upgrading we realized
there is a problem with wget itself as well.
We found that with the new proxy firmware, the older wget 1.7.x can download an
FTP database recursively, but the newer versions (1.9.x and 1.10.x) can not. That
means there must be something wrong with the code.
I also confirmed this has been a known bug in wget since 2003, and it is strange
it has not been fixed in such a long time.
To fix this problem, I took some time to analyze its code, and it turns out wget
uses a different method to get the list of files for a destination folder when
trying to do a recursive download. For normal FTP, it uses the FTP command "LIST"
to get the file listing. For normal HTTP, it uses its internal method
"retrieve_tree()" to generate the lists.
In main.c, it does not use the retrieve_tree() function to generate the list if
the traffic is FTP. However, when we use an ftp-over-http proxy, the actual
request to the server is an HTTP request, where the "LIST" FTP command won't
work, so we only get one "index.html" file.
if ((opt.recursive || opt.page_requisites)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url
    (*t, filename, redirected_URL, NULL, dt);
In this scenario, we need to modify the code to force wget to call the
retrieve_tree function for FTP traffic if a proxy is involved:
if ((opt.recursive || opt.page_requisites)
//  && url_scheme (*t) != SCHEME_FTP)
    && ((url_scheme (*t) != SCHEME_FTP) ||
        (opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))
  status = retrieve_tree (*t);
else
  status = retrieve_url
    (*t, filename, redirected_URL, NULL, dt);
After patching main.c, the new wget works perfectly for FTP recursive
downloading, both with proxy and without proxy. This patch works for 1.9.x
and 1.10.x up to the latest version so far (1.10.2).

-- 
CHEN Peng [EMAIL PROTECTED]

RE: spaces in pathnames using --directory-prefix=prefix

2005-11-30 Thread Tony Lewis
Jonathan DeGumbia wrote:
 I'm trying to use the --directory-prefix=prefix option for wget on a
 Windows system.  My prefix has spaces in the path directories.  Wget
 appears to terminate the path at the first space encountered.   In other
 words if my prefix is: c:/my prefix/   then wget copies files to c:/my/ .

 Is there a work-around for this?

wget is not terminating the path at the command line delimiter, Windows is.
In the same way that you have to enter:
dir "c:\my prefix"
to list the contents of the directory, you have to enter:
wget --directory-prefix="c:/my prefix"
or the command processor will split the directory path at the space before
passing it to wget.


RE: Error connecting to target server

2005-11-11 Thread Tony Lewis

 Thanks for your reply. Only ping works for and not wget.

When I issue the command wget, it successfully downloads the
following file:

META HTTP-EQUIV=Refresh content=0; URL=;
TITLEBritish Broadcasting Corporation /TITLE

You might want to try wget;.

I think the FAQ should have another
question: Why did my download fail and how can I get it to work? In the
answer to that question we should mention all the common failure modes:
disallowed by robots.txt, need to set user agent to look like a browser,
META refresh (as above), etc. along with the command line options to resolve
the failure.

Also, perhaps the next version of wget can handle META refresh.


RE: wget can't handle large files

2005-10-18 Thread Tony Lewis
Eberhard Wolff wrote: 

 Apparently wget can't handle large files.
 wget --version GNU Wget 1.8.2

This bug was fixed in version 1.10 of wget. You should obtain a copy of
the latest version, 1.10.2.


RE: Wget patches for .files

2005-08-19 Thread Tony Lewis
Mauro Tortonesi wrote: 

 this is a very interesting point, but the patch you mentioned above uses
 LIST -a FTP command, which AFAIK is not supported by all FTP servers.

As I recall, that's why the patch was not accepted. However, it would be
useful if there were some command line option to affect the LIST parameters.
Perhaps something like:

wget --ftp-list=-a


RE: wget a file with long path on Windows XP

2005-07-21 Thread Tony Lewis
PoWah Wong wrote: 

 The login page is:

 How to figure out the login command?

 These two commands do not work:

 wget --save-cookies cookies.txt [snip]
 wget --save-cookies cookies.txt [snip]

When trying to recreate a form in wget, you have to send the data the server
is expecting to receive to the location the server is expecting to receive
it. You have to look at the login page for the login form and recreate it.
In your browser, view the source to
and you will find the form that appears below. Note that I stripped out
formatting information for the table that contains the form and reformatted
what was left to make it readable.

<form action=JVXSL.asp method=post>
  <input type=hidden name=s value=1>
  <input type=hidden name=o value=1>
  <input type=hidden name=b value=1>
  <input type=hidden name=t value=1>
  <input type=hidden name=f value=1>
  <input type=hidden name=c value=1>
  <input type=hidden name=u value=1>
  <input type=hidden name=r value=>
  <input type=hidden name=l value=1>
  <input type=hidden name=g value=>
  <input type=hidden name=n value=1>
  <input type=hidden name=d value=1>
  <input type=hidden name=a value=0>
  <input tabindex=1 name=usr id=usr type=text value= size=12>
  <input name=pwd id=pwd tabindex=1 type=password value=>
  <input type=checkbox tabindex=1 name=savepwd id=savepwd value=1>
  <input type=image name=Login src=images/btn_login.gif alt=Login
width=40 height=16 border=0 tabindex=1 align=absmiddle>
</form>

Note that the server expects the data to be posted to JVXSL.asp and that
there are a bunch of fields that must be supplied in order for the server to
process the login request. In addition, the two fields you supply are called
usr and pwd. So your first wget command line will look something like

wget --save-cookies cookies.txt;
[EMAIL PROTECTED]pwd=123savepwd=1

Hope that helps!


RE: connect to server/request multiple pages

2005-07-21 Thread Tony Lewis

Pat Malatack wrote:

 is there a way to stay connected, because it seems to me that this takes a
 decent amount of time that could be minimized

The following command will do what you want:

wget "" 


RE: Invalid directory names created by wget

2005-07-08 Thread Tony Lewis
Larry Jones wrote: 

 Of course it's directly accessible -- you just have to quote it to keep
 shell from processing the parentheses:

   cd 'title.Die-Struck+(Gold+on+Gold)+Lapel+Pins'

You can also make the individual characters into literals:

cd title.Die-Struck+\(Gold+on+Gold\)+Lapel+Pins


Name or service not known error

2005-06-27 Thread Tony Lewis

I got a "Name or service not known" error from wget 1.10 running on Linux.
When I installed an earlier version of wget, it worked just fine. It also
works just fine on version 1.10 running on Windows. Any ideas?

Here's the output:

wget --version
GNU Wget 1.9-beta1

=> `index.html.8'
Resolving, to []:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45,166

45,166 166.21K/s

17:30:01 (166.17 KB/s) - `index.html.8' saved

snip output from make install of 1.10 here

wget --version
GNU Wget 1.10

=> `index.html.9'
Resolving failed: Name or service not known.

RE: Removing thousand separators from file size output

2005-06-24 Thread Tony Lewis
Hrvoje Niksic wrote: 

 In fact, I know of no application that accepts numbers as Wget prints
 them.

Microsoft Calculator does.


RE: Is it just that the -m (mirror) option an impossible task [Was: wget 1.91 skips most files]

2005-05-28 Thread Tony Lewis
Maurice Volaski wrote:

 wget's -m option seems to be able to ignore most of the files it should
 download from a site. Is this simply because wget can download only the
 files it can see? That is, if the web server's directory indexing option
 is off and a page on the site is present on the server, but it isn't
 referenced by any publicly viewable page, wget simply can't see it. 

I've been thinking about coding a --extra-sensory-perception option that
would cause wget to read the mind of the server so that it can download
files that it cannot see. As soon as I get the algorithm worked out, I'll be
submitting the patch. So far I've figured out how to download index.html
without being able to see it, but I'm sure that if I keep working at it that
wget will be able to detect the rest of the files it cannot see. Of course,
I could just be taking the wrong approach; it may work better if I try to
implement the --psychic option instead.


RE: links conversion; non-existent index.html

2005-05-01 Thread Tony Lewis
Andrzej wrote:

 Two problems:

 There is no index.html under this link:
 it creates a non-existent link:

When you specify a directory, it is up to the web server to determine what
resource gets returned. Some web servers will return a directory listing,
some will return some file (such as index.html), and others will return an
error.
For example, Apache might return (in this order): index.html, index.htm, a
directory listing (or a 403 Forbidden response if the configuration
disallows directory listings). The actual list of files that Apache will
search for and the order in which they are selected is determined by the
server configuration.
If the web server returns any information, wget has to save the information
that is returned in *some* local file. It chooses to name that local file
index.html since it has no way of knowing where the information might have
actually been stored on the server.

Hope that helps,


RE: SSL options

2005-04-21 Thread Tony Lewis
Hrvoje Niksic wrote:

 The question is what should we do for 1.10?  Document the
 unreadable names and cryptic values, and have to support
 them until eternity?

My vote is to change them to more reasonable syntax (as you suggested
earlier in the note) for 1.10 and include the new syntax in the
documentation. However, I think wget should continue to support the old
options and syntax as alternatives in case people have included them in
existing scripts.

RE: newbie question

2005-04-14 Thread Tony Lewis
Alan Thomas wrote:

 I am having trouble getting the files I want using a wildcard specifier...

There are no options on the command line for what you're attempting to do.

Neither wget nor the server you're contacting understands *.pdf in a URI.
In the case of wget, it is designed to read web pages (HTML files) and then
collect a list of resources that are referenced in those pages, which it
then retrieves. In the case of the web server, it is designed to return
individual objects on request (X.pdf or Y.pdf, but not *.pdf). Some web
servers will return a list of files if you specify a directory, but you
already tried that in your first use case.

Try coming at this from a different direction. If you were going to manually
download every PDF from that directory, how would YOU figure out the names
of each one? Is there a web page that contains a list somewhere? If so,
point wget there.

Hope that helps.


PS) Jens was mistaken when he said that https requires you to log into the
server. Some servers may require authentication before returning information
over a secure (https) channel, but that is not a given.
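The "find the index page" approach above can be sketched in a few lines of Python: scrape the .pdf links out of an index page you have already downloaded, then feed the list to `wget -i`. The HTML snippet here is made up for illustration. (Note that when the PDFs are linked from a page wget can see, wget's own -A pdf accept list does this filtering for you.)

```python
from html.parser import HTMLParser

# Collect hrefs that end in .pdf from an already-downloaded index page.
class PdfLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(value)

finder = PdfLinkFinder()
# Hypothetical index-page fragment:
finder.feed('<a href="X.pdf">X</a> <a href="notes.html">notes</a> <a href="Y.PDF">Y</a>')
print(finder.links)  # ['X.pdf', 'Y.PDF']
```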

RE: File rejection is not working

2005-04-06 Thread Tony Lewis
Jens Rösner wrote: 

 AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is
 not implemented at the moment.

It seems all the examples people are sending are just attempting to get a
match that is not case sensitive. A switch to ignore case in the file name
match would be a lot easier to implement than regular expressions and solve
the most pressing need.

Just a thought.
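The proposed ignore-case switch amounts to lowercasing both the file name and the accept/reject pattern before the wildcard match; a small sketch (not wget's actual code):

```python
from fnmatch import fnmatch

# Case-insensitive wildcard match: fold both sides to lowercase first.
def matches_ignore_case(filename, pattern):
    return fnmatch(filename.lower(), pattern.lower())

print(matches_ignore_case("Photo.JPG", "*.jpg"))  # True
print(matches_ignore_case("Photo.JPG", "*.gif"))  # False
```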


RE: help!!!

2005-03-21 Thread Tony Lewis
The --post-data option was added in version 1.9. You need to upgrade your
version of wget. 

-Original Message-
From: Richard Emanilov [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 21, 2005 8:49 AM
Subject: RE: help!!!

wget --http-user=login --http-passwd=passwd
--post-data=login=login&password=passwd https://site

wget: unrecognized option `--post-data=login=login&password=password'
Usage: wget [OPTION]... [URL]... 

wget --http-user=login --http-passwd=passwd
--http-post=login=login&password=password https://site
wget: unrecognized option `--http-post=login=login&password=passwd'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

wget -V
GNU Wget 1.8.2

Richard Emanilov

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 10:26 AM
Cc: Richard Emanilov
Subject: RE: help!!!

Richard Emanilov wrote:

 Below is what I have tried with no success

 wget --http-user=login --http-passwd=passwd

That should be:
wget --http-user=login --http-passwd=passwd


RE: Curb maximum size of headers

2005-03-17 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't see how and why a web site would generate headers (not bodies, to
 be sure) larger than 64k.

To be honest, I'm less concerned about the 64K header limit than I am about
limiting a header line to 4096 bytes. I don't know any sites that send back
header lines that long, but they could. Who's to say some site doesn't have
a 4K cookie?

Since you're already proposing to limit the entire header to 64K, what is
gained by adding this second limit?


RE: one bug?

2005-03-04 Thread Tony Lewis

Jesus Legido wrote:

 I'm getting a file from

The problem is not with wget. The file on the server starts with 0xFF 0xFE.
Put the following into an HTML file (say temp.html) on your hard drive, open
it in your web browser, right click on the link and do a "Save As..." to
your hard drive. You will get the same thing as wget does.

<html><body><a href="">ea.txt</a></body></html>

RE: wget: question about tag

2005-02-02 Thread Tony Lewis
Normand Savard wrote:

 I have a question about wget.  Is it possible to download attribute
 values other than the hardcoded ones?

No, at least not in the existing versions of wget. I have not heard that
anyone is working on such an enhancement.

RE: new string module

2005-01-05 Thread Tony Lewis
Mauro Tortonesi wrote:

 At 18:28 on Wednesday, 5 January 2005, Dražen Kačar wrote:
  Jan Minar wrote:
   What's wrong with mbrtowc(3) and friends?  The mysterious solution 
   is probably to use wprintf(3) instead of printf(3).  Couple of 
   questions on #c on freenode would give you that answer.
  Historically, wget source was written in a way which allowed one to 
  compile it on really old systems. That would rule out C95 functions.
  (I'm not advocating this approach, just answering the question.)
 as long as i am the maintainer of wget, backward compatibility on very old or 
 legacy systems will NOT be broken.

I don't think it has to be an either/or situation. With well-selected #if 
statements, you should be able to have something that works on legacy systems 
while still providing wide character support on more modern operating systems.

I'm not volunteering to determine what those #if statements might be :-) ... 
just pointing out the possibility.


RE: Metric units

2004-12-23 Thread Tony Lewis
John J Foerch wrote:

 It seems that the system of using the metric prefixes for numbers 2^n is a
 simple accident of history.  Any thoughts on this?

I would say that the practice of using powers of 10 for K and M is a
response to people who cannot think in binary.


RE: Metric units

2004-12-23 Thread Tony Lewis
Carlos Villegas snidely wrote:

 I would say that the original poster understands what he is saying, and
you clearly don't...

I'll put my computer science degree up against your business administration
and accounting degree any day.

A kilobyte has always been 1024 bytes and the choice was not accidental.
Computer memory is laid out in bits, which are always powers of two.

There are 10 kinds of people in the world; those who understand binary and
those who don't.
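For what it's worth, the gap between the two conventions is easy to quantify (a quick sketch, not from the thread):

```python
# Binary (powers of two) versus SI (powers of ten) readings of K and M.
K_BINARY, K_SI = 2**10, 10**3
M_BINARY, M_SI = 2**20, 10**6

print(K_BINARY - K_SI)  # 24 bytes per "kilobyte"
print(M_BINARY - M_SI)  # 48576 bytes per "megabyte"
```

The discrepancy grows with each larger prefix, which is why the choice of convention matters more for file sizes than it once did.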


RE: Metric units

2004-12-23 Thread Tony Lewis
Mark Post wrote: 

 While we're at it, why don't we just round off the value of pi to be 3.0

Do you live in Indiana?

Actually, Dr. Edwin Goodwin wanted to round off pi to any of several values
including 3.2.


RE: date based retrieval

2004-12-19 Thread Tony Lewis
Anthony Caetano wrote:

 I am looking for a way to stay current without mirroring an entire site.

 Does anyone else see a use for this?

Yes. Here's my non-wget solution. I truncate all the files in the
directories that I don't want, but maintain the date/time accessed and
modified. The Perl script I use follows. --Tony


$dir = shift @ARGV;
die "No such directory: $dir\n" unless -d $dir;

$nFiles = 0;
my @dirs = ($dir);
while (scalar(@dirs)) {
  processDir(shift @dirs);
}
print "Truncated $nFiles files.\n";

sub processDir {
  my ($dir) = @_;
  opendir DIR, $dir or die "Cannot read directory: $dir\n";
  foreach $file (readdir DIR) {
    next if $file eq '.' || $file eq '..';

    $path = "$dir/$file";
    if (-d $path) {
      push @dirs, $path;
    } else {
      my @stat = stat($path);
      open F, ">$path" or next;         # opening for write truncates the file
      close F;
      utime $stat[8], $stat[9], $path;  # restore access/modify times
      $nFiles++;
    }
  }
  closedir DIR;
}

Re: wput mailing list

2004-08-29 Thread Tony Lewis
Justin Gombos wrote:
Since I feel that computers serve man, not the reverse, I don't
intend to change my file organization to be web page centric.  Looking
around the web, I was quite surprised to find that I'm the only one
with this problem.  I was very relieved to find that there was a wput
- then disappointed to find that wput doesn't reverse one of the most
important capabilities of wget; that is, the ability to selectively
Sounds like you've got a legitimate feature request for wput. You should 
submit it to *their* mailing list. Even better, borrow the relevant code 
from wget and upgrade wput yourself.


Re: Stratus VOS support

2004-07-28 Thread Tony Lewis
Jonathan Grubb wrote:

 Any thoughts of adding support for Stratus VOS file structures?

Your question is a little too vague -- even for me (I used to work for
Stratus and actually know what VOS is :-)

What file structures are you needing supported that wget does not currently
support? Are you needing support when wget is running on Stratus and saving
files that it downloads from somewhere else? Or are you needing support for
VOS file structures when wget is retrieving a file from a VOS system? If the
latter, does this support only apply if both systems are VOS?


Re: Stratus VOS support

2004-07-28 Thread Tony Lewis
Jonathan Grubb wrote:

 Um. I'm using wget on Win2000 to ftp to a VOS machine. I'm finding that
 the usual '>' sign for directories isn't supported by wget and that '/'
 doesn't work either, I think because the ftp server itself is expecting '>'.

The problem may be that Win 2000 grabs the '>' before wget ever sees it. Try
putting the path in quotes.


Re: retrieve a whole website with image embedded in css

2004-07-13 Thread Tony Lewis
Ploc wrote:

 The result is a website very different from the original one as it lacks
 the images embedded in the css.

 Can you please confirm if what I think is true (or not), if it is
 registered as a bug, and if there is a date planning to correct it.

It is true. wget only retrieves objects that appear in the HTML. It does not
parse the CSS or JavaScript used by a site.
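As a workaround, the url(...) references can be pulled out of a downloaded stylesheet and handed to `wget -i`. A minimal sketch (the stylesheet text is made up for illustration; wget itself does no such parsing):

```python
import re

# Hypothetical stylesheet content.
css = 'body { background: url("img/bg.png"); } h1 { background: url(logo.gif); }'

# Extract the targets of url(...) references, with or without quotes.
urls = re.findall(r'''url\(\s*["']?([^"')\s]+)["']?\s*\)''', css)
print(urls)  # ['img/bg.png', 'logo.gif']
```

Relative URLs would still need to be resolved against the stylesheet's location before feeding them to wget.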


Re: retrieve a whole website with image embedded in css

2004-07-13 Thread Tony Lewis
Ploc wrote:

 Is it already registered as a bug or in a wishlist?

It's not a bug. This feature has been on the wishlist for a long time.


Re: retrieve a whole website with image embedded in css

2004-07-13 Thread Tony Lewis
Ploc wrote:

  It's not a bug. This feature has been on the wishlist for a long time.
 Ok, thank you.
 Do you know if this wish will be included in the next release ?

I'm not aware that anyone is working on it. Volunteers?


Re: question on wget via http proxy

2004-07-12 Thread Tony Lewis
Malte Schünemann wrote:

 Since wget is able to obtain directory listings / retrieve data from
 there, it should be possible to also upload data

Then it would be wput. :-)

 What is so special about wget that it is able to perform this task?

You can learn a LOT about how wget is communicating with the target site by
using the --debug argument.

Hope that helps a little.


Re: Escaping semicolons (actually Ampersands)

2004-06-28 Thread Tony Lewis
Phil Endecott wrote:

 Tony> The stuff between the quotes following HREF is not HTML; it
 Tony> is a URL. Hence, it must follow URL rules not HTML rules.

 No, it's both a URL and HTML.  It must follow both rules.

 Please see the page that I cited in my previous message:

I've looked at hundreds of web pages and I've never seen anyone put &amp;
into an HREF in place of an ampersand.


Re: Escaping semicolons

2004-06-27 Thread Tony Lewis
Phil Endecott wrote:

 There is not much to go on in terms of specifications.  The closest is
 RFC1738, which includes BNF for a file: URI.  However it is ten years
 old, so whether it reflects current practice I do not know.  But it does
 not allow ; in file: URIs.

 I conclude from this that wget should be replacing ; with its %3b escape
 sequence.

I think you're confusing what wget is required to do with URLs entered on
the command line and what it chooses to do with the resulting files that it
saves. If an unencoded name of a retrieved resource cannot be stored on the
local file system, wget encodes it to create a valid name.

 Tony Lewis wrote:
   I use semicolons in CGI URIs to separate parameters.  (Ampersand
   is more often used for this, but semicolon is also allowed and
   has the advantage that there is no need to escape it in HTML.)
  There is no need to escape ampersands either.

 Tony, are you suggesting that this is legal HTML?

   <a href="foo.cgi?p1=v1&p2=v2">Foo</a>

 I'm fairly confident that you need to escape the & to make it valid, i.e.

   <a href="foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Just out of curiosity, did you try to implement your theory and see what
happens? If you did, you would find that the first version works and the
second does not.

By the way, the correct URI encoding of ampersand is %26, not &amp;. The
latter encoding is used for ampersands in HTML markup.

With regard to whether ampersand needs to be encoded, you're misreading the
RFC. It says:
   Many URL schemes reserve certain characters for a special meaning:
   their appearance in the scheme-specific part of the URL has a
   designated semantics. If the character corresponding to an octet is
   reserved in a scheme, the octet must be encoded.  The characters ;,
   /, ?, :, @, = and & are the characters which may be
   reserved for special meaning within a scheme. No other characters may
   be reserved within a scheme.

   Usually a URL has the same interpretation when an octet is
   represented by a character and when it encoded. However, this is not
   true for reserved characters: encoding a character reserved for a
   particular scheme may change the semantics of a URL.

The RFC says that you have to escape reserved characters if the character
appears in the name of the resource you're trying to retrieve. That is, if
you're trying to retrieve a file named a&b.txt, you refer to that file as
a%26b.txt in the URL because you're using the ampersand for a non-reserved
purpose.
If you're using a reserved character for the purpose that it has been
reserved (in this case, separating parameters), you do NOT want to encode
it. The URL you proposed (after correcting the encoding of the ampersand) is
requesting a resource (probably a file) whose name is foo.cgi?p1=v1&p2=v2.
It is NOT requesting that the script foo.cgi be executed with argument p1
having a value of v1 and p2 having a value of v2.

Hope that helps.
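The distinction can be checked with Python's urllib (a sketch, not part of wget):

```python
from urllib.parse import quote, urlencode

# An '&' appearing *inside* a name must be percent-encoded...
assert quote("a&b.txt", safe="") == "a%26b.txt"

# ...but an '&' *separating* CGI parameters is a reserved character
# doing its reserved job, and is left as-is.
assert urlencode({"p1": "v1", "p2": "v2"}) == "p1=v1&p2=v2"
print("ok")
```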


Re: Escaping semicolons

2004-06-24 Thread Tony Lewis
Phil Endecott wrote:

 I am using wget to build a downloadable zip file for offline viewing of
 a CGI-intensive web site that I am building.

 Essentially it works, but I am encountering difficulties with semicolons.
 I use semicolons in CGI URIs to separate parameters.  (Ampersand is more
 often used for this, but semicolon is also allowed and has the advantage
 that there is no need to escape it in HTML.)

There is no need to escape ampersands either.

