RE: wget re-download fully downloaded files

2008-10-27 Thread Tony Lewis
Micah Cowan wrote:

 Actually, I'll have to confirm this, but I think that current Wget will
 re-download it, but not overwrite the current content, until it arrives
 at some content corresponding to bytes beyond the current content.

 I need to investigate further to see if this change was somehow
 intentional (though I can't imagine what the reasoning would be); if I
 don't find a good reason not to, I'll revert this behavior.

One reason to keep the current behavior is to retain all of the existing
content in the event of another partial download that is shorter than the
previous one. However, I think that only makes sense if wget is comparing
the new content with what is already on disk.

Tony




RE: [PATCH] Enable wget to download from given offset and just a given amount of bytes

2008-10-23 Thread Tony Lewis
Juan Manuel wrote:

 

 OK, you are right, I'll try to make it better on my free time. I
 supposed that it would have been more polite with one option, but
 thought it was easier with two (and since this is my first
 approach to C I took the easy way) because one option would have
 to deal with two parameters.

 

It's clearly easier to deal with options that wget is already programmed to
support. For a primer on wget options, take a look at this page on the wiki:
http://wget.addictivecode.org/OptionsHowto

 

I suspect you will need to add support for a new action (perhaps cmd_range).
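Just to illustrate the shape of such an action (the name, signature, and parsing format below are purely hypothetical and do not match init.c's real cmd_* conventions), it could boil down to a small setter that splits one value into the two numbers:

  /* Hypothetical sketch of a cmd_range setter: parse "OFFSET,LENGTH"
     into two numbers.  Invented for illustration only. */
  #include <stdio.h>

  static int
  cmd_range (const char *val, long long *offset, long long *length)
  {
    return sscanf (val, "%lld,%lld", offset, length) == 2;
  }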

 

Tony

 

 



RE: A/R matching against query strings

2008-10-22 Thread Tony Lewis
Micah Cowan wrote:

 Would hash really be useful, ever?

Probably not as long as we strip off the hash before we do the comparison.

Tony




RE: A/R matching against query strings

2008-10-21 Thread Tony Lewis
Micah Cowan wrote:

 On expanding current URI acc/rej matches to allow matching against query
 strings, I've been considering how we might enable/disable this
 functionality, with an eye toward backwards compatibility.

What about something like --match-type=TYPE (with accepted values of all,
hash, path, search)?

For the URL http://www.domain.com/path/to/name.html?a=true#content

all would match against the entire string
hash would match against content
path would match against path/to/name.html
search would match against a=true

For backward compatibility the default should be --match-type=path.

I thought about having host as an option, but that duplicates another
option.

Tony



RE: Big files

2008-09-16 Thread Tony Lewis
Cristián Serpell wrote:

 I would like to know if there is a reason for using a signed int for  
 the length of the files to download.

I would like to know why people still complain about bugs that were fixed
three years ago. (More accurately, it was a design flaw that originated from
a time when no computer OS supported files that big, but regardless of what
you call it, the change to wget was made to version 1.10 in 2005.)

Tony




RE: Big files

2008-09-16 Thread Tony Lewis
Cristián Serpell wrote:

 Maybe I should have started by this (I had to change the name of the  
 file shown):
[snip]
 ---response begin---
 HTTP/1.1 200 OK
 Date: Tue, 16 Sep 2008 19:37:46 GMT
 Server: Apache
 Last-Modified: Tue, 08 Apr 2008 20:17:51 GMT
 ETag: 7f710a-8a8e1bf7-47fbd2ef
 Accept-Ranges: bytes
 Content-Length: -1970398217

The problem is not with wget. It's with the Apache server, which told wget
that the file had a negative length.
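For what it's worth, -1970398217 is exactly what you get when a length of about 2.2 GB is stored in a 32-bit signed integer (2,324,569,079, which also happens to be 0x8a8e1bf7, the middle field of the ETag above). A toy illustration, with the size made up to produce that value:

  #include <stdio.h>

  int main (void)
  {
    long long real_size = 2324569079LL;  /* ~2.2 GB, hypothetical */
    int truncated = (int) real_size;     /* assumes a 32-bit int */
    printf ("%d\n", truncated);          /* prints -1970398217 */
    return 0;
  }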

Tony



RE: Wget and Yahoo login?

2008-08-21 Thread Tony Lewis
Micah Cowan wrote:

 The easiest way to do what you want may be to log in using your browser,
 and then tell Wget to use the cookies from your browser, using

Given the frequency of the "log in and then download a file" use case, it
should probably be documented on the wiki. (Perhaps it already is. :-)

Also, it would probably be helpful to have a shell script to automate this.
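Something along these lines, perhaps (a rough, untested sketch; the URLs, form field names, and cookie file are placeholders, and some sites will need cookies exported from the browser rather than a scripted login):

  #!/bin/sh
  # Log in once, keeping the session cookies, then download with them.
  wget --save-cookies cookies.txt --keep-session-cookies \
       --post-data 'user=USERNAME&passwd=PASSWORD' \
       -O /dev/null 'http://example.com/login'
  wget --load-cookies cookies.txt 'http://example.com/protected/file.zip'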

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-24 Thread Tony Lewis
Coombe, Allan David (DPS) wrote:

 However, the case of the files on disk is still mixed - so I assume that
 wget is not using the URL it originally requested (harvested from the
 HTML?) to create directories and files on disk.  So what is it using? A
 http header (if so, which one??).

I think wget uses the case from the HTML page(s) for the file name; your
proxy would need to change the URLs in the HTML pages to lower case too.

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-19 Thread Tony Lewis
mm w wrote:

 a simple url-rewriting conf should fix the problem, without touching the file
 system; everything can be done server side

Why do you assume the user of wget has any control over the server from which 
content is being downloaded?



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-14 Thread Tony Lewis
mm w wrote:

 Hi, after all, after all it's only my point of view :D
 anyway,
 
 /dir/file,
 dir/File, non-standard
 Dir/file, non-standard
 and /Dir/File non-standard

According to RFC 2396: "The path component contains data, specific to the
authority (or the scheme if there is no authority component), identifying the
resource within the scope of that scheme and authority."

In other words, those names are well within the standard when the server 
understands them. As far as I know, there is nothing in Internet standards 
restricting mixed case paths.

 that's it, if the server manages non-standard URL, it's not my
 concern, for me it doesn't exist

Oh. I see. You're writing to say that wget should only implement features that 
are meaningful to you. Thanks for your narcissistic input.

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
Micah Cowan wrote:

 Unfortunately, nothing really comes to mind. If you'd like, you could
 file a feature request at
 https://savannah.gnu.org/bugs/?func=additemgroup=wget, for an option
 asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.
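As a minimal sketch of the disk-name half of that (the function name is invented; real code would also have to handle the hash lookups mentioned above):

  #include <ctype.h>

  /* Lower-case a local file name in place just before it is used to
     create directories and files on disk. */
  static void
  lowercase_filename (char *name)
  {
    for (; *name; name++)
      *name = tolower ((unsigned char) *name);
  }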

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
mm w wrote:

 standard: the URL are case-insensitive

 you can adapt your software because some people don't respect standard,
 we are not anymore in 90's, let people doing crapy things deal with
 their crapy world

You obviously missed the point of the original posting: how can one 
conveniently mirror a site whose server uses case insensitive names onto a 
server that uses case sensitive names.

If the original site has the URI strings /dir/file, /dir/File, /Dir/file,
and /Dir/File, the same local file will be returned for all of them. However,
wget will treat those as unique directories and files and you will wind up
with four copies.

Allan asked if there is a way to have wget just create one copy and proposed 
one way that might accomplish that goal.

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
Steven M. Schweda wrote:

 From Tony Lewis:
  To have the effect that Allan seeks, I think the option would have to
  convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
 details like name hashing may also need to be adjusted, but this
 shouldn't be too hard.

OK. How would you normalize the names?

Tony



RE: retrieval of data from a database

2008-06-10 Thread Tony Lewis
Saint Xavier wrote:

 Well, you'd better escape the '&' in your shell (\&)

It's probably easier to just put quotes around the entire URL than to try to
find all the special characters and put backslashes in front of them.

Tony




RE: Not all files downloaded for a web site

2008-01-27 Thread Tony Lewis
Matthias Vill wrote:

 Alexandru Tudor Constantinescu wrote:
  I have the feeling wget is not really able to figure out which files
  to download from some web sites, when css files are used.

 That's right. Up until wget 1.11 (released yesterday) there is no
 support for CSS files in the matter of parsing links out of them. Therefore
 wget will download the CSS file, but not any file referenced only there.

According to Micah's "Future of Wget" email, CSS support is planned for
1.12. He wrote:

 1.12
 - 
  Support for parsing links from CSS.
[snip]
 The really big deal here, to me, is CSS. I want to have CSS support for
 Wget ASAP. It's an essential part of the Web, and users definitely
 suffer for the lack of support for it.

Tony



RE: Skip certain includes

2008-01-24 Thread Tony Lewis
Wayne Connolly wrote:

 

 Thanks mate- i know we chatted on IRC but just thought someone

 else may be able to provide some insight.

 

OK. Here's some insight: wget is essentially a web browser. If the URL
starts with http, then wget sees the exact same content as Internet
Explorer, Firefox, and Opera (except in cases where the server customizes
its content to the user agent - in those cases you may have to tweak the
user agent to see the same content).

 

If the files are visible to FTP, then try using wget with a URL starting
with ftp instead.  Otherwise, if you want to mirror the files as they
appear on the server, you will have to use something like scp to transfer
the files directly from Server A to Server B.

 

Tony

 



RE: .1, .2 before suffix rather than after

2007-11-16 Thread Tony Lewis
Hrvoje Niksic wrote:

  And how is .tar.gz renamed?  .tar-1.gz?

 Ouch.

OK. I'm responding to the chain and not Hrvoje's expression of pain. :-)

What if we changed the semantics of --no-clobber so the user could specify
the behavior? I'm thinking it could accept the following strings:
- after: append a number after the file name (current behavior)
- before: insert a number before the suffix
- new: change name of new file (current behavior)
- old: change name of old file

With this scheme --no-clobber becomes equivalent to --no-clobber=after,new.
If I want to change where the number appears in the file name or have the
old file renamed then I can specify the behavior I want on the command line
(or in .wgetrc). I think I would change my default to
--no-clobber=before,old.

I think it would be useful to have semantics in .wgetrc where I specify what
I want my --no-clobber default to be without that meaning I want
--no-clobber processing on each invocation. It would be nice if I could say
that I want my default to be before,old, but to only have that apply when
I specify --no-clobber on the command line.

Back to the painful point at the start of this note: I think we treat
.tar.gz as a suffix, so if --no-clobber=before is specified, the file name
becomes name.1.tar.gz rather than name.tar-1.gz.

Tony



RE: Thoughts on Wget 1.x, 2.0 (*LONG!*)

2007-11-02 Thread Tony Lewis
Micah Cowan wrote:

 Keeping a single Wget and using runtime libraries (which we were terming
 plugins) was actually the original concept (there's mention of this in
 the first post of this thread, actually); the issue is that there are
 core bits of functionality (such as the multi-stream support) that are
 too intrinsic to separate into loadable modules, and that, to be done
 properly (and with a minimum of maintenance commitment) would also
 depend on other libraries (that is, doing asynchronous I/O wouldn't
 technically require the use of other libraries, but it can be a lot of
 work to do efficiently and portably across OSses, and there are already
 Free libraries to do that for us).

Perhaps both versions can include multi-threaded support in their core version, 
but the lite version would never invoke multi-threading.

Tony



RE: Recursive downloading and post

2007-10-22 Thread Tony Lewis
Micah Cowan wrote

 Stuart Moore wrote:
  Is there any way to get wget to only use the post data for the first
  file downloaded?

 Unfortunately, I'm not sure I can offer much help. AFAICT, --post-file
 and --post-data weren't really designed for use with recursive
 downloading.

Perhaps not, but I can't imagine that there is any scenario where the POST
data should legitimately be sent for anything other than the URL(s) on the
command line.

I'd vote for this being flagged as a bug.

Tony



RE: working on patch to limit to percent of bandwidth

2007-10-10 Thread Tony Lewis
Hrvoje Niksic wrote:

 Measuring initial bandwidth is simply insufficient to decide what
 bandwidth is really appropriate for Wget; only the user can know
 that, and that's what --limit-rate does.

The user might be able to make a reasonable guess as to the download rate if
wget reported its average rate at the end of a session. That way the user
can collect rates over time and try to give --limit-rate a reasonable value.

Tony



RE: wget + dowbloading AV signature files

2007-09-22 Thread Tony Lewis
Gerard Seibert wrote:

 Is it possible for wget to compare the file named 'AV.hdb'
 located in one directory, and if it is older than the AV.hdb.gz file
 located on the remote server, to download the AV.hdb.gz file to the
 temporary directory?

No, you can only get wget to compare a file of the same name between your
local system and the remote server.

 The only option I have come up with is to keep a copy of the gz file
 in the temporary directory and run wget from there.

You will need to keep the original gz file with a timestamp matching the
server in order for wget to know that the file you have is the same as the
one on the server.

 Unfortunately, at least as far as I can tell, wget does not issue an
 exit code if it has downloaded a newer file.

Better exit codes are on the wish list.

 It would really be nice though if wget simply issued an exit code if
 an updated file were downloaded.

Yes, it would.

 Therefore, I am unable to craft a script that will unpack the file,
 test and install it if a newer version has been downloaded.

Keep one directory that matches the server and another one (or perhaps two)
where you process new files. Before and after wget runs, you can check the
dates on the directory that matches the server. You only need to process
files that changed.
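A rough sketch of that setup (untested; the paths and URL are placeholders):

  #!/bin/sh
  MIRROR=/var/av/mirror        # kept in sync with the server by wget -N
  WORK=/var/av/work            # where new files get unpacked and tested

  cd "$MIRROR" || exit 1
  before=`ls -l AV.hdb.gz 2>/dev/null`
  wget -N http://example.com/AV.hdb.gz
  after=`ls -l AV.hdb.gz 2>/dev/null`

  if [ "$before" != "$after" ]; then
      cp AV.hdb.gz "$WORK"/ && gunzip -f "$WORK/AV.hdb.gz"
      # ...test and install $WORK/AV.hdb here...
  fi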

Hope that helps.

Tony




RE: wget url with hash # issue

2007-09-06 Thread Tony Lewis
Micah Cowan wrote:

 If you mean that you want Wget to find any file that matches that
 wildcard, well no: Wget can do that for FTP, which supports directory
 listings; it can't do that for HTTP, which has no means for listing
 files in a directory (unless it has been extended, for example with
 WebDAV, to do so).

Seems to me that is a big "unless" because we've all seen lots of websites
that have HTTP directory listings. Apache will do it out of the box (and by
default) if there is no index.htm[l] file in the directory.

Perhaps we could have a feature to grab all or some of the files in an HTTP
directory listing. Maybe something like this could be made to work:

wget http://www.exelana.com/images/mc*.gif

Perhaps we would need an option such as --http-directory (the first thing
that came to mind, but not necessarily the most intuitive name for the
option) to explicitly tell wget how it is expected to behave. Or perhaps it
can just try stripping the filename when doing an http request and wildcards
are specified.

At any rate (with or without the command line option), wget would retrieve
http://www.exelana.com/images/ and then retrieve any links where the target
matches mc*.gif.
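In the meantime, something close to that can already be approximated with the recursive machinery, assuming the server actually produces an index page for the directory:

  wget -r -l1 --no-parent -A 'mc*.gif' http://www.exelana.com/images/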

If wget is going to explicitly support HTTP directory listings, it probably
needs to be intelligent enough to ignore the sorting options. In the case of
Apache, that would be things like <A HREF="?N=D">Name</A>.

Anyone have any idea how many different http directory listing formats are
out there?

Tony



RE: Overview of the wget source code (command line options)

2007-07-24 Thread Tony Lewis
Himanshu Gupta wrote:

 

 Thanks Josh and Micah for your inputs.

 

In addition to whatever Josh and Micah told you, let me add the information
that follows. More than once I have had to relearn how wget deals with
command line options. The last time I did so, I created the HOWTO that
appears below (comments about this information from those in the know on
this list are welcome). I'm happy to collect any other topics that people
want to submit and add them to the file. Perhaps Micah will even be willing
to add it to the repository. :-)

 

By the way, if your mail reader throws away line breaks, you will want to
restore them. --Tony

 

To find out what a command line option does:

  Look in src/main.c in the option_data array for the string that corresponds
  to the command line option; the entries are of the form:

    { "option", 'O', TYPE, "data", argtype },

  where you're searching for "option".

  If TYPE is OPT_BOOLEAN or OPT_VALUE:

    Note the value of "data". Then look in init.c at the commands array for
    an entry that starts with the same "data". These lines are of the form:

      { "data", &opt.variable, cmd_TYPE },

    The corresponding line will tell you what variable gets set when that
    option is selected. Now use grep or some other search tool to find out
    where the variable is referenced.

    For example, the --accept option sets the value of opt.accepts, which is
    referenced in ftp.c and utils.c.

  If the TYPE is anything else:

    Look to see how main.c handles that TYPE.

    For example, OPT__APPEND_OUTPUT sets the option named "logfile" and then
    sets the variable append_to_log to true. Searching for append_to_log
    shows that it is only used in main.c. Checking init.c (as described above)
    for the option "logfile" shows that it sets the value of opt.lfilename,
    which is referenced in mswindows.c, progress.c, and utils.c.


To add a new command line option:

  The simplest approach is to find an existing option that is close to what
  you want to accomplish and mirror it. You will need to edit the following
  files as described.

  src/main.c
    Add a line to the option_data array in the following format:

      { "option", 'O', TYPE, "data", argtype },

    where:
      option   is the long name to be accepted from the command line
      O        is the short name (one character) to be accepted from the
               command line, or '\0' if there is no short name; the short
               name must only be assigned to one option. Also, there are
               very few short names available and the maintainers are not
               inclined to give them out unless the option is likely to
               be used frequently.
      TYPE     is one of the following standard options:
                 OPT_VALUE    on the command line, the option must be
                              followed by a value that will be stored
                              ?somewhere?
                 OPT_BOOLEAN  the option is a boolean value that may appear
                              on the command line as --option for true
                              or --no-option for false
                 OPT_FUNCALL  an internal function will be invoked if the
                              option is selected on the command line
               Note: If one of these choices won't work for your option,
               you can add a new OPT__XXX value to the enum list and add
               special code to handle it in src/main.c.
      data     For OPT_VALUE and OPT_BOOLEAN, the name assigned to the
               option in the commands array defined in src/init.c (see
               below). For OPT_FUNCALL, a pointer to the function to be
               invoked.
      argtype  For OPT_VALUE and OPT_BOOLEAN, use -1. For OPT_FUNCALL, use
               no_argument.

    NOTE: The options *must* appear in alphabetical order because a binary
    search is used for the list.

  src/main.c
    Add the help string to the function print_help as follows:

      N_("\
  -O,  --option              does something nifty.\n"),

    If the short name is '\0', put spaces in place of "-O,".

    Select a reasonable place to add the text into the help output in one
    of the existing groups of options: Startup, Logging and input file,
    Download, Directories, HTTP options, HTTPS (SSL/TLS) options,
    FTP options, Recursive download, or Recursive accept/reject.

  src/options.h
    Define the variable to receive the value of the option in the options
    structure.

  src/init.c
    Add a line to the commands array in the following format:

      { "data", &opt.variable, cmd_TYPE },

    where:
      data      matches the "data" string you entered above in the
                option_data array in src/main.c
      variable  is the variable you defined in 

RE: Problem with combinations of the -O , -p, and -k parameters in wget

2007-07-23 Thread Tony Lewis
Michiel de Boer wrote:

 Is there another way though to achieve the same thing?

You can always run wget and then rename the file afterward. If this happens
often, you might want to write a shell script to handle it. Of course, if you
want all the references to the file to be converted, the script will be a
little more complicated. :-)

Tony



RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

 The manpage doesn't need to give as detailed explanations as the info
 manual (though, as it's auto-generated from the info manual, this could
 be hard to avoid); but it should fully describe essential features.

I can't see any good reason for one set of documentation to be different than 
another. Let the user choose whatever is comfortable. Some users may not even 
know they have a choice between man and info.

 While we're on the subject: should we explicitly warn about using such
 features as robots=off, and --user-agent? And what should those warnings
 be? Something like, "Use of this feature may help you download files
 from which wget would otherwise be blocked, but it's kind of sneaky, and
 web site administrators may get upset and block your IP address if they
 discover you using it"?

No, I don't think we should, nor do I think use of those features is sneaky.

With regard to robots.txt, people use it when they don't want *automated* 
spiders crawling through their sites. A well-crafted wget command that 
downloads selected information from a site without regard to the robots.txt 
restrictions is a very different situation. It's true that someone could 
--mirror the site while ignoring robots.txt, but even that is legitimate in 
many cases.

With regard to user agent, many websites customize their output based on the 
browser that is displaying the page. If one does not set user agent to match 
their browser, the retrieved content may be very different than what was 
displayed in the browser.

All that being said, it wouldn't hurt to have a section in the documentation on 
wget etiquette: think carefully about ignoring robots.txt, use --wait to 
throttle the download if it will be lengthy, etc.

Perhaps we can even add a --be-nice option similar to --mirror that adjusts 
options to match the etiquette suggestions.
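Roughly what such an option might expand to, using flags that already exist (the exact values are just a guess at reasonable defaults):

  wget --mirror --wait=2 --random-wait --limit-rate=100k http://example.com/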

Tony



RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

 Don't we already follow typical etiquette by default? Or do you mean
 that to override non-default settings in the rcfile or whatnot?

We don't automatically use a --wait time between requests. I'm not sure what 
other nice options we'd want to make easily available, but there are probably 
more.

Tony



RE: Maximum 20 Redirections HELP!!!

2007-07-16 Thread Tony Lewis
Josh Williams wrote:

 Hmm. .org, maybe?

LOL. Do you know how many kewl domain names I had to go through before I
found one that didn't actually exist? Close to a dozen.

Tony



RE: 1.11 Release Date: 15 Sept

2007-07-12 Thread Tony Lewis
Noèl Köthe wrote:

 A switch to the new GPL v3 is a not so small change and like samba
 (3.0.x - 3.2) would imho be a good reason for wget 1.2 so everybody
 sees something bigger changed.

There already was a version 1.2 (although the program was called geturl at that 
time).

The number scheme could probably use a facelift. Perhaps when we transition to 
2.0, we can add a third digit.

Tony



wget on gnu.org: error on Development page

2007-07-07 Thread Tony Lewis
On http://www.gnu.org/software/wget/wgetdev.html, step 1 of the summary is:

1.  Change to the topmost GNU Wget directory:
%  cd wget 

But you need to cd to either wget/trunk or the appropriate version
subdirectory of wget/branches.



RE: wget on gnu.org: Report a Bug

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

 This information is currently in the bug submitting form at Savannah:

That looks good.

 I think perhaps such things as the wget version and operating system
 ought to be emitted by default anyway (except when -q is given).

I'm not convinced that wget should ordinarily emit the operating system. It's 
really only useful to someone other than the person running the command.

 Other than that, what kinds of things would --bug provide above and
 beyond --debug?

It should echo the command line and the contents of .wgetrc to the bug output, 
which even the --debug option does not do. Perhaps we will think of other 
things to include in the output if this option gets added.

However, the big difference would be where the output was directed. When 
invoked as:
wget ... --bug bug_report

all interesting (but sanitized) information would be written to the file 
bug_report whether or not the command included --debug, which would also direct 
the debugging output to STDOUT.

The main reason I had for suggesting this option is that it would be easy to 
tell newbies with problems to run the exact same command with --bug 
bug_report and send the file bug_report to the list (or to whomever is working 
on the problem). The user wouldn't see the command behave any differently, but 
we'd have the information we need to investigate the report.

It might even be that most of us would choose to run with --bug most of the 
time relying on the normal wget output except when something appears to have 
gone wrong and then checking the file when it does.

Tony



RE: wget on gnu.org: error on Development page

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

 Done. Lemme know if that works for you.

Looks good




RE: bug and patch: blank spaces in filenames causes looping

2007-07-05 Thread Tony Lewis
There is a buffer overflow in the following line of the proposed code:

 sprintf(filecopy, "\"%.2047s\"", file);

It should be:

 sprintf(filecopy, "\"%.2045s\"", file);

in order to leave room for the two quotes.
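An alternative sketch that sidesteps the size arithmetic altogether is C99 snprintf, which truncates its output to the buffer size, quotes included (wrapped in a hypothetical helper here just to keep the fragment self-contained):

  #include <stdio.h>

  /* Quote an FTP file name into a fixed-size buffer without manual
     length bookkeeping; output is silently truncated if too long. */
  static void
  quote_filename (char filecopy[2048], const char *file)
  {
    snprintf (filecopy, 2048, "\"%s\"", file);
  }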

Tony
-Original Message-
From: Rich Cook [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 04, 2007 10:18 AM
To: [EMAIL PROTECTED]
Subject: bug and patch: blank spaces in filenames causes looping

On OS X, if a filename on the FTP server contains spaces, and the
remote copy of the file is newer than the local, then wget gets
thrown into a loop of "No such file or directory" endlessly.  I have
changed the following in ftp-simple.c, and this fixes the error.
Sorry, I don't know how to use the proper patch formatting, but it
should be clear.

==
the beginning of ftp_retr:
=
/* Sends RETR command to the FTP server.  */
uerr_t
ftp_retr (int csock, const char *file)
{
   char *request, *respline;
   int nwritten;
   uerr_t err;

   /* Send RETR request.  */
   request = ftp_request ("RETR", file);

==
becomes:
==
/* Sends RETR command to the FTP server.  */
uerr_t
ftp_retr (int csock, const char *file)
{
   char *request, *respline;
   int nwritten;
   uerr_t err;
   char filecopy[2048];
   if (file[0] != '"') {
     sprintf(filecopy, "\"%.2047s\"", file);
   } else {
     strncpy(filecopy, file, 2047);
   }

   /* Send RETR request.  */
   request = ftp_request ("RETR", filecopy);






--
Rich wealthychef Cook
925-784-3077
--
  it takes many small steps to climb a mountain, but the view gets  
better all the time.



RE: Suppressing DNS lookups when using wget, forcing specific IP address

2007-06-18 Thread Tony Lewis
Try: wget http://ip.of.new.sitename --header="Host: sitename.com" --mirror

For example: wget http://66.233.187.99 --header="Host: google.com" --mirror

Tony
-Original Message-
From: Kelly Jones [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 17, 2007 6:10 PM
To: wget@sunsite.dk
Subject: Suppressing DNS lookups when using wget, forcing specific IP
address

I'm moving a site from one server to another, and want to use wget
-m combined w/ diff -auwr to help make sure the site looks the same
on both servers.

My problem: wget -m sitename.com always downloads the site at its
*current* IP address. Can I tell wget: download sitename.com, but
pretend the IP address of sitename.com is ip.address.of.new.server
instead of ip.address.of.old.server. In other words, suppress the DNS
lookup for sitename.com and force it to use a given IP address.

I've considered kludges like using old.sitename.com vs
new.sitename.com, editing /etc/hosts, using a proxy server, etc,
but I'm wondering if there's a clean solution here?

-- 
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.



RE: Question on wget upload/dload usage

2007-06-18 Thread Tony Lewis
Joe Kopra wrote:

 

 The wget statement looks like:

 

 wget --post-file=serverdata.mup -o postlog -O survey.html

   http://www14.software.ibm.com/webapp/set2/mds/mds

 

--post-file does not work the way you want it to; it expects a text file
that contains something like this:

a=1&b=2

 

and it sends that raw text to the server in a POST request using a
Content-Type of application/x-www-form-urlencoded. If you run it with -d,
you will see something like this:

 

POST /someurl HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Host: www.exelana.com
Connection: Keep-Alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 7

---request end---
[writing POST file data ... done]

 

To post a file as an argument, you need a Content-Type of
multipart/form-data, which wget does not currently support.
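For comparison, a file upload would have to look roughly like this on the wire (the field name and boundary are placeholders chosen for illustration):

  POST /someurl HTTP/1.0
  Content-Type: multipart/form-data; boundary=AaB03x

  --AaB03x
  Content-Disposition: form-data; name="fieldname"; filename="serverdata.mup"
  Content-Type: application/octet-stream

  ...raw file contents...
  --AaB03x--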

 

Tony



RE: wget bug

2007-05-24 Thread Tony Lewis
Highlord Ares wrote:

 

 it tries to download web pages named similar to
 http://site.com?variable=yes&mode=awesome

 

Since & is a reserved character in many command shells, you need to quote
the URL on the command line:

 

wget "http://site.com?variable=yes&mode=awesome"

 

Tony

 



RE: sending Post Data and files

2007-05-09 Thread Tony Lewis
Lara Röpnack wrote:

 

 1.) How can I send Post Data with Line Breaks? I can not press enter
 and \n or \r or \r\n don't work...

 

You don't need a line break because parameters are separated by ampersands:
a=1&b=2


 2.) I don't understand the post file. I can send one file, but I can't give
 the name. Normally I have a form with a form element <input type="file"
 name="xy">. Is it possible to send a file with a name? Is it possible to
 send two files?

 

On the command line you can use --post-data="a=1&b=2" or you can put the
data into a file. For example, if the file "foo" contains the following
string:

a=1&b=2

you would use --post-file=foo.

 

Currently, it is not possible to send files with wget. It does not support
multipart/form-data.

 

Tony



RE: FW: think you have a bug in CSS processing

2007-04-13 Thread Tony Lewis
J.F.Groff wrote:

 Amazingly I found this feature request in a 2003 message to this very
 mailing list. Are there only a few lunatics like me who think this should
 be included?

Wget is written and maintained by volunteers. What you need to find is a
lunatic willing to volunteer to write the code to support this feature
request.

Tony



RE: Suggesting Feature: Download anything newer than...

2007-04-07 Thread Tony Lewis
I don't think there is such a feature, but if you're going to add
--not-before, you might as well add --not-after too.

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, April 07, 2007 6:27 PM
To: wget@sunsite.dk
Subject: Suggesting Feature: Download anything newer than...

I'm a very frequent user of wget, but must admit I haven't
dived too deep into various options - but as far as I can
tell, what I'm about to suggest is not a current feature.
If it is, can somebody tell me how to access it?  0:-)

What I'm suggesting is something similar to -N (check
timestamp and download newer) and may perhaps be used more
as a modifier to -N than a separate option.

I occasionally make a mirror of a certain site with wget, and
then throw it into an archive.  Unfortunately, a few months
(a year) later when I want to catch up with any updates, I either
have to mirror the whole thing again or locate the old archive
and unpack it (and I haven't necessarily preserved the whole
directory structure).

What I would love was the ability to specify (through an option)
an arbitrary timestamp (a date... and perhaps time), and for
only files created/modified after this time to be downloaded (e.g.
the approximate time for the creation of my latest archive).

I am envisioning it as based on the -N option, except that rather
than looking at the time-stamp - or the size, or even the
existence - of a local file, it would only compare the remote file's
timestamp to the supplied timestamp - and download if the remote
file was newer.  Of course, it would probably be h*** of a lot worse
to program than just rewriting the -N option.  :-)

It would have to parse links in HTML-files (HTML) or traverse
directories (FTP).

Usually it would be used when no local mirror existed, and then
creating a mirror of just files made after a certain time (it would
of course have to create a dir-structure containing directories
also older than the specified time, but no older files).  However
being able to use it (a specified time) together with the -N or
--mirror option, may also be useful when updating a local mirror
(though I can't actually see when); so perhaps it should be an option
to be used in *companion* with -N (rather than instead of -N)... or
at least let it be *possible* to use it together with -N and --mirror
as well as by itself.

-Koppe



RE: Cannot write to auto-generated file name

2007-04-03 Thread Tony Lewis
Vitaly Lomov wrote:

 It's a file system issue on windows: file path length is limited to
 259 chars.

In which case, wget should do something reasonable (generate an error
message, truncate the file name, etc.). It shouldn't be left as an exercise for
the user to figure out that the automatically generated name cannot be used
by the OS. (My vote is to truncate the name, but it's a lot easier to
generate an error message.)

Tony



RE: Huh?...NXDOMAINS

2007-03-23 Thread Tony Lewis
Bruce [EMAIL PROTECTED] wrote:

 the hostname 'ga13.gamesarena.com.au' resolves back to an NX domain 

NXDOMAIN is shorthand for "non-existent domain." It means the domain name
system doesn't know the IP address of the domain. (It would be like me
having a non-published telephone number; if you know my number, you can call
me, but it won't do you any good to call directory assistance because they
can't tell you my number.)

If your web browser is able to find the site then it should be possible for
wget to find it too. But, since it's not a straightforward DNS lookup,
you'll have to figure out how your browser is pulling off the magic.

One way to do that is to run with a local proxy (such as Achilles) and study
what happens between your browser and the server. If you compare that with
the debug output of wget, you'll have an idea of where the flow is different
and what wget might do to make it work.

I'm sure someone can point out open-source options for the proxy. :-)

Have fun exploring.

Tony



RE: wget help on file download

2007-03-01 Thread Tony Lewis
The server told wget that it was going to return 6K:
 
Content-Length: 6720
  _  

From: Smith, Dewayne R. [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 01, 2007 8:05 AM
To: [EMAIL PROTECTED]
Subject: wget help on file download


Trying to download a 4mb file. it only retrieves 6k of it.
I've tried without the added --options and it doesn't work.
 
Can you see any issues below?
Thx!
 
 
C:\Backup_CD\WGET>wget -dv -S --no-http-keep-alive --ignore-length
--secure-protocol=auto --no-check-certificate
https://server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
Setting --verbose (verbose) to 1
Setting --server-response (serverresponse) to 1
Setting --http-keep-alive (httpkeepalive) to 0
Setting --ignore-length (ignorelength) to 1
Setting --secure-protocol (secureprotocol) to auto
Setting --check-certificate (checkcertificate) to 0
DEBUG output created by Wget 1.10.2 on Windows.
 
--11:01:08--
https://server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
           => `TR 2004.018 AEGIS TEST PLAN..pdf.4'
Resolving server2.csci-va.com... seconds 0.00, 65.207.33.26
Caching server2.csci-va.com = 65.207.33.26
Connecting to server2.csci-va.com|65.207.33.26|:443... seconds 0.00,
connected.
Created socket 1932.
Releasing 0x00395228 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 1932 to SSL handle 0x009318c8
certificate:
  subject: /C=US/O=U.S.
Government/OU=ECA/OU=ORC/OU=CSCI/CN=server2.csci-va.com
  issuer:  /C=US/O=U.S. Government/OU=ECA/OU=Certification
Authorities/CN=ORC ECA
WARNING: Certificate verification error for server2.csci-va.com: self signed
certificate in certificate chain
 
---request begin---
GET
/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006e
b413/$FILE/TR%202004.018%20AEGIS%20TEST%20PL
AN..pdf HTTP/1.0
User-Agent: Wget/1.10.2
Accept: */*
Host: server2.csci-va.com
 
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: Lotus-Domino
Date: Thu, 01 Mar 2007 15:57:55 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 6720
Pragma: no-cache
 
---response end---
 
  HTTP/1.1 200 OK
  Server: Lotus-Domino
  Date: Thu, 01 Mar 2007 15:57:55 GMT
  Connection: close
  Expires: Tue, 01 Jan 1980 06:00:00 GMT
  Content-Type: text/html; charset=UTF-8
  Content-Length: 6720
  Pragma: no-cache
Length: ignored [text/html]
 
[ =
] 6,720 --.--K/s
 
Closed 1932/SSL 0x9318c8
11:01:08 (309.48 KB/s) - `TR 2004.018 AEGIS TEST PLAN..pdf.4' saved [6720]
 

C:\Backup_CD\WGET
 
Dewayne R. Smith 

SPAWAR Systems Center Charleston 

Code 613, Special Projects Branch 

Office (843) 218-4393

Mobile (843) 696-9472

 


RE: how to get images into a new directory/filename heirarchy? [GishPuppy]

2007-02-23 Thread Tony Lewis
If it were me, I'd grab all the files to my local drive and then write
scripts to do the moving and renaming.
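For example, a rough sketch (untested; the map file and naming scheme are invented for illustration, and query strings or non-default local names would need extra handling):

  #!/bin/sh
  # Fetch everything first, keeping host/path directories, then rename
  # according to a "url new-name" mapping prepared in advance.
  wget -x -i urls.txt -P cache/
  while read url newname; do
      mv "cache/${url#http://}" "flat/$newname"
  done < rename-map.txt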

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 23, 2007 1:33 AM
To: wget@sunsite.dk
Subject: how to get images into a new directory/filename heirarchy?
[GishPuppy]

Hi,

I'm trying to use wget to download 100s of JPGs into a cache server with a
different directory/filename heirarchy. What I tried to do was to create a
text or html file with 1 line for each download (e.g. URL -nd -P [new-path]
-O [new-filename]) and use the --input-file= switch, However, I discovered
that I cannot rename the path/filename of the file inside the input file.

Also, the JPGs will not all come from the same domain but they need to be
placed in a flattened directory tree with different filenames.

Can anyone offer me advice on how to best accomplish this? I'm using the
windows platform.

m.

Gishpuppy | To reply to this email, click here: 
http://www.gishpuppy.com/cgi-bin/[EMAIL PROTECTED]



RE: php form

2007-02-22 Thread Tony Lewis
The table stuff just affects what's shown on the user's screen. It's the
<input> field that affects what goes to the server; in this case, that's
<input ... name="country" ...> so you want to post country=US. If there were
multiple fields, you would separate them with ampersands, such as
country=US&state=CA.
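In wget terms, that boils down to something like this (sketch only; the host and path are placeholders, and the login step Alan mentions would have to be handled first, e.g. via cookies):

  wget --post-data='country=US' -O results.html \
       http://example.com/full_search.php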
 
Tony

  _  

From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 5:27 PM
To: Tony Lewis; wget@sunsite.dk
Subject: Re: php form


Tony,
Thanks.  I have to log in with username/password, and I think I
know how to do that with wget using POST.  For the actual search page, the
HTML source says it's:
 
<form action="full_search.php" method="POST">
 
However, I'm not clear on how to convey the data for the search.  
 
The search form has defined a table.  One of the entries, for example, is:
 
<tr>
  <td><b><font face="Arial">Search by Country:</font></b></td>
  <td><input type="text" name="country" size="50" maxlength="100"></td>
</tr>
 
If I want to use wget to search for entries in the U.S. (US), then how do
I convey this when I post to the php?
 
Thanks, Alan 

- Original Message - 
From: Tony Lewis mailto:[EMAIL PROTECTED]  
To: 'Alan Thomas' mailto:[EMAIL PROTECTED]  ; wget@sunsite.dk 
Sent: Thursday, February 22, 2007 12:53 AM
Subject: RE: php form

Look for <form action="some-web-page" method="XXX" ...>
 
action tells you where the form fields are sent.
 
method tells you if the server is expecting the data to be sent using a GET
or POST command; GET is the default. In the case of GET, the arguments go
into the URL. If method is POST, follow the instructions in the manual.
 
Hope that helps.
 
Tony

  _  

From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 21, 2007 4:39 PM
To: wget@sunsite.dk
Subject: php form


There is a database on a web server (to which I have access) that is
accessible via username/password.  The only way for users to access the
database is to use a form with search criteria and then press a button that
starts a php script that produces a web page with the results of the search.
 
I have a couple of questions:
 
1.  Is there any easy way to know exactly what commands are behind the
button, to duplicate them?
 
2.  If so, then do I just use the POST command as described in the manual,
after logging in (per the manual), to get the data it provides.  
 
I have used wget just a little, but I am completely new to php.  
 
Thanks, Alan
 
 
 



RE: SI units

2007-01-15 Thread Tony Lewis
Lars Hamren wrote: 

 Download speeds are reported as K/s, where, I assume, K is short for
kilobytes.

 The correct SI prefix for thousand is k, not K:

http://physics.nist.gov/cuu/Units/prefixes.html


SI units are for decimal-based numbers (that is powers of 10) whereas
computer programs typically use binary-based numbers (powers of 2). It's
convenient for humans to equate 10^3 (1,000) with 2^10 (1,024) but with
large numbers, these values quickly diverge: 999k or 999 * 10^3 = 999,000,
but 999K or 999 * 2^10 = 1,022,976.

For what it's worth, according to Wikipedia either k or K is acceptable for
1024:
  http://en.wikipedia.org/wiki/Binary_prefix



RE: SI units

2007-01-15 Thread Tony Lewis
Christoph Anton Mitterer wrote: 

 I don't agree with that,.. SI units like K/M/G etc. are specified by
 international standards and those specify them as 10^x.

 The IEC defined in IEC 60027 symbols for the use with base 2 (e.g. Ki, Mi,
Gi)

All of this is described in the Wikipedia article I referenced.

It's true that International Electrotechnical Commission prefers the term
kibibytes and the prefix Ki for 1,024, but it's still not a term commonly
used in computer standards.

Searching ietf.org there are 1,880 matches for kilobytes and only 2 for
kibibytes and those are both feedback from one individual arguing for the
use of kibibytes instead of kilobytes.

Searching gnu.org there are 452 matches for kilobytes and only 5 for
kibibytes and even then, the following appears:  `KiB' kibibyte: 2^10 =
1024. `K' is special: the SI prefix is `k' and the IEC 60027-2 prefix is
`Ki', but tradition and POSIX use `k' to mean `KiB'.

It seems odd to me that one would suggest that wget is the place to start
changing the long-established trend of using 'k' for 1,024.



RE: wget question (connect multiple times)

2006-10-17 Thread Tony Lewis
A) This is the list for reporting bugs. Questions should go to
wget@sunsite.dk

B) wget does not support downloading the same file multiple times simultaneously

C) The decreased per-file download time you're seeing is (probably) because
wget is reusing its connection to the server to download the second file. It
takes some time to set up a connection to the server regardless of whether
you're downloading one byte or one gigabyte of data. For small files, the
set up time can be a significant part of the overall download time.

Hope that helps!

Tony
-Original Message-
From: t u [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 17, 2006 3:50 PM
To: [EMAIL PROTECTED]
Subject: wget question (connect multiple times)


hi,
I hope it is okay to drop a question here.

I recently found that if wget downloads one file, my download speed will be
Y, but if wget downloads two separate files (from the same server, doesn't
matter), the download speed for each of the files will be Y (so my network
speed will go up to 2 x Y).

So my question is, can I make wget download the same file multiple times
simultaneously? In a way, it would run as multiple processes and download
parts of the file at the same time, speeding up the download.

Hope I could explain my question, sorry about the bad english.

Thanks

PS. Please consider this as an enhancement request if wget cannot get a file
by downloading parts of it simultaneously.



RE: I got one bug on Mac OS X

2006-07-15 Thread Tony Lewis



I don't think that's valid HTML. According to RFC 1866: "An HTML user
agent should treat end of line in any of its variations as a word space in
all contexts except preformatted text." I don't see any provision for end of
line within the HREF attribute of an <A> tag.

Tony


From: HUAZHANG GUO [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 11, 2006 7:48 AM
To: [EMAIL PROTECTED]
Subject: I got one bug on Mac OS X

Dear Sir/Madam,
while I was trying to download using the command:

wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/

I got most of the files, but lost some of them.

I think I know where the problem is:

if the link is broken into two lines in the index.html:

<P>Lecture 1 (Jan 17): Exploring Conformational Space for Biomolecules
<A HREF="http://dasher.wustl.edu/bio5476/lectures
/lecture-01.pdf">[PDF]</A></P>

I will get the following error message:

--09:13:16-- http://dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf
           => `/Users/hguo/mywww//dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf'
Connecting to dasher.wustl.edu[128.252.208.48]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character '%0A' to the URL. Maybe the
Windows newline has one more character which is not recognized by Mac
wget.

I am using Mac OS X, Tiger (Darwin).


Thanks!








RE: wget - Returning URL/Links

2006-07-10 Thread Tony Lewis
Mauro Tortonesi wrote:

 perhaps we should modify wget in order to print the list of touched
 URLs as well? maybe only in case -v is given? what do you think?

On June 28, 2005, I submitted a patch to write unfollowed links to a file.
It would be pretty simple to have a similar --followed-links option.

Tony



RE: BUG

2006-07-03 Thread Tony Lewis

Run the command with -d and post the output here.


Tony

_ 

From:  Junior + Suporte [mailto:[EMAIL PROTECTED]] 

Sent: Monday, July 03, 2006 2:00 PM

To: [EMAIL PROTECTED]

Subject: BUG


Dear,


I am using wget to send a login request to a site; when wget is saving the cookies, the following error message appears:


Error in Set-Cookie, field `Path'Syntax error in Set-Cookie: tu=661541|802400391

@TERRA.COM.BR; Expires=Thu, 14-Oct-2055 20:52:46 GMT; Path= at position 78.

Location: http://www.tramauniversitario.com.br/servlet/login.jsp?username=802400

391%40terra.com.brpass=123qwerd=http%3A%2F%2Fwww.tramauniversitario.com.br%2Ft

uv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp [following]


I trying to access URL http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp[EMAIL PROTECTED]pass=123qweSubmit.x=6Submit.y=1

In Internet Explorer, this URL works correctly and the cookie is saved on the local machine, but in WGET, this cookie returns an error. 

Thanks,


Luiz Carlos Zancanella Junior





RE: wget - tracking urls/web crawling

2006-06-23 Thread Tony Lewis
Bruce wrote: 

 any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started working on the feature or not.

Tony



RE: Batch files in DOS

2006-06-05 Thread Tony Lewis
I think there is a limit to the number of characters that DOS will accept on
the command line (perhaps around 256). Try putting echo in front of the
command in your batch file and see how much of it gets echoed back to you.
As Tobias suggested, you can try moving some of your command line options
into the .wgetrc file.
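For instance, a .wgetrc along these lines (a sketch; these are the settings that correspond to the flags in your command) would shorten each batch line to just the -D list and the address:

  # .wgetrc
  recursive = on
  reclevel = inf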

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, June 03, 2006 2:46 PM
To: wget@sunsite.dk
Subject: Batch files in DOS

I'm trying to mirror about 100 servers (small fanfic sites) using wget
--recursive --level=inf -Dblah.com,blah.com,blah.com some_address. However,
when I run the batch file, it stops reading after a while; apparently my
command has too many characters.  Is there some other way I should be doing
this, or a workaround?

GNU Wget 1.10.1 running on Windows 98

-- 

http://www.aericanempire.com/



RE: I cannot get the images

2006-05-15 Thread Tony Lewis
The problem is your accept list; -A*.* says to accept any file that contains
at least one dot in the file name and
GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image%2Fgif&scale=3 doesn't contain
any dots.

I think you want to accept all files so just delete -A*.* from your argument
list because the default behavior is to accept everything.

Tony
-Original Message-
From: matis [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 15, 2006 6:09 AM
To: wget@sunsite.dk
Subject: I cannot get the images

Hi,
I'm trying to get a whole directory but images from the database are ignored.
If you paste the address below this post into the browser (or even FlashGet)
it will download the image and open it with a default extension .gif. But wget
reports the file should be removed and then removes it :/ . As a result, when
there's a picture on every page (with the address as below) only empty HTML
files are downloaded. Does anybody know what to do?

The address (with the wget command used by me):
wget --cache=off -p -m -erobots=off -t10 -v -A*.*
http://alo.uibk.ac.at:80/filestore/servlet/GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image/gif&scale=3
whole html address (broken):
http://www.literature.at/webinterface/library/ALO-BOOK_V01?objid=13017&page=3&zoom=3

regards
matis



RE: wget www.openbc.com post-data/cookie problem

2006-05-04 Thread Tony Lewis
Erich Steinboeck wrote:

 Is there a way to trace the browser traffic and compare
 that to the wget traffic, to see where they differ.

You can use a web proxy. I like Achilles:
http://www.mavensecurity.com/achilles 

Tony



RE: Defining url in .wgetrc

2006-04-20 Thread Tony Lewis
ks wrote: 

 Just one more question.
 Something like this inside somefile.txt

 http://fly.srk.fer.hr/
 -r http://www.gnu.org/ -o gnulog
 -S http://www.lycos.com/

Why not use a batch file or command script (depending on what OS you're
using) containing something like:

wget http://fly.srk.fer.hr
wget -r http://www.gnu.org -o gnulog
wget -S http://www.lycos.com

Tony



RE: Windows Title Bar

2006-04-18 Thread Tony Lewis
Hrvoje Niksic wrote:

 Anyway, adding further customizations to an already questionnable feature
 is IMHO not a very good idea. 

Perhaps Derek would be happy if there were a way to turn off this
questionable feature.

Tony



RE: dose wget auto-convert the downloaded text file?

2006-04-16 Thread Tony Lewis
18 mao [EMAIL PROTECTED] wrote:

 then  save the page as 2.html with the FireFox browser

You should not assume that the file saved by any browser is the same as the
file delivered to the browser by the server. The browser is probably
manipulating line endings to match the conventions on your operating system
when it saves files so that CR, CR-LF, or LF, all become CR-LF (or whatever
your OS uses for line endings).

Tony



RE: download of images linkes in css does not work

2006-04-13 Thread Tony Lewis
It's not a bug; it's a (missing) feature. 
-Original Message-
From: Detlef Girke [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 13, 2006 3:17 AM
To: [EMAIL PROTECTED]
Subject: download of images linkes in css does not work

Hello,
I tried everything, but images, built in via CSS are neither downloaded nor
related with wget.
Example (inline style):
CSS-Terms like

<div style="background-image: url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);" id="pfad">

do not have any effect on the downloaded web-page.

The same thing happens, when you write

{background-image : url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);}

into a css-file.

Maybe other references in CSS do not work either. Perhaps you can prove
this.

If you could fix this problem, wget would be the best tool for me.
Thank you and best regards
Detlef


--
Detlef Girke, BIK Hamburg, Beratung, Tests und Workshops c/o DIAS GmbH,
Neuer Pferdemarkt 1, 20359 Hamburg [EMAIL PROTECTED],
www.bik-online.info, 040 43187513,Fax 040 43187519



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

 no. i was talking about regexps. they are more expressive
 and powerful than simple globs. i don't see what's the
 point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just
fine.
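A rough sketch of that transformation (only '*', '?', and '.' are handled here; a real version would escape every regexp metacharacter):

  #include <stddef.h>

  /* Convert a simple glob into an anchored regexp, e.g. "*.pdf"
     becomes "^.*\.pdf$".  Sketch only. */
  static void
  glob_to_regex (const char *glob, char *re, size_t size)
  {
    size_t i = 0;
    if (size == 0) return;
    if (i < size - 1) re[i++] = '^';
    for (; *glob && i + 2 < size; glob++)
      switch (*glob)
        {
        case '*': re[i++] = '.'; re[i++] = '*'; break;
        case '?': re[i++] = '.'; break;
        case '.': re[i++] = '\\'; re[i++] = '.'; break;
        default:  re[i++] = *glob; break;
        }
    if (i < size - 1) re[i++] = '$';
    re[i] = '\0';
  }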

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ,r means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who know what they're doing are mostly subscribed to this
mailing list; the rest of them send us mail saying please CC me as I'm not
on the list. :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match path/to/sub/dir? (I suspect it shouldn't.)

Tony



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

 But that misses the point, which is that we *want* to make the
 more expressive language, already used elsewhere on Unix, the
 default.

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ,r complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ,r is not requested.
That's a straightforward transformation.

Tony



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't see a clear line that connects --filter to glob patterns as used
 by the shell.

I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf

Note that *.pdf is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.

Tony



RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)

Consider http://www.example.com/path/to/script.cgi?foo=bar

--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against www.example.com
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar
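
Purely to illustrate the mapping (the names and splitting rules here are
hypothetical, not a proposed Wget API), a small C program that carves the
example URL into those five pieces:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  const char *uri = "http://www.example.com/path/to/script.cgi?foo=bar";

  const char *domain = strstr (uri, "://");
  domain = domain ? domain + 3 : uri;

  const char *path = strchr (domain, '/');     /* path starts at first '/' */
  const char *query = path ? strchr (path, '?') : NULL;

  size_t domain_len = path ? (size_t) (path - domain) : strlen (domain);
  size_t path_len = path
    ? (query ? (size_t) (query - path) : strlen (path)) : 0;

  /* "file" = everything after the last '/' of the path part. */
  const char *file = NULL;
  if (path)
    {
      const char *p;
      for (p = path; p < path + path_len; p++)
        if (*p == '/')
          file = p + 1;
    }

  printf ("uri:    %s\n", uri);
  printf ("domain: %.*s\n", (int) domain_len, domain);
  if (path)
    printf ("path:   %.*s\n", (int) path_len, path);
  if (file)
    printf ("file:   %.*s\n", (int) (path + path_len - file), file);
  if (query)
    printf ("query:  %s\n", query + 1);
  return 0;
}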

I think there are good arguments for and against matching against the file
name in path:

Tony



RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote:

 Also any way to add modifiers to the regexs? 

Perhaps --filter=path,i:/path/to/krs would work.

Tony



RE: Bug in ETA code on x64

2006-03-28 Thread Tony Lewis
Hrvoje Niksic wrote:

 The cast to int looks like someone was trying to remove a warning and
 botched operator precedence in the process.

I can't see any good reason to use the comma operator here. Why not write the line as:
  eta_hrs = eta / 3600; eta %= 3600;

This makes it much less likely that someone will make a coding error while
editing that section of code.
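
For illustration only (the variable names are hypothetical and this is not
the actual progress code), the full breakdown written in that
one-statement-per-line style:

#include <stdio.h>

int
main (void)
{
  long eta = 100000;                /* seconds remaining (example value) */
  long eta_days, eta_hrs, eta_min, eta_sec;

  eta_days = eta / 86400; eta %= 86400;
  eta_hrs  = eta / 3600;  eta %= 3600;
  eta_min  = eta / 60;    eta %= 60;
  eta_sec  = eta;

  printf ("%ldd %ldh %ldm %lds\n", eta_days, eta_hrs, eta_min, eta_sec);
  return 0;
}

No comma operators and no casts, so the intent of each line survives a
careless edit.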

Tony



RE: wget option (idea for recursive ftp/globbing)

2006-03-02 Thread Tony Lewis
Mauro Tortonesi wrote: 

 i would like to read other users' opinion before deciding which
 course of action to take, though.

Other users have suggested adding a command line option for -a two or
three times in the past:

- 2002-11-24: Steve Friedl [EMAIL PROTECTED] submitted a patch
- 2002-12-24: Maaged Mazyek [EMAIL PROTECTED] submitted a patch
- 2005-05-09: B Wooster [EMAIL PROTECTED] asked if the fix was ever
going to be implemented
- 2005-08-19: Carl G. Ponder [EMAIL PROTECTED] asked if the patches
were going to be applied
- 2005-08-20: Hrvoje responded by posting his own patch for --list-options

(and that's just what I can find in my local archive searching for list
-a)

There is clearly a need among the user community for a feature like this and
lots of ideas about how to implement it. I'd say you should pick one and
implement it.

If you need copies of any of the patches mentioned in the list above, let me
know.

Tony



RE: wget 1.10.x fixed recursive ftp download over proxy

2006-01-09 Thread Tony Lewis



I believe the following simplified code would have the same effect:

if ((opt.recursive || opt.page_requisites || opt.use_proxy)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

Tony



From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of CHEN Peng
Sent: Monday, January 09, 2006 12:38 AM
To: [EMAIL PROTECTED]
Subject: wget 1.10.x fixed recursive ftp download over proxy
Hi,
We once encountered an annoying problem recursively downloading FTP data
using wget through an ftp-over-http proxy. At first it was the proxy firmware
that did not support recursive downloads, but even after upgrading we realized
there is a problem with wget itself as well.

We found that with the new proxy firmware, the older wget 1.7.x can download
FTP data recursively, but the newer versions (1.9.x and 1.10.x) can not. That
means there must be something wrong with the code.

I also confirmed this has been a known bug in wget since 2003, and it is
strange that it has not been fixed for such a long time.

To fix this problem, I took some time to analyze the code, and it turns out
wget uses different methods to get the list of files for a destination folder
when doing a recursive download. For plain FTP, it uses the FTP command "LIST"
to get the file listing. For plain HTTP, it uses its internal function
retrieve_tree() to generate the list.

In main.c, wget does not use the retrieve_tree() function to generate the list
if the traffic is FTP. However, when we use an ftp-over-http proxy, the actual
request to the server is an HTTP request, where the "LIST" FTP command won't
work, so we only get one "index.html" file.

if ((opt.recursive || opt.page_requisites)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

In this scenario, we need to modify the code to force wget to call
retrieve_tree() for FTP traffic if a proxy is involved:

if ((opt.recursive || opt.page_requisites)
    // && url_scheme (*t) != SCHEME_FTP)
    && ((url_scheme (*t) != SCHEME_FTP)
        || (opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

After patching main.c, the new wget works perfectly for recursive FTP
downloads, both with and without a proxy. The patch works for 1.9.x and 1.10.x
up to the latest version so far (1.10.2).

--
CHEN Peng [EMAIL PROTECTED]



RE: spaces in pathnames using --directory-prefix=prefix

2005-11-30 Thread Tony Lewis
Jonathan DeGumbia wrote:
 
 I'm trying to use the --directory-prefix=prefix option for wget on a
 Windows system.  My prefix has spaces in the path directories.  Wget
 appears to terminate the path at the first space encountered.   In other
 words if my prefix is: c:/my prefix/   then wget copies files to c:/my/ .

 Is there a work-around for this?

wget is not terminating the path at the space; the Windows command processor is.
In the same way that you have to enter:
dir "c:\my prefix"
to list the contents of the directory, you have to enter:
wget --directory-prefix="c:/my prefix"
or the command processor will split the directory path at the space before
passing it to wget.

Tony




RE: Error connecting to target server

2005-11-11 Thread Tony Lewis
[EMAIL PROTECTED] wrote:

 Thanks for your reply. Only ping works for bbc.com and not wget.

When I issue the command wget www.bbc.com, it successfully downloads the
following file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Refresh" content="0; URL=http://www.bbc.co.uk/?ok">
<TITLE>British Broadcasting Corporation </TITLE>
</HEAD>
<BODY BGCOLOR="white">
</BODY>
</HTML>


You might want to try "wget http://www.bbc.co.uk" instead.

I think http://www.gnu.org/software/wget/faq.html should have another
question: Why did my download fail and how can I get it to work? In the
answer to that question we should mention all the common failure modes:
disallowed by robots.txt, need to set user agent to look like a browser,
META refresh (as above), etc. along with the command line options to resolve
the failure.

Also, perhaps the next version of wget can handle META refresh.

Tony




RE: wget can't handle large files

2005-10-18 Thread Tony Lewis
Eberhard Wolff wrote: 

 Apparently wget can't handle large file.
[snip]
 wget --version GNU Wget 1.8.2

This bug was fixed in version 1.10 of wget. You should obtain a copy of
the latest version, 1.10.2.

Tony




RE: Wget patches for .files

2005-08-19 Thread Tony Lewis
Mauro Tortonesi wrote: 

 this is a very interesting point, but the patch you mentioned above uses
the
 LIST -a FTP command, which AFAIK is not supported by all FTP servers.

As I recall, that's why the patch was not accepted. However, it would be
useful if there were some command line option to affect the LIST parameters.
Perhaps something like:

wget ftp://ftp.somesite.com --ftp-list=-a

Tony




RE: wget a file with long path on Windows XP

2005-07-21 Thread Tony Lewis
PoWah Wong wrote: 

 The login page is:
 http://safari.informit.com/?FPI=&uicode=

 How to figure out the login command?

 These two commands do not work:

 wget --save-cookies cookies.txt http://safari.informit.com/?FPI= [snip]
 wget --save-cookies cookies.txt
http://safari.informit.com/?FPI=&uicode=/login.php? [snip]

When trying to recreate a form in wget, you have to send the data the server
is expecting to receive to the location the server is expecting to receive
it. You have to look at the login page for the login form and recreate it.
In your browser, view the source to http://safari.informit.com/?FPI=&uicode=
and you will find the form that appears below. Note that I stripped out
formatting information for the table that contains the form and reformatted
what was left to make it readable.

<form action="JVXSL.asp" method="post">
  <input type="hidden" name="s" value="1">
  <input type="hidden" name="o" value="1">
  <input type="hidden" name="b" value="1">
  <input type="hidden" name="t" value="1">
  <input type="hidden" name="f" value="1">
  <input type="hidden" name="c" value="1">
  <input type="hidden" name="u" value="1">
  <input type="hidden" name="r" value="">
  <input type="hidden" name="l" value="1">
  <input type="hidden" name="g" value="">
  <input type="hidden" name="n" value="1">
  <input type="hidden" name="d" value="1">
  <input type="hidden" name="a" value="0">
  <input tabindex="1" name="usr" id="usr" type="text" value="" size="12">
  <input name="pwd" id="pwd" tabindex="1" type="password" value="" size="12">
  <input type="checkbox" tabindex="1" name="savepwd" id="savepwd" value="1">
  <input type="image" name="Login" src="images/btn_login.gif" alt="Login"
         width="40" height="16" border="0" tabindex="1" align="absmiddle">
</form>

Note that the server expects the data to be posted to JVXSL.asp and that
there are a bunch of fields that must be supplied in order for the server to
process the login request. In addition, the two fields you supply are called
usr and pwd. So your first wget command line will look something like
this:

wget --save-cookies cookies.txt "http://safari.informit.com/JVXSL.asp"
--post-data="s=1&o=1&b=1&t=1&f=1&c=1&u=1&r=&l=1&g=&n=1&d=1&a=0&usr=wong_powa
[EMAIL PROTECTED]&pwd=123&savepwd=1"
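
Once the login works, later requests would typically reuse the saved session
with something like "wget --load-cookies cookies.txt" followed by the URL of
the page you actually want. (The hidden-field list above is whatever the site
serves today, so it may change.)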

Hope that helps!

Tony




RE: connect to server/request multiple pages

2005-07-21 Thread Tony Lewis



Pat Malatack wrote:

 is there a way to stay connected, because it seems to me that this takes a
 decent amount of time that could be minimized

The following command will do what you want:

wget "google.com/news" "google.com/froogle"

Tony


RE: Invalid directory names created by wget

2005-07-08 Thread Tony Lewis
Larry Jones wrote: 

 Of course it's directly accessible -- you just have to quote it to keep
the
 shell from processing the parentheses:

   cd 'title.Die-Struck+(Gold+on+Gold)+Lapel+Pins'

You can also make the individual characters into literals:

cd title.Die-Struck+\(Gold+on+Gold\)+Lapel+Pins

Tony




Name or service not known error

2005-06-27 Thread Tony Lewis



I got a "Name or service not known" error from wget 1.10 running on Linux.
When I installed an earlier version of wget, it worked just fine. It also
works just fine on version 1.10 running on Windows. Any ideas?

Here's the output on Linux:

wget --version
GNU Wget 1.9-beta1

wget http://www.calottery.com/Games/MegaMillions/
--17:29:59--  http://www.calottery.com/Games/MegaMillions/
           => `index.html.8'
Resolving www.calottery.com... 64.164.108.164, 64.164.108.202
Connecting to www.calottery.com[64.164.108.164]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45,166 [text/html]

100%[==================================================>] 45,166  166.21K/s

17:30:01 (166.17 KB/s) - `index.html.8' saved [45166/45166]

<snip output from make install of 1.10 here>

wget --version
GNU Wget 1.10

wget http://www.calottery.com/Games/MegaMillions/
--17:30:17--  http://www.calottery.com/Games/MegaMillions/
           => `index.html.9'
Resolving www.calottery.com... failed: Name or service not known.


RE: Removing thousand separators from file size output

2005-06-24 Thread Tony Lewis
Hrvoje Niksic wrote: 

 In fact, I know of no application that accepts numbers as Wget prints
them.

Microsoft Calculator does.

Tony




RE: Is it just that the -m (mirror) option an impossible task [Was: wget 1.91 skips most files]

2005-05-28 Thread Tony Lewis
Maurice Volaski wrote:

 wget's -m option seems to be able to ignore most of the files it should
 download from a site. Is this simply because wget can download only the
 files it can see? That is, if the web server's directory indexing option
 is off and a page on the site is present on the server, but it isn't
 referenced by any publicly viewable page, wget simply can't see it. 

I've been thinking about coding a --extra-sensory-perception option that
would cause wget to read the mind of the server so that it can download
files that it cannot see. As soon as I get the algorithm worked out, I'll be
submitting the patch. So far I've figured out how to download index.html
without being able to see it, but I'm sure that if I keep working at it that
wget will be able to detect the rest of the files it cannot see. Of course,
I could just be taking the wrong approach; it may work better if I try to
implement the --psychic option instead.

Tony




RE: links conversion; non-existent index.html

2005-05-01 Thread Tony Lewis
Andrzej wrote:

 Two problems:

 There is no index.html under this link:
 http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
[snip]
 it creates a non existing link:
 http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html

When you specify a directory, it is up to the web server to determine what
resource gets returned. Some web servers will return a directory listing,
some will return some file (such as index.html), and others will return an
error.

For example, Apache might return (in this order): index.html, index.htm, a
directory listing (or a 403 Forbidden response if the configuration
disallows directory listings). The actual list of files that Apache will
search for and the order in which they are selected is determined by the
configuration.
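
For the record, in Apache that search list is normally set with a directive
along the lines of "DirectoryIndex index.html index.htm"; if none of the
listed files exists, whether you get a generated listing or a 403 depends on
whether "Options Indexes" is enabled for that directory.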

If the web server returns any information, wget has to save the information
that is returned in *some* local file. It chooses to name that local file
index.html since it has no way of knowing where the information might have
actually been stored on the server.

Hope that helps,

Tony





RE: SSL options

2005-04-21 Thread Tony Lewis
Hrvoje Niksic wrote:

 The question is what should we do for 1.10?  Document the
 unreadable names and cryptic values, and have to support
 them until eternity?

My vote is to change them to more reasonable syntax (as you suggested
earlier in the note) for 1.10 and include the new syntax in the
documentation. However, I think wget should to continue to support the old
options and syntax as alternatives in case people have included them in
scripts.

Tony




RE: newbie question

2005-04-14 Thread Tony Lewis
Alan Thomas wrote:

 I am having trouble getting the files I want using a wildcard specifier...

There are no options on the command line for what you're attempting to do.

Neither wget nor the server you're contacting understands *.pdf in a URI.
In the case of wget, it is designed to read web pages (HTML files) and then
collect a list of resources that are referenced in those pages, which it
then retrieves. In the case of the web server, it is designed to return
individual objects on request (X.pdf or Y.pdf, but not *.pdf). Some web
servers will return a list of files if you specify a directory, but you
already tried that in your first use case.

Try coming at this from a different direction. If you were going to manually
download every PDF from that directory, how would YOU figure out the names
of each one? Is there a web page that contains a list somewhere? If so,
point wget there.
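
If such a page does exist, a command along these lines (URL hypothetical)
usually does the trick:

wget -r -l1 -nd -np -A pdf http://www.example.com/reports/index.html

Here -r and -l1 follow just the links on that one page, -nd keeps everything
in the current directory, -np keeps wget from wandering up the tree, and
-A pdf discards anything whose name does not end in .pdf.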

Hope that helps.

Tony

PS) Jens was mistaken when he said that https requires you to log into the
server. Some servers may require authentication before returning information
over a secure (https) channel, but that is not a given.




RE: File rejection is not working

2005-04-06 Thread Tony Lewis
Jens Rösner wrote: 

 AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is
 not implemented at the moment.

It seems all the examples people are sending are just attempting to get a
match that is not case sensitive. A switch to ignore case in the file name
match would be a lot easier to implement than regular expressions and solve
the most pressing need.

Just a thought.

Tony




RE: help!!!

2005-03-21 Thread Tony Lewis
The --post-data option was added in version 1.9. You need to upgrade your
version of wget. 

Tony
-Original Message-
From: Richard Emanilov [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 21, 2005 8:49 AM
To: Tony Lewis; [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: help!!!

wget --http-user=login --http-passwd=passwd
--post-data=login=login&password=passwd https://site

wget: unrecognized option `--post-data=login=login&password=password'
Usage: wget [OPTION]... [URL]... 


wget --http-user=login --http-passwd=passwd
--http-post=login=login&password=password https:site
wget: unrecognized option `--http-post=login=login&password=passwd'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

wget -V
GNU Wget 1.8.2


Richard Emanilov
 
[EMAIL PROTECTED]


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 10:26 AM
To: wget@sunsite.dk
Cc: Richard Emanilov
Subject: RE: help!!!

Richard Emanilov wrote:

 Below is what I have tried with no success

 wget --http-user=login --http-passwd=passwd
 --http-post=login=login&password=passwd

That should be:
wget --http-user=login --http-passwd=passwd
--post-data=login=login&password=passwd

Tony






RE: Curb maximum size of headers

2005-03-17 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't see how and why a web site would generate headers (not bodies, to
 be sure) larger than 64k.

To be honest, I'm less concerned about the 64K header limit than I am about
limiting a header line to 4096 bytes. I don't know any sites that send back
header lines that long, but they could. Who's to say some site doesn't have
a 4K cookie?

Since you already proposing to limit the entire header to 64K, what is
gained by adding this second limit?

Tony




RE: one bug?

2005-03-04 Thread Tony Lewis



Jesus Legido wrote:

 I'm getting a file from https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt:

The problem is not with wget. The file on the server starts with 0xFF 0xFE.
Put the following into an HTML file (say temp.html) on your hard drive, open
it in your web browser, right click on the link and do a "Save As..." to your
hard drive. You will get the same thing as wget downloaded.

<html><body><a href="https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt">ea.txt</a></body></html>



RE: wget: question about tag

2005-02-02 Thread Tony Lewis
 
Normand Savard wrote:

 I have a question about wget.  Is it possible to download attribute
 values other than the hardcoded ones?

No, at least not in the existing versions of wget. I have not heard that
anyone is working on such an enhancement.




RE: new string module

2005-01-05 Thread Tony Lewis
Mauro Tortonesi wrote:

 At 18:28 on Wednesday, 5 January 2005, Dražen Kačar wrote:
  Jan Minar wrote:
   What's wrong with mbrtowc(3) and friends?  The mysterious solution
   is probably to use wprintf(3) instead of printf(3).  A couple of
   questions on #c on freenode would give you that answer.

  Historically, the wget source was written in a way which allowed one to
  compile it on really old systems. That would rule out C95 functions.

  (I'm not advocating this approach, just answering the question.)

 As long as I am the maintainer of wget, backward compatibility on very old or
 legacy systems will NOT be broken.

I don't think it has to be an either/or situation. With well-selected #if
statements, you should be able to have something that works on legacy systems 
while still providing wide character support on more modern operating systems.

I'm not volunteering to determine what those #if statements might be :-) ... 
just pointing out the possibility.
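
Purely as a sketch (not actual Wget code; HAVE_WCHAR_H and HAVE_MBRTOWC are
assumed autoconf-style macro names), something along these lines keeps the old
byte-oriented path as the default and only compiles in the C95 calls where
configure found them:

#include <locale.h>
#include <stdio.h>
#include <string.h>

#if defined HAVE_WCHAR_H && defined HAVE_MBRTOWC
# include <wchar.h>
#endif

/* Count characters (not bytes) in a possibly multibyte string. */
static size_t
display_length (const char *s)
{
#if defined HAVE_WCHAR_H && defined HAVE_MBRTOWC
  mbstate_t st;
  size_t n = 0, len = strlen (s);
  memset (&st, 0, sizeof st);
  while (len > 0)
    {
      size_t used = mbrtowc (NULL, s, len, &st);
      if (used == (size_t) -1 || used == (size_t) -2)
        return strlen (s);        /* invalid sequence: fall back to bytes */
      if (used == 0)
        break;                    /* hit an embedded NUL */
      s += used;
      len -= used;
      n++;
    }
  return n;
#else
  return strlen (s);              /* legacy systems: one byte, one character */
#endif
}

int
main (void)
{
  setlocale (LC_CTYPE, "");
  printf ("%lu\n", (unsigned long) display_length ("wget"));
  return 0;
}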

Tony




RE: Metric units

2004-12-23 Thread Tony Lewis
John J Foerch wrote:

 It seems that the system of using the metric prefixes for numbers 2^n is a
 simple accident of history.  Any thoughts on this?

I would say that the practice of using powers of 10 for K and M is a
response to people who cannot think in binary.

Tony




RE: Metric units

2004-12-23 Thread Tony Lewis
Carlos Villegas snidely wrote:

 I would say that the original poster understands what he is saying, and
you clearly don't...

I'll put my computer science degree up against your business administration
and accounting degree any day.

A kilobyte has always been 1024 bytes and the choice was not accidental.
Computer memory is laid out in bits, which are always powers of two.

There are 10 kinds of people in the world; those who understand binary and
those who don't.

Tony





RE: Metric units

2004-12-23 Thread Tony Lewis
Mark Post wrote: 

 While we're at it, why don't we just round off the value of pi to be 3.0

Do you live in Indiana?

Actually, Dr. Edwin Goodwin wanted to round off pi to any of several values
including 3.2.

http://www.agecon.purdue.edu/crd/Localgov/Second%20Level%20pages/Indiana_Pi_
Story.htm

Tony




RE: date based retrieval

2004-12-19 Thread Tony Lewis
Anthony Caetano wrote:

 I am looking for a way to stay current without mirroring an entire site.

[snip]
 Does anyone else see a use for this?

Yes. Here's my non-wget solution. I truncate all the files in the
directories that I don't want, but maintain the date/time accessed and
modified. The Perl script I use follows. --Tony

#!/usr/bin/perl

$dir = shift @ARGV;
die "No such directory: $dir\n" unless -d $dir;

$nFiles = 0;
my @dirs = ($dir);
while (scalar(@dirs)) {
  processDir(shift @dirs);
}
print "Truncated $nFiles files.\n";

sub processDir
{
  my ($dir) = @_;
  opendir DIR, $dir or die "Cannot read directory: $dir\n";
  foreach $file (readdir DIR)
  {
    next if $file eq "." || $file eq "..";

    $path = "$dir/$file";
    if (-d $path) {
      push @dirs, $path;
    } else {
      my @stat = stat($path);
      $nFiles++;
      # Opening for write truncates the file to zero length.
      open F, ">$path" or next;
      close F;
      # Restore the original access and modification times.
      utime $stat[8], $stat[9], $path;
    }
  }
  closedir DIR;
}




Re: wput mailing list

2004-08-29 Thread Tony Lewis
Justin Gombos wrote:
 Since I feel that computers serve man, not the reverse, I don't intend
 to change my file organization to be web page centric.  Looking
 around the web, I was quite surprised to find that I'm the only one
 with this problem.  I was very relieved to find that there was a wput
 - then disappointed to find that wput doesn't reverse one of the most
 important capabilities of wget; that is, the ability to selectively
 mirror only files that are needed.
Sounds like you've got a legitimate feature request for wput. You should 
submit it to *their* mailing list. Even better, borrow the relevant code 
from wget and upgrade wput yourself.

Tony 



Re: Stratus VOS support

2004-07-28 Thread Tony Lewis
Jonathan Grubb wrote:

 Any thoughts of adding support for Stratus VOS file structures?

Your question is a little too vague -- even for me (I used to work for
Stratus and actually know what VOS is :-)

What file structures are you needing supported that wget does not currently
support? Are you needing support when wget is running on Stratus and saving
files that it downloads from somewhere else? Or are you needing support for
VOS file structures when wget is retrieving a file from a VOS system? If the
latter, does this support only apply if both systems are VOS?

Tony



Re: Stratus VOS support

2004-07-28 Thread Tony Lewis
Jonathan Grubb wrote:


 Um. I'm using wget on Win2000 to ftp to a VOS machine. I'm finding that
 the usual '>' sign for directories isn't supported by wget and that '/'
 doesn't work either, I think because the ftp server itself is expecting '>'.

The problem may be that Win 2000 grabs the '>' before wget ever sees it. Try
putting the path in quotes.

Tony



Re: retrieve a whole website with image embedded in css

2004-07-13 Thread Tony Lewis
Ploc wrote:

 The result is a website very different from the original one as it lacks
 backgrounds.

 Can you please confirm whether what I think is true (or not), whether it is
 registered as a bug, and whether there is a planned date to correct it.

It is true. wget only retrieves objects that appear in the HTML. It does not
parse the CSS or JavaScript used by a site.

Tony



Re: retrieve a whole website with image embedded in css

2004-07-13 Thread Tony Lewis
Ploc wrote:

 Is it already registered as a bug or on a wishlist?

It's not a bug. This feature has been on the wishlist for a long time.

Tony


Re: question on wget via http proxy

2004-07-12 Thread Tony Lewis
Malte Schünemann wrote:

 Since wget is able to obtain directory listings / retrieve data from
 there, it should be possible to also upload data

Then it would be wput. :-)

 What is so special about wget that it is able to perform this task?

You can learn a LOT about how wget is communicating with the target site by
using the --debug argument.
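
For example (any URL will do):

wget --debug -o wget-debug.log http://www.example.com/

and then read wget-debug.log, which records the request and response headers
wget exchanged with the server (or with the proxy, if one is configured).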

Hope that helps a little.

Tony



Re: Escaping semicolons (actually Ampersands)

2004-06-28 Thread Tony Lewis
Phil Endecott wrote:


 Tony> The stuff between the quotes following HREF is not HTML; it
 Tony> is a URL. Hence, it must follow URL rules not HTML rules.

 No, it's both a URL and HTML.  It must follow both rules.

 Please see the page that I cited in my previous message:
 http://www.htmlhelp.com/tools/validator/problems.html#amp

I've looked at hundreds of web pages and I've never seen anyone put &amp;
into an HREF in place of an ampersand.

Tony



Re: Escaping semicolons

2004-06-27 Thread Tony Lewis
Phil Endecott wrote:

 There is not much to go on in terms of specifications.  The closest is
 RFC1738, which includes BNF for a file: URI.  However it is ten years
 old, so whether it reflects current practice I do not know.  But it does
 not allow ; in file: URIs.

 I conclude from this that wget should be replacing ; with its %3b escape
 sequence.

I think you're confusing what wget is required to do with URLs entered on
the command line and what it chooses to do with the resulting files that it
saves. If the unencoded name of a retrieved resource cannot be stored on the
local file system, wget encodes it to create a valid name.

 Tony Lewis wrote:
   I use semicolons in CGI URIs to separate parameters.  (Ampersand
   is more often used for this, but semicolon is also allowed and
   has the advantage that there is no need to escape it in HTML.)
 
  There is no need to escape ampersands either.

 Tony, are you suggesting that this is legal HTML?

   <a href="http://foo.foo/foo.cgi?p1=v1&p2=v2">Foo</a>

 I'm fairly confident that you need to escape the & to make it valid, i.e.

   <a href="http://foo.foo/foo.cgi?p1=v1&amp;p2=v2">Foo</a>

Just out of curiosity, did you try to implement your theory and see what
happens? If you did, you would see that the first version works and the second
does not.

By the way, the correct URI encoding of ampersand is %26, not &amp;. The
latter encoding is used for ampersands in HTML markup.

With regard to whether ampersand needs to be encoded, you're misreading the
RFC:

   Many URL schemes reserve certain characters for a special meaning:
   their appearance in the scheme-specific part of the URL has a
   designated semantics. If the character corresponding to an octet is
   reserved in a scheme, the octet must be encoded.  The characters ";",
   "/", "?", ":", "@", "=" and "&" are the characters which may be
   reserved for special meaning within a scheme. No other characters may
   be reserved within a scheme.

   Usually a URL has the same interpretation when an octet is
   represented by a character and when it encoded. However, this is not
   true for reserved characters: encoding a character reserved for a
   particular scheme may change the semantics of a URL.

The RFC says that you have to escape Reserved characters if that character
appears in the name of the resource you're trying to retrieve. That is, if
you're trying to retrieve a file named a&b.txt, you refer to that file as
a%26b.txt in the URL because you're using the ampersand for a non-reserved
purpose.

If you're using a reserved character for the purpose that it has been
reserved (in this case, separating parameters), you do NOT want to encode
it. The URL you proposed (after correcting the encoding of the ampersand) is
requesting a resource (probably a file) whose name is foo.cgi?p1=v1&p2=v2.
It is NOT requesting that the script foo.cgi be executed with argument p1
having a value of v1 and p2 having a value of v2.

Hope that helps.

Tony



Re: Escaping semicolons

2004-06-24 Thread Tony Lewis
Phil Endecott wrote:

 I am using wget to build a downloadable zip file for offline viewing of
a CGI-intensive web site that I am building.

 Essentially it works, but I am encountering difficulties with semicolons.
I use semicolons in CGI URIs to separate parameters.  (Ampersand is more
often used for this, but semicolon is also allowed and has the advantage
that there is no need to escape it in HTML.)

There is no need to escape ampersands either.

Tony



Re: file name problem

2004-06-01 Thread Tony Lewis
henry luo wrote:

 I found a problem in GNU Wget 1.9.1, but I don't know if it is a new
 feature or a bug.
 The old version (1.8.2) downloads a link, for example:

 wget 'http://www.expekt.com/odds/eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%25'

 with the saved file name

 eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%25

 but the new version (1.9.1) saves it as

 eventsodds.jsp?range=100&sortby=date&active=betting&betcategoryId=SOC%

It is a feature. The latest version of wget converts %nn to the appropriate
character *if* that character is valid in a filename on the target system.
In this case, %25 converts to %, which can appear in a filename.
The --restrict-file-names option gives you some control over which
characters are escaped, but it does not appear to provide the functionality
you're looking for:

--restrict-file-names=MODE'
 Change which characters found in remote URLs may show up in local
 file names generated from those URLs.  Characters that are
 restricted by this option are escaped, i.e. replaced with `%HH',
 where `HH' is the hexadecimal number that corresponds to the
 restricted character.

 By default, Wget escapes the characters that are not valid as part
 of file names on your operating system, as well as control
 characters that are typically unprintable.  This option is useful
 for changing these defaults, either because you are downloading to
 a non-native partition, or because you want to disable escaping of
 the control characters.

 When mode is set to unix, Wget escapes the character `/' and the
 control characters in the ranges 0-31 and 128-159.  This is the
 default on Unix-like OS'es.

 When mode is set to windows, Wget escapes the characters `\',
 `|', `/', `:', `?', `"', `*', `<', `>', and the control characters
 in the ranges 0-31 and 128-159.  In addition to this, Wget in
 Windows mode uses `+' instead of `:' to separate host and port in
 local file names, and uses `@' instead of `?' to separate the
 query portion of the file name from the rest.  Therefore, a URL
 that would be saved as `www.xemacs.org:4300/search.pl?input=blah'
 in Unix mode would be saved as
 `www.xemacs.org+4300/[EMAIL PROTECTED]' in Windows mode.  This
 mode is the default on Windows.

 If you append `,nocontrol' to the mode, as in `unix,nocontrol',
 escaping of the control characters is also switched off.  You can
 use `--restrict-file-names=nocontrol' to turn off escaping of
 control characters without affecting the choice of the OS to use
 as file name restriction mode.



Re: OpenVMS URL

2004-05-27 Thread Tony Lewis
Hrvoje Niksic wrote:

 Wget could always support a URL parameter, such as:

 wget 'ftp://server/dir1/dir2/file;disk=foo'


Assuming you can detect a VMS connection, why not simply
ftp://server/foo:[dir1.dir2]?

Tony



Re: OpenVMS URL

2004-05-26 Thread Tony Lewis
How do you enter the path in your web browser?
- Original Message - 
From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 7:32 AM
Subject: OpenVMS URL



I am trying to use wget to retrieve a file from an OpenVMS server but have
been unable to make wget process a path with a volume name in it.  For
example:

disk:[directory.subdirectory]filename

How would I go about entering this type of path in a way that wget can
understand?





