Re: OpenVMS URL

2004-05-26 Thread Tony Lewis
Then your problem isn't with wget. Once you figure out how to access the
file in a web browser, use the same URL in wget.
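
For example, a purely hypothetical host and credentials, assuming the server
will accept an ordinary slash-separated FTP URL the way a browser would
(many OpenVMS FTP servers do, relative to the login directory):

wget "ftp://username:password@vmshost.example.com/directory/subdirectory/filename"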

Tony
- Original Message - 
From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED]
To: Tony Lewis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 8:41 AM
Subject: RE: OpenVMS URL



That's the problem I'm having.  With all the looking and reading I've done I
haven't found a way to specify the type of pathname I used as an example
(disk:[directory.subdirectory]filename) as a URL for a browser or anything
else that requires a URL to retrieve things over ftp.

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 11:08 AM
To: Bufford, Benjamin (AGRE); [EMAIL PROTECTED]
Subject: Re: OpenVMS URL


How do you enter the path in your web browser?
- Original Message -
From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 7:32 AM
Subject: OpenVMS URL



I am trying to use wget to retrieve a file from an OpenVMS server but have
been unable to make wget process a path with a volume name in it.  For
example:

disk:[directory.subdirectory]filename

How would I go about entering this type of path in a way that wget can
understand?






Re: trouble with encoded filename

2004-04-08 Thread Tony Lewis
[EMAIL PROTECTED] wrote:

 Well, I found out a little bit more about the
 real reason for the problem. Opera has a very
 convenient option called Encode International
 Web Addresses with UTF-8. When I had this
 option checked, it could retrieve the file
 without problems. Without this option enabled,
 I get the same forbidden response that I
 received when I used wget.

 In my never-humble opinion, wget needs this
 ability also. I had hoped that using the option
 --restrict-file-names=nocontrol would have
 disabled encoding of the URL, but apparently,
 it does not.

Huh?

Opera is doing special encoding for some types of web addresses and you
hoped that disabling ALL encoding would somehow make wget do the same thing?

If special encoding is required: 1) someone has to write the code in wget to
perform that encoding and 2) it has to be ENabled (not DISabled).
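
That said, if you work out the UTF-8, percent-encoded form of the address by
hand, you can pass it to wget directly and no special support is needed. A
made-up example (the host and the encoded path are purely illustrative):

wget "http://www.example.com/%E3%83%86%E3%82%B9%E3%83%88.html"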

Tony



Re: Problem Accessing FTP Site Where Password Contains @

2004-03-09 Thread Tony Lewis
[EMAIL PROTECTED] wrote:
 I came across a problem accessing an FTP site where
 the password contained a @ sign.  The password was
 [EMAIL PROTECTED]  So I tried the following:
 
 wget -np --server-response -H --tries=1 -c
 --wait=60 --retry-connrefused -R *
 ftp://guest:[EMAIL PROTECTED]@83.21.191.254:21/document.rar

Try ftp://guest:1nDi:[EMAIL PROTECTED]:21/document.rar

In other words: username COLON password AT-SIGN address
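
If a password (or username) contains characters such as @ or :, it may also
help to percent-encode them so there is no ambiguity; a sketch with a made-up
password:

wget "ftp://guest:p%40ssword@83.21.191.254:21/document.rar"

Here %40 stands for the @ inside the password, and the remaining @ separates
the credentials from the address.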


Re: not downloading at all, help

2004-02-12 Thread Tony Lewis
Juhana Sadeharju wrote:

 I placed use_proxy = off to .wgetrc (which file I did not have earlier)
 and to ~/wget/etc/wgetrc (which file I had), and tried
   wget --proxy=off http://www.maqamworld.com
 and it still does not work.

 Could there be some system wgetrc files somewhere? I have compiled
 wget on my own to my home directory, and certainly wish that my own
 installation does not use files of some other installation.

 Why did you think the :80 comes from proxy? I have always thought
 it comes from the target site, not from our site. Did you try the
 given command yourself and it worked? Please try now if you did not.

 If wget puts the :80 there, then how do I instruct wget not to do that
 no matter what is told somewhere? What part of the source code should I
 edit if that is the only thing that helps?

 Though, you should fix this in the wget source because something is
 not working now. I wonder why this non-working behaviour is set as the
 default in wget...

In the communications world where two computers are talking to one another,
there is no such thing as http://www.maqamworld.com or
http://www.maqamworld.com:80. Those are simply convenient (and readable)
notations for the human beings that use the computers. Run wget with the -d
option and you will see how the computers break all that down:

DEBUG output created by Wget 1.9-beta1 on linux-gnu.

--08:23:18--  http://www.maqamworld.com/
   => `index.html'
Resolving www.maqamworld.com... 66.48.76.90
Caching www.maqamworld.com => 66.48.76.90
Connecting to www.maqamworld.com[66.48.76.90]:80... connected.
Created socket 3.
Releasing 0x81164d8 (new refcount 1).
---request begin---
GET / HTTP/1.0
User-Agent: Wget/1.9-beta1
Host: www.maqamworld.com
Accept: */*
Connection: Keep-Alive

wget does a domain name look up on www.maqamworld.com and finds that it
resides at 66.48.76.90. It then opens a socket to that IP address on port
80. (Port 80 is the default port for the HTTP protocol specified in the
first part of the URL -- it's what is used by almost all web sites and
browsers).

Now that the connection is made between your computer and the server, wget
sends a GET request (part of the HTTP protocol) to the server. Included in
that request is the name of the site being retrieved ("Host:
www.maqamworld.com"), but the port number is never sent by wget to the
server.
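
You can see for yourself that the :80 is only a human-readable way of naming
the default port; the two commands below should produce essentially the same
request on the wire:

wget -d http://www.maqamworld.com/
wget -d http://www.maqamworld.com:80/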

By the way, when I ran this, wget created an index.html file that looks
reasonable to me. It is 23,335 bytes long and is identical to what I get if
I do a View Source in my browser and save the text file.

Run the following command and send the output to the list if you continue to
have problems: wget http://www.maqamworld.com -d

Tony



Re: Startup delay on Windows

2004-02-08 Thread Tony Lewis
Hrvoje Niksic wrote:

 Does anyone have an idea what we should consider the home dir under
 Windows, and how to find it?

On Windows 2000 and XP, there are two environment variables that together
provide the user's home directory. (It may go back further than that, but I
don't have any machines running older OS versions to confirm that.) For
example, on my Windows XP machine, I have the following variables:

HOMEDRIVE=C:
HOMEPATH=\Documents and Settings\Tony Lewis

so my home directory is C:\Documents and Settings\Tony Lewis
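
If you want to check the combined value on your own machine, a quick test at
a Windows command prompt is:

echo %HOMEDRIVE%%HOMEPATH%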

HTH,

Tony



Re: passing a login and password

2004-01-06 Thread Tony Lewis
robi sen wrote:

 Hi I have a client who basically needs to regularly grab content from
 part of their website and mirror it and/or save it so they can
 disseminate it as HTML on a CD.  The website though is written in
 ColdFusion and requires application-level authentication, which is just
 form vars passed to the system, one called login and one called password.

 Is there a way to do this with wget?  If not, I suspect I can add to the
 application's security something that looks for the login and password in
 the URL; then I could just make sure to append that to the URL that is
 given to wget.

Later versions of wget support posting of forms. Try: wget
http://www.yourclient.com/somepage.html --post-data='login=user&password=pw'
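
If the site sets a session cookie after the login form is posted (which is
typical for ColdFusion applications, though I'm guessing at the URLs and
field names here), something along these lines may work:

wget --save-cookies=cookies.txt --post-data='login=USER&password=PASS' http://www.yourclient.com/login.cfm
wget --load-cookies=cookies.txt --mirror --no-parent http://www.yourclient.com/protected/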

Tony



Re: IPv6 support of wget v 1.9.1

2003-12-31 Thread Tony Lewis
Kazu Yamamoto wrote:

 Since I have experience modifying IPv4-only programs, including FTP
 and HTTP, into IPv6/IPv4 ones, I know this problem. Yes, some part of
 wget *would* remain protocol dependent.

Kazu, it's been said that a picture is worth a thousand words. Perhaps in
this case, a patch would make your point better.

Happy New Year to all!

Tony



Re: need help

2003-12-30 Thread Tony Lewis
Anurag Jain wrote:

 I am downloading a big bin file (268MB) on our Solaris box using the wget
 command with the http URL of the bin file, which is located on some
 webserver. It starts downloading and after 42% it gives a msg no disk
 space available and stops, although when I check on the server a lot
 more free space is available.

How much free space is available on the partition where wget is writing the
file? That's the most likely place that you're running out of free space.
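
On Solaris (or most Unix systems) you can check the free space on that
partition with something like:

df -k .

run from the directory where wget is writing the file.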

Tony



Re: IPv6 support of wget v 1.9.1

2003-12-25 Thread Tony Lewis
Kazu Yamamoto wrote:

 Thank you for supporting IPv6 in wget v 1.9.1. Unfortunately, wget v
 1.9.1 does not work well, at least, on NetBSD.

 NetBSD does not allow the use of IPv4-mapped IPv6 addresses for security
 reasons. To learn the background of this, please refer to:

http://www.ietf.org/internet-drafts/draft-cmetz-v6ops-v4mapped-api-harmful-01.txt

I don't pretend to know much about IPv6, but the document you're quoting is
an Internet Draft that says, "we don't think you should implement part of RFC
3493."

However, according to RFC 3493, the recommendations it makes have been
incorporated into POSIX:
 IEEE Std. 1003.1-2001 Standard for Information Technology --
 Portable Operating System Interface (POSIX). Open Group
 Technical Standard: Base Specifications, Issue 6, December 2001.
 ISO/IEC 9945:2002.  http://www.opengroup.org/austin

It seems to me that whether RFC 3493 is correct or not is something that
should be fought in the standards bodies, not in applications like wget.

Tony



Re: IPv6 support of wget v 1.9.1

2003-12-25 Thread Tony Lewis
YOSHIFUJI Hideaki wrote:

 NetBSD etc. are NOT RFC compliant here; however, it would be better if one
 supports wider platforms / configurations.
 My patch is a quick hack, but I believe that it should work for NetBSD
 and FreeBSD 5. Please consider applying it.

It's not my call as to whether your patch gets applied or not, but it seems
to me that anything that supports systems that are not RFC compliant should
be enabled from a command line switch unless you can support those systems
without having a negative impact on compliant systems. Perhaps your patch
does that, but I don't know enough about IPv6 to try to make that
assessment.

Tony



Re: fork_to_background() on Windows

2003-12-21 Thread Tony Lewis
Gisle Vanem wrote:

 I've searched google and the only way AFAICS to get redirection
 in a GUI app to work is to create 3 pipes. Then use a thread (or
 run_with_timeout with infinite timeout) to read/write the console
 handles to put/get data into/from the parent's I/O handles. I don't
 fully understand how yet, but it could get messy.

 Just for the sake of running Wget in the background, it doesn't
 seem to be worth it. Unless someone else has a better idea.

I agree. One can always open new command windows for wget or another
application to run in.

Tony



Re: wget Suggestion: ability to scan ports BESIDE #80, (like 443) Anyway Thanks for WGET!

2003-12-07 Thread Tony Lewis
- Original Message - 
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, December 07, 2003 8:04 AM
Subject: wget Suggestion: ability to scan ports BESIDE #80, (like 443)
Anyway Thanks for WGET!

What's wrong with wget https://www.somesite.com ?



Re: question

2003-12-03 Thread Tony Lewis
Danny Linkov wrote:

 I'd like to download recursively the content of a web
 directory WITHOUT AN INDEX file.

What shows up in your web browser if you enter the directory (such as
http://www.somesite.com/dir/)?

The most common responses are:
* some HTML file selected by the server (often index.html, but not always)
* an HTML listing of the directory contents generated by the server
* a 403 Forbidden response

In the first two cases, you can use:

 wget http://www.somesite.com/dir/ --mirror

In the third case, you cannot grab the entire directory using wget. You will
have to construct a list of filenames in a file and then use:

wget --input-file=FILE

Note that the files that are retrieved in this fashion will appear in the
current directory.
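
A rough sketch of how that might look (the file names are made up):

cat > files.txt <<'EOF'
http://www.somesite.com/dir/file1.pdf
http://www.somesite.com/dir/file2.pdf
EOF
wget --input-file=files.txt --directory-prefix=dir

The --directory-prefix (-P) option is optional; it just keeps the downloads
out of your current directory.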

Hope that helps. By the way, there is a newer version of wget available
(although my answer is the same for 1.8.2 and 1.9).

Tony



Re: a problem on wgetting a PNG image

2003-11-27 Thread Tony Lewis
[EMAIL PROTECTED] wrote:

 I am not sure if this is a bug, but it's really out of my expectation.

 Here is the way to reproduce the problem.

 1. Put the URL http://ichart.yahoo.com/b?s=CSCO into the browser and
 then drag out the image. It should be a file with .png extension. So I
 believe this is PNG. (is it?!?)

 2. wget -O csco.img "http://ichart.yahoo.com/b?s=CSCO"

 Originally, I thought that the files obtained in either way should be
 the same. But the fact is that they are of different sizes. From the
 2nd file's header, it's a gif instead. This is the bigger file.

 Does it mean that wget has transformed the original file? Or did I
 miss any step?

The server changes its behavior based on the user agent. The following will
get back an image/gif:
wget http://ichart.yahoo.com/b?s=CSCO

while the following will get back an image/png:
wget http://ichart.yahoo.com/b?s=CSCO -U "Mozilla/4.0 (compatible)"

Tony



Re: can you authenticate to a http proxy with a username that contains a space?

2003-11-25 Thread Tony Lewis
antonio taylor wrote:

 http://fisrtname lastname:[EMAIL PROTECTED]

Have you tried http://fisrtname%20lastname:[EMAIL PROTECTED] ?
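
If the quoting gets awkward, it may also be worth trying wget's own proxy
options instead of embedding the name in the URL (assuming your build
supports them; the proxy host below is made up):

http_proxy=http://proxy.example.com:8080/ wget --proxy-user='fisrtname lastname' --proxy-passwd=PASSWORD http://www.example.com/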



Re: feature request: --second-guess-the-dns

2003-11-18 Thread Tony Lewis
Hrvoje Niksic wrote:

 Have you seen the rest of the discussion?  Would it do for you if Wget
 correctly handled something like:
 
 wget --header='Host: jidanni.org' http://216.46.192.85/

I think that is an elegant solution.

Tony


Re: Does HTTP allow this?

2003-11-10 Thread Tony Lewis
Hrvoje Niksic wrote:

 Assume that Wget has retrieved a document from the host A, which
 hasn't closed the connection in accordance with Wget's keep-alive
 request.

 Then Wget needs to connect to host B, which is really the same as A
 because the provider uses DNS-based virtual hosts.  Is it OK to reuse
 the connection to A to talk to B?
snip
 FWIW, it works fine with Apache.

There is a fairly high probability that it will work with most hosts
(regardless of the server software). If an IP address has been registered
with multiple hosts, then the address alone is not sufficient to retrieve a
resource so you have to add a Host header.

It's possible that the server responding to the IP address forwards
connections to multiple backend servers. These backend servers may or may
not know about all the resources that the gateway server knows about.

Since it will work most of the time, I think it's a reasonable optimization
to use, however you might want to add a --one-host-per-connection flag for
the rare cases where the current behavior won't work.

Tony



Re: Does HTTP allow this?

2003-11-10 Thread Tony Lewis
Hrvoje Niksic wrote:

 The thing is, I don't want to bloat Wget with obscure options to turn
 off even more obscure (and *very* rarely needed) optimizations.  Wget
 has enough command-line options as it is.  If there are cases where
 the optimization doesn't work, I'd rather omit it completely.

It's probably safest to turn off that optimization even if it does eliminate
a few opens now and then.

Tony



Re: The patch list

2003-11-04 Thread Tony Lewis
Hrvoje Niksic wrote:


 I'm curious... is anyone using the patch list to track development?
 I'm posting all my changes to that list, and sometimes it feels a lot
 like talking to myself.  :-)

I read the introductory stuff to see what's changed, but I never extract the
patches from the messages. From my perspective, the introductory stuff plus
a list of affected files would be sufficient.

Tony



Re: Wget 1.8.2 bug

2003-10-17 Thread Tony Lewis
Hrvoje Niksic wrote:

 Incidentally, Wget is not the only browser that has a problem with
 that.  For me, Mozilla is simply showing the source of
 http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set, because
 the returned content-type is text/plain.

On the other hand, Internet Explorer will treat lots of content types as
HTML if the content starts with <html>.

To see for yourself, try these links:
http://www.exelana.com/test.cgi
http://www.exelana.com/test.cgi?text/plain
http://www.exelana.com/test.cgi?image/jpeg

Perhaps we can add an option to wget so that it will look for an <html> tag
in plain text files?

Tony



Re: wget downloading a single page when it should recurse

2003-10-17 Thread Tony Lewis
Philip Mateescu wrote:

 A warning message would be nice when for not so obvious reasons wget
 doesn't behave as one would expect.

 I don't know if there are other tags that could change wget's behavior
 (like -r and <meta name=robots> do), but if they happen it would be
 useful to have a message.

I agree that this is worth a notable mention in the wget output. At the very
least, running with -d should provide more guidance on why the links it has
appended to urlpos are not being followed. Buried in the middle of hundreds
of lines of output is:

no-follow in index.php

On the other hand, if other rules prevent a URL from being followed, you
might see something like:

Deciding whether to enqueue "http://www.othersite.com/index.html".
This is not the same hostname as the parent's (www.othersite.com and
www.thissite.com).
Decided NOT to load it.
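
One way to dig those lines out without reading the whole debug log (the host
name is just the one from the example above):

wget -r -d http://www.thissite.com/ 2> wget-debug.log
grep -n -e 'no-follow' -e 'Decided NOT' wget-debug.log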

Tony



Re: Wget 1.9 about to be released

2003-10-16 Thread Tony Lewis
Hrvoje Niksic wrote:

 I'm about to release 1.9 today, unless it takes more time to upload it
 to ftp.gnu.org.
 
 If there's a serious problem you'd like fixed in 1.9, speak up now or
 be silent until 1.9.1.  :-)

I thought we were going to turn our attention to 1.10. :-)


POST followed by GET

2003-10-14 Thread Tony Lewis
I'm trying to figure out how to do a POST followed by a GET.

If I do something like:

wget http://www.somesite.com/post.cgi --post-data 'a=1&b=2' \
http://www.somesite.com/getme.html -d

I get the following behavior:

POST /post.cgi HTTP/1.0
snip
[POST data: a=1&b=2]
snip
POST /getme.html HTTP/1.0
snip
[POST data: a=1&b=2]

Is this what is expected? Is there a way I can coax wget to POST to post.cgi and GET 
getme.html?
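
One workaround, assuming the site tracks the POST with cookies (an assumption
on my part), would be to split it into two invocations:

wget http://www.somesite.com/post.cgi --post-data 'a=1&b=2' --save-cookies=cookies.txt
wget --load-cookies=cookies.txt http://www.somesite.com/getme.html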

Tony

Re: POST followed by GET

2003-10-14 Thread Tony Lewis
Hrvoje Niksic wrote:

 Maybe the right thing would be for `--post-data' to only apply to the
 URL it precedes, as in:

 wget --post-data=foo URL1 --post-data=bar URL2 URL3

snip
 But I'm not at all sure that it's even possible to do this and keep
 using getopt!

I'll start by saying that I don't know enough about getopt to comment on
whether Hrvoje's suggestion will work.

It's hard to imagine a situation where wget's current behavior makes sense
over multiple URLs. I'm sure someone can come up with an example, but it's
likely to be an unusual case. I see the ability to POST a form as being most
useful when a site requires some kind of form-based authentication to
proceed with looking at other pages within the site.

Some alternatives that occur to me follow.

Alternative #1. Only apply --post-data to the first URL on the command line.
(A simple solution that probably covers the majority of cases.)


Alternative #2. Allow POST and GET as keywords in the URL list so that:

wget POST http://www.somesite.com/post.cgi --post-data 'a=1&b=2' GET
http://www.somesite.com/getme.html

would explicitly specify which URL uses POST and which uses GET. If more
than one POST is specified, all use the same --post-data.


Alternative #3. Look for form tags and have --post-file specify the data
to be specified to various forms:

--form-action=URL1 'a=1&b=2'
--form-action=URL2 'foo=bar'


Alternative #4. Allow complex sessions to be defined using a session file
such as:

wget --session=somefile --user-agent='my robot'

Options specified on the command line apply to every URL. If somefile
contained:

--post-data 'data=foo' POST URL1
--post-data 'data=bar' POST URL2
--referer=URL3 GET URL4

It would be logically equivalent to the following three commands:

wget --user-agent='my robot' --post-data 'data=foo' POST URL1
wget --user-agent='my robot' --post-data 'data=bar' POST URL2
wget --user-agent='my robot' --referer=URL3 GET URL4

with wget's state maintained across the session.

Tony



Re: POST followed by GET

2003-10-14 Thread Tony Lewis
Hrvoje Niksic wrote:

 I like these suggestions.  How about the following: for 1.9, document
 that `--post-data' expects one URL and that its behavior for multiple
 specified URLs might change in a future version.

 Then, for 1.10 we can implement one of the alternative behaviors.

That works for me... I can hardly wait for 1.9 to get wrapped up so we can
start working on 1.10.

Hrvoje, has anyone mentioned how glad we are that you've come back?

Tony



Re: How do you pronounce Hrvoje?

2003-10-12 Thread Tony Lewis
Hrvoje and I have had an off-list dialogue about this subject. We've settled
on HUR-voy-eh as the closest phonetic rendition of his name for English
speakers. It helps to remember that the r is rolled.

Tony



How do you pronounce Hrvoje?

2003-10-11 Thread Tony Lewis
I've been on this list for a couple of years now and I've always wondered
how our illustrious leader pronounces his name.

Can you give us linguistically challenged Americans a phonetic rendition of
your name?

Tony Lewis (toe knee loo iss)



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 Please be aware that Wget needs to know the size of the POST data
 in advance.  Therefore the argument to @code{--post-file} must be
 a regular file; specifying a FIFO or something like
 @file{/dev/stdin} won't work.

There's nothing that says you have to read the data after you've started
sending the POST. Why not just read the --post-file before constructing the
request so that you know how big it is?

 My first impulse was to bemoan Wget's antiquated HTTP code which
 doesn't understand chunked transfer.  But, coming to think of it,
 even if Wget used HTTP/1.1, I don't see how a client can send chunked
 requests and interoperate with HTTP/1.0 servers.

How do browsers figure out whether they can do a chunked transfer or not?

Tony



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 I don't understand what you're proposing.  Reading the whole file in
 memory is too memory-intensive for large files (one could presumably
 POST really huge files, CD images or whatever).

I was proposing that you read the file to determine the length, but that was
on the assumption that you could read the input twice, which won't work with
the example you proposed.

 It would be really nice to be able to say something like:

 mkisofs blabla | wget http://burner/localburn.cgi --post-file
 /dev/stdin

Stefan Eissing wrote:

 I just checked with RFC 1945 and it explicitly says that POSTs must
 carry a valid Content-Length header.

In that case, Hrvoje will need to get creative. :-)

Can you determine if --post-file is a regular file? If so, I still think you
should just read (or otherwise examine) the file to determine the length.

For other types of input, perhaps you want to write the input to a temporary
file.
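
For example, reusing the hypothetical mkisofs/burner command from earlier in
the thread:

mkisofs blabla > /tmp/localburn.iso
wget http://burner/localburn.cgi --post-file=/tmp/localburn.iso
rm /tmp/localburn.iso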

Tony



Re: Using chunked transfer for HTTP requests?

2003-10-07 Thread Tony Lewis
Hrvoje Niksic wrote:

 That would work for short streaming, but would be pretty bad in the
 mkisofs example.  One would expect Wget to be able to stream the data
 to the server, and that's just not possible if the size needs to be
 known in advance, which HTTP/1.0 requires.

One might expect it, but if it's not possible using the HTTP protocol, what
can you do? :-)



Re: Web page source using wget?

2003-10-06 Thread Tony Lewis
Suhas Tembe wrote:

 1). I go to our customer's website every day & log in using a User Name &
 Password.
[snip]
 4). I save the source to a file & subsequently perform various tasks on
 that file.

 What I would like to do is automate this process of obtaining the source
 of a page using wget. Is this possible?

That depends on how you enter your user name and password. If it's via
an HTTP user ID and password, that's pretty easy.

wget http://www.custsite.com/some/page.html --http-user=USER --http-passwd=PASS

If you supply your user ID and password via a web form, it will be tricky
(if not impossible) because wget doesn't POST forms (unless someone added
that option while I wasn't looking. :-)

Tony



Re: Option to save unfollowed links

2003-10-01 Thread Tony Lewis
Hrvoje Niksic wrote:

 I'm curious: what is the use case for this?  Why would you want to
 save the unfollowed links to an external file?

I use this to determine what other websites a given website refers to.

For example:
wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ \
  --mirror -np --unfollowed-links=hayward.out

By looking at hayward.out, I have a list of all websites that the directory
refers to. When I use this file, I sort it and throw away the Google and
DMOZ links. Everything else is supposed to be something interesting about
Hayward.
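
The post-processing is nothing fancy; roughly:

sort -u hayward.out | grep -v -e google -e dmoz > interesting.txt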

Tony



Re: Reminder: wget has no maintainer

2003-08-14 Thread Tony Lewis
Daniel Stenberg wrote:

 The GNU project is looking for a new maintainer for wget, as the current
 one wishes to step down.

I think that means we need someone who:

1) is proficient in C
2) knows Internet protocols
3) is willing to learn the intricacies of wget
4) has the time to go through months' worth of email and patches
5) expects to have time to continue to maintain wget

Anyone here think they fit that bill?

(Feel free to add to my suggestions about what kind of person we need.)

Tony



Re: wget problem

2003-07-03 Thread Tony Lewis
Rajesh wrote:

 Wget is not mirroring the web site properly. For eg it is not copying
 symbolic links from the main web server. The target directories do exist
 on the mirror server.

wget can only mirror what can be seen from the web. Symbolic links will be
treated as hard references (assuming that some web page points to them).

If you cannot get there from http://www.sl.nsw.gov.au/ via your browser,
wget won't get the page.

Also, some servers change their behavior depending on the client. You may
need to use a user agent that looks like a browser to mirror some sites. For
example:

wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

will make it look like wget is really Internet Explorer running on Windows
XP.

 Another problem is that some of the files are different on the mirror web
 server. For eg: compare these 2 attached files.

 penrith1.cfm is the file after wget copied it from the main server.
 penrith1.cfm.org is the actual file sitting on the main server.

wget is storing what the web server returned, which may or may not be the
precise file stored on your system.

In particular, I notice that penrith1.cfm contains <!--Requested: 17:30:40
Thursday 3 July 2003 -->. That implies that all or part of the output is
generated programmatically.

You might try using wget to replicate an FTP version of the website.

Then again, perhaps wget is the wrong tool for your task. Have you
considered using secure copy (scp) instead?

HTH,

Tony



Re: wget problem

2003-07-03 Thread Tony Lewis
Rajesh wrote:

 Thanks for your reply. I have tried using the command wget
 --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", but it
 didn't work.

Adding the user agent helps some people -- I think most often with web
servers from the evil empire.

 I have one more question. In each directory I have a welcome.cfm file on
 the main server (DirectoryIndex order is welcome.cfm welcome.htm
 welcome.html index.html). But, when I run wget on the mirror server, wget
 renames welcome.cfm to index.html and downloads it to the mirror server.

 Why does it change the file name from welcome.cfm to index.html?

It appears to me that wget assumes that the result of getting a directory
(such as http://www.sl.nsw.gov.au/collections/) is index.html. (See the
debug output below.)

 How can I mirror a web site using scp?? I can only copy one file at a
 time using scp.

The following works for me: scp [EMAIL PROTECTED]:path/to/directory/* -r


**
The promised debug output:

wget http://www.sl.nsw.gov.au/collections  --debug
DEBUG output created by Wget 1.8.1 on linux-gnu.

--20:16:36--  http://www.sl.nsw.gov.au/collections
   => `collections'
Resolving www.sl.nsw.gov.au... done.
Caching www.sl.nsw.gov.au => 192.231.59.40
Connecting to www.sl.nsw.gov.au[192.231.59.40]:80... connected.
Created socket 3.
Releasing 0x810dc38 (new refcount 1).
---request begin---
GET /collections HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.sl.nsw.gov.au
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 301 Moved Permanently
Date: Fri, 04 Jul 2003 03:16:36 GMT
Server: Apache/1.3.19 (Unix)
Location: http://www.sl.nsw.gov.au/collections/
Connection: close
Content-Type: text/html; charset=iso-8859-1


Location: http://www.sl.nsw.gov.au/collections/ [following]
Closing fd 3
--20:16:37--  http://www.sl.nsw.gov.au/collections/
   => `index.html'
Found www.sl.nsw.gov.au in host_name_addresses_map (0x810dc38)
Connecting to www.sl.nsw.gov.au[192.231.59.40]:80... connected.
Created socket 3.
Releasing 0x810dc38 (new refcount 1).
---request begin---
GET /collections/ HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.sl.nsw.gov.au
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Fri, 04 Jul 2003 03:16:37 GMT
Server: Apache/1.3.19 (Unix)
Connection: close
Content-Type: text/html; charset=iso-8859-1


Length: unspecified [text/html]

[ <=>                                 ] 21,284        20.83K/s

Closing fd 3
20:16:38 (20.83 KB/s) - `index.html' saved [21284]



wget is smarter than Internet Explorer!

2003-06-13 Thread Tony Lewis
I tried to retrieve a URL with Internet Explorer and it continued to
retrieve the URL forever. I tried to grab that same URL with wget, which
tried twice and then reported "redirection cycle detected".

Perhaps we should send the wget code to someone in Redmond.

Tony



Re: Comment handling

2003-06-05 Thread Tony Lewis
Aaron S. Hawley wrote:

 why not just have the default wget behavior follow comments explicitly
 (i've lost track whether wget does that or needs to be amended) /and/
 have an option that goes /beyond/ quirky comments and is just
 --ignore-comments ? :)

The issue we've been discussing is what to do about things that almost
follow the rules for HTML comments, but don't quite get it right. By
default, wget ignores legitimate HTML comments.

Tony



Re: Comment handling

2003-06-05 Thread Tony Lewis
Aaron S. Hawley wrote:

 i'm just saying what's going to happen when someone posts to this list:
 My Web Pages have [insert obscure comment format] for comments and Wget
 is considering them to (not) be comments.  Can you change the [insert
 Wget comment mode] comment mode to (not) recognize my comments?

One way to implement quirky comments is to allow the user to add their own
comment format to the wgetrc file.

Tony



Re: Comment handling

2003-06-03 Thread Tony Lewis
Georg Bauhaus wrote:


 I don't think so. Actually the rules for SGML comments are
 somewhat different.

Georg, I think we're talking about apples and oranges here. I'm talking
about what is legitimate in a comment in an SGML document. I think you're
talking about what is legitimate as a comment in an SGML declaration.

At any rate, I decided to do some more poking around. I wrote a web page
(see http://www.exelana.com/comments.html) with the following variations on
comments:
<!-- Comment -->
<!-- -- -->
<!
<!

The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of
them. I also tried the W3C Markup Validation Service at
http://validator.w3.org/

It reported that the last one is not valid:

Line 22 column 8: comment started here
<!

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.exelana.com%2Fcomments.html&doctype=HTML+2.0&charset=us-ascii+%28basic+English%29

The moral of the story: one cannot evaluate an HTML document solely on what
any browser (or even all of them) do with it.

Tony



Re: Comment handling

2003-06-01 Thread Tony Lewis
George Prekas wrote:

 You are probably right. I have pointed this out because I have seen pages
 that use as a separator <!-- with lots of dashes, and although
 Internet Explorer shows the page, wget can not download it correctly. What
 do you think about finishing the comment at the >?

After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am
convinced that the <!-- ... with extra dashes ... --> form described above is
a valid SGML (and therefore HTML) comment. Therefore, I believe it is a bug
if wget does not recognize such a comment.

Note: I haven't studied the source to confirm how it handles such a string.

Tony



Re: Comment handling

2003-05-31 Thread Tony Lewis
George Prekas wrote:


 I have found a bug in Wget version 1.8.2 concerning comment handling
 ( <!-- comment --> ). Take a look at the following illegal HTML code:
 <HTML>
 <BODY>
 <a href=test1.html>test1.html</a>
 <!-->
 <a href=test2.html>test2.html</a>
 <!-->
 </BODY>
 </HTML>

 Now, save the above snippet as test.html and try wget -Fi test.html. You
 will notice that it doesn't recognise the second link. I have found a
 solution to the above situation and have properly patched html-parse.c and
 I would like some info on how I can give you the patch.

The HTML code is legitimate, but it only contains one link. The following
three lines constitute a single comment:

<!-->
<a href=test2.html>test2.html</a>
<!-->

A comment begins at <!-- and ends at -->. The trailing > on the first
of these lines and the leading <! on the third of these lines are part of
the comment. That is, the comment text is:

>
<a href=test2.html>test2.html</a>
<!

At any rate, one should not expect predictable behavior for broken HTML.
What should wget do with the following?

<a href=test1.html>test1.html
<!-->
</a>
<!-->

In one version, it might choose to follow the link to test1.html and in
another version it might not.

Tony



Re: Cannot get wildcards to work ??

2003-03-28 Thread Tony Lewis
Dick Penny wrote:


  I have just successfully used WGET on a single file download. I even
  figured out how to specify a destination.  But, I cannot seem to get
  wildcards to work. Help please:
  wget -o log.txt -P c:/Documents and Settings/Administrator/My
  Documents/CME_data/bt
  ftp://ftp.cme.com/pub/bulletin/historical_data/bt02100?.zip

You requested the resource "bt02100" with a query string of ".zip". You
might just as easily have asked for bt02100.cgi?extension=zip. When it
appears in a URL, the question mark is not a wildcard character; it is a
separator between the resource and the query string.

Chances are very good that the server doesn't have a clue how to process
such a request.
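
wget does have its own FTP globbing (separate from shell wildcards), so a
pattern built with * may get past this ambiguity; a sketch, with the URL
quoted so the shell leaves it alone and the destination shortened:

wget -o log.txt -P bt "ftp://ftp.cme.com/pub/bulletin/historical_data/bt02100*.zip"

Whether that works still depends on the server allowing directory listings.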

Tony



Re: Static Mirror of DB-Driven Site

2003-03-17 Thread Tony Lewis
Dan Mahoney, System Admin wrote:

 Assume I have a site that I want to create a static mirror of.  Normally
 this site is database driven, but I figure if I spider the entire site,
 and map all the GET URLS to static urls I can have a full mirror.  Has
 anyone known of this being successfully done?  How would I get apache to
 see the page names as full names (for example a page named
  exec.pl?name=blah&foo=bar actually being a file rather than a command?)

Wget should already do what you want (provided that the file system where
you will be mirroring the results can handle things like ?, =, and &
in a file name). Wget does not care how Apache processes a URL; it only
cares that when it does a GET of a URL that some object is returned.

The issue for you will be making sure that all the things you want to mirror
are referenced as links on the site. How does a person visiting your site
know that blah is a valid value for name or that bar is a valid value
for foo? If they learn this by clicking on a link, then everything should
work as you want.

However, if the user must supply the value for name and foo (perhaps by
entering them in a form) then there is no way for wget to know those values.
If that is the case, you will have to construct your own list of URLs with
all the combinations of name and foo that you want to mirror.
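
Generating that list can be as simple as a small shell loop; the host,
script name and values below are all placeholders:

for name in blah blah2; do
  for foo in bar baz; do
    echo "http://www.example.com/exec.pl?name=$name&foo=$foo"
  done
done > urls.txt
wget --input-file=urls.txt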

HTH.

Tony



Re: conditional url encoding

2003-02-22 Thread Tony Lewis
Ryan Underwood wrote:

 It seems that some servers are broken and in order to fetch files with
 certain filenames, some characters that are normally encoded in HTTP
 sequences must be sent through unencoded.  For example, I had a server the
 other day that I was fetching files from at the URL:
 http://server.com/~foobar/files

I'm having a hard time figuring out why wget is encoding the tilde in the
first place. The way I read RFC 2396, tilde is one of several marks that
are not encoded. The complete set of marks defined in RFC 2396 is
-_.!~*'().

Perhaps the encoding rules in wget were written prior to the publication of
RFC 2396 and are based on the national character discussion of RFC 1630.
If so, tilde is the only character that was defined as national in RFC
1630 and as a mark in RFC 2396.

For what it's worth, the national characters in RFC 1630 are {}|[]\^~.

Tony



Re: image tags not read

2003-01-04 Thread Tony Lewis
Johannes Berg wrote:

 Maybe this isn't really a bug in wget but rather in the file, but since
 this is standard as exported from MS Word I'd like to see wget recognize
 the images and download them.

Microsoft Word claims to create a valid HTML file. In fact, what it creates
can only reliably be read by Internet Explorer. (It may even only be read by
recent versions of Internet Explorer.) The file that it produces contains a
number of proprietary tags as well as proprietary variations of standard
HTML that only Microsoft understands.

wget has a simple HTML parser that cannot understand these variations. While
there may be someone who is interested in patching the wget parser to deal
with Word's pseudo-HTML, I doubt that such changes would ever become part of
a standard wget release.

You might have better luck finding someone who is willing to write a program
to convert Word's pseudo-HTML into real HTML that can be read by most HTML
parsers. Since you're in an academic setting, your odds of finding someone
willing to do this kind of program might be higher. Good luck.

Tony




Re: ralated links in javascripts script

2002-12-16 Thread Tony Lewis
cyprien wrote:

 I want to mirror my homesite; everything works fine except one thing:
 my site is a photo site based on php scripts: gallery
 (http://gallery.sourceforge.net)
 it also has some javascript scripts...
[snip]
 what can i do to have that (on mirror site) :

You cannot because wget does not parse JavaScript. It only finds links in
the HTML.

Tony




Re: newbie doubts

2002-12-04 Thread Tony Lewis
Nandita Shenvi wrote:

 I have not copied the whole script but just the last few lines. The
 variable $all_links[3] has a URL:
 http://bolinux39.europe.nokia.com/database2/MIDI100/GS001/01FINALC.MID.
 the link follows a file, which I require.
 I remove the http:// before calling wget, but I still get an error
 message:

 --13:56:24--

http://%20bolinux39.europe.nokia.com/database2/MIDI100/GS001/01FINALC.MID%0A
=> `wgetcheck'

$all_links[3] needs to be cleaned up. It contains a trailing \n and there
is a space between "http://" and "bolinux39" that should not be there.

The \n is easily addressed by: chomp @all_links;

You should look at your script to determine how the space got there. You can
get rid of all spaces by: $all_links[3] =~ s/ //g;
but that may not be what you want. You're better off figuring out how the
unwanted space got there in the first place and making sure it doesn't.

Tony




Re: Virus messages .....

2002-05-06 Thread Tony Lewis

Frank Helk wrote:

 Free (web based) scanning is available at http://www.antivirus.com.
 Select Free tools in the top menu and then Scan Your PC, Free from
 the list. You'll not even have to register to use it. Please.

It may not be so simple. Klez uses anti-anti-virus techniques to prevent
itself from being detected or deleted. It took me several days to figure out
how to eradicate it from my computer. The only way I was able to get rid of
it was by booting my computer in Safe mode and installing the anti-virus
software from a CD. If one runs something that has been downloaded from the
Internet, there is a strong possibility that Klez will infect it before it
can do its detection or cleaning.

I agree with Frank. If you're running Windows (particularly if you're using
Outlook and/or have opened attachments sent to wget lists recently), you
need to scan (and probably clean) your computer.

Tony




Re: (Extended) Reading commandline option values from files or file descriptors (for wget v1.8.1)

2002-04-29 Thread Tony Lewis

Herold Heiko wrote:

 It would be better imho if the options themselves are modified; in that
 case the variable option wouldn't be necessary. Supposing we keep the @
 and :, this could be
 --@http-passwd=passwd.txt --:proxy-passwd=0

It seems to me that a convention like this should be adopted (or rejected)
across a wide range of applications. Is there a GNU-wide mailing list where
this could be proposed and discussed?

Tony




Re: Virus mails

2002-04-27 Thread Tony Lewis

Brix Lichtenberg wrote:

 But I'm still getting three or more virus mails with attachments 100k+
 daily from the wget lists and they're blocking my mailbox (dial-up). And
 getting those dumb system warnings accompanying them doesn't make it
 better. Isn't there really no way to stop that (at least disallow
 attachments)? Patches and such still can be pasted into the text,
 can't they?

I agree. Why not treat any mail with an attachment as suspect? Let the
moderators approve any valid messages.

Tony




Re: wget does not honour content-length http header [http://bugs.debian.org/143736]

2002-04-25 Thread Tony Lewis

Hrvoje Niksic wrote:

 If your point is that Wget should print a warning when it can *prove*
 that the Content-Length data it received was faulty, as in the case of
 having received more data, I agree.  We're already printing a similar
 warning when Last-Modified is invalid, for example.

I'm afraid you'll have to ask R. Fielding, J. Gettys, J. Mogul, H. Frystyk,
and T. Berners-Lee what they were thinking. <grin> I was just quoting from
RFC 2068: Hypertext Transfer Protocol -- HTTP/1.1

As for printing a warning only when wget can prove that the Content-Length
data was faulty: that sounds like a reasonable implementation to me.

Tony




Re: apache irritations

2002-04-22 Thread Tony Lewis

Maciej W. Rozycki wrote:

  Hmm, it's too fragile in my opinion.  What if a new version of Apache
 defines a new format?

I think all of the expressions proposed thus far are too fragile. Consider
the following URL:

http://www.google.com/search?num=100&q=%2Bwget+-GNU

The regular expression needs to account for multiple arguments separated by
ampersands. It also needs to account for any valid URI character between an
equal sign and either end of string or an ampersand.

I'm not fluent enough in regular expressions to compose one myself. (Some
day I'll absorb all of Friedl's Mastering Regular Expressions, but not
today.)

Tony




Re: apache irritations

2002-04-22 Thread Tony Lewis

Maciej W. Rozycki wrote:

  I'm not sure what you are referring to.  We are discussing a common
 problem with static pages generated by default by Apache as index.html
 objects for server's filesystem directories providing no default page.

Really? The original posting from Jamie Zawinski said:

 I know this would be somewhat evil, but can we have a special case in
 wget to assume that files named ?N=D and index.html?N=D are the same
 as index.html?  I'm tired of those dumb apache sorting directives
 showing up in my mirrors as if they were real files...

I understood the question to be about URLs containing query strings (which
Jamie called sorting directives) showing up as separate files. I thought the
discussion was related to that topic. Maybe it diverged from that later in
the chain and I missed the change of topic.

I think what Jamie wants is one copy of index.html no matter how many links
of the form index.html?N=D appear.

  BTW, wget's accept/reject rules are not regular expressions but simple
 shell globbing patterns.

OK.

Tony




Re: HTTP 1.1

2002-04-12 Thread Tony Lewis

Hrvoje Niksic wrote:

  Is there any way to make Wget use HTTP/1.1 ?

 Unfortunately, no.

In looking at the debug output, it appears to me that wget is really sending
HTTP/1.1 headers, but claiming that they are HTTP/1.0 headers. For example,
the Host header was not defined in RFC 1945, but wget is sending it.

Tony




Re: Current download speed in progress bar

2002-04-09 Thread Tony Lewis

Hrvoje Niksic wrote:

 The one remaining problem is the ETA.  Based on the current speed, it
 changes value wildly.  Of course, over time it is generally
 decreasing, but one can hardly follow it.  I removed the flushing by
 making sure that it's not shown more than once per second, but this
 didn't fix the problem of unreliable values.

I'm often annoyed by ETA estimates that make no sense. How about showing two
values -- something like:

ETA at average speed: 1:05:17
ETA at current speed: 15:05

Then the user can decide which value is more meaningful. In addition, it
gives feedback about the current speed versus the average.

Tony




Re: Current download speed in progress bar

2002-04-09 Thread Tony Lewis

Hrvoje Niksic wrote:

  I'll grab the other part and explain what curl does. It shows a
  current speed based on the past five seconds,

 Does it mean that the speed doesn't change for five seconds, or that
 you always show the *current* speed, but relative to the last five
 seconds?  I may be missing something, but I don't see how to efficiently
 implement the latter.

Could you keep an array of speeds that is updated once a second, such that
the value from six seconds ago is discarded when the value for the
second that just ended is recorded?

Tony




Re: Referrer Faking and other nifty features

2002-04-03 Thread Tony Lewis

Andre Majorel wrote:

  Yes, that allows me to specify _A_ referrer, like www.aol.com.  When I'm
  trying to help my users mirror their old angelfire pages or something
  like that, very often the link has to come from the same directory.  I'd like
  to see something where when wget follows a link to another page, or
  another image, it automatically supplies the URL of the page it followed
  to get there.  Is there a way to do this?

 Somebody already asked for this and AFAICT, there's no way to do
 that

Not only is it possible, it is the behavior (at least in wget 1.8.1). If you
run with -d, you will see that every GET after the first one includes the
appropriate referer.

If I execute: wget -d -r http://www.exelana.com --referer=http://www.aol.com

The first request is reported as:
GET / HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.aol.com

But, the third request is:
GET /left.html HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.exelana.com/

The second request is for robots.txt and uses the referer from the command
line.

Tony




Re: wget parsing JavaScript

2002-03-27 Thread Tony Lewis

Ian Abbott wrote:

 For example, a recursive retrieval on a page like this:

 <html>
   <body>
 <script>
   <a href=foo.html>foo</a>
 </script>
   </body>
 </html>

 will retrieve foo.html, regardless of the <script>...</script>
 tags.

We seem to be talking about two completely different things, Ian. A page
that looks like this:

<html>
<head>
<script>
top.location = "foo.html";
</script>
</head>
<body>
This page transfers to foo.
</body>
</html>

won't retrieve foo.html.

That's what I have been trying to get across.

Tony




Re: wget parsing JavaScript

2002-03-26 Thread Tony Lewis

Csaba Ráduly wrote:

 I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to
 parse whatever is inside it.
 Why was this implemented? JavaScript is mostly
 used to construct links programmatically. wget is likely to find
 bogus URLs until it can properly parse JavaScript.

wget is parsing the attributes within the script tag, i.e., <script
src=url>. It does not examine the content between <script> and </script>.

It looks for src=url because the source file is just another file that may
need to be copied (along with all the other files that are needed to mirror
a site).

Tony





Re: wget parsing JavaScript

2002-03-26 Thread Tony Lewis

I wrote:

  wget is parsing the attributes within the script tag, i.e., <script
  src=url>. It does not examine the content between <script> and
  </script>.

and Ian Abbott responded:

 I think it does, actually, but that is mostly harmless.

You're right. What I meant was that it does not examine the JavaScript
looking for URLs.

Tony




Re: (Fwd) Automatic posting to forms

2002-03-08 Thread Tony Lewis

Daniel Stenberg responded to my original suggestion:

  With this information, any time that wget encounters a form whose action
  is /cgi-bin/auth.cgi, it will enqueue the submission of the form using
  the values provided for the fields id and pw.

 Now, why would wget do this?

There are many examples of sites that require the user to post a form to
access other parts of the site -- sometimes the post contains user-supplied
data and sometimes it doesn't. If one wants to grab everything on the other
side of that form, having wget post the form seems like the way to get
there.

 Yes, probably: when the form tag contains enctype='multipart/form-data'
 you need to build an entirely different data stream (RFC1867 is the key
 here).

You're right. I had not yet thought about that flavor of posting.

 I'd also like to point out that curl already supports both regular HTTP
 POST as well as multipart formposts.

Unless I'm misreading the curl manual, that only allows me to get one page.
However, I've never been inclined to invent wheels when I can download them.
I will study the curl source code related to posting before I wander too far
down this path.

Tony




