Wget web site broken

2004-12-03 Thread Aaron S. Hawley
There are old links to Wget on the web pointing to:
http://www.gnu.org/software/wget/wget.html

The FSF people have a nice symlink system for package web sites.  Simply
add a file called `.symlinks' to Wget's CVS web repository with the
following line:

index.html wget.html

Or rename the file to wget.html and reverse the order of the above line.

Thanks,
/a


Re: wget 1.9.1

2004-10-18 Thread Aaron S. Hawley
On Mon, 18 Oct 2004, Gerriet M. Denkmann wrote:

 So - is this a bug, did I misunderstand the documentation, did I use
 the wrong options?

Reasonable request.  You just couldn't find the archives:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg06626.html

more:
http://google.com/search?q=site%3Amail-archive.com+wget+css

/a


Re: Missing precis man wget info about -D

2004-09-03 Thread Aaron S. Hawley
at the bottom of the man page it says:

SEE ALSO
   GNU Info entry for wget.

this is a cryptic suggestion to type the following on the command-line:

info wget

this will give you the GNU Wget user manual where you'll find clear
examples.

info is the documentation format that almost all GNU software manuals are
distributed in, and it should be installed on Gentoo.

/a

On Fri, 3 Sep 2004, Karel Kulhavý wrote:

 Hello

 man wget (wget 1.9) says that -D is followed by a comma-separated list of
 domains, but doesn't say the format of these domains. Should http:// be included?
 Can the domains include also the path in the URL, i. e.
 http://images.twibright.com/tns ?

 Cl


Re: --spider parameter

2004-02-11 Thread Aaron S. Hawley
Some sort of URL reporting facility is on the unspoken TODO list.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05282.html

/a

On Wed, 11 Feb 2004, Olivier SOW wrote:

 hi,

 I use Wget to check page state with the --spider parameter.
 I'm looking for a way to get back only the numeric server response (200 if OK,
 404 if missing, ...)
 but I haven't found a simple way.
 So I tried to write the result to a file and parse it, but there is no standard
 output for each response.

 can you add a parameter to get only the number, or standardize the response
 (no carriage return, and a number for an authorization failure)

 thanx for your work ;)


RE: GNU TLS vs. OpenSSL

2003-11-05 Thread Aaron S. Hawley
From Various Licenses and Comments about Them
http://www.gnu.ctssn.com/licenses/license-list.html

The OpenSSL license.
The license of OpenSSL is a conjunction of two licenses, One of them
being the license of SSLeay. You must follow both. The combination results
in a copyleft free software license that is incompatible with the GNU GPL.
It also has an advertising clause like the original BSD license and the
Apache license.

We recommend using GNUTLS instead of OpenSSL in software you write.
However, there is no reason not to use OpenSSL and applications that work
with OpenSSL.

shouldn't the above say non-copyleft?
/a

On Wed, 5 Nov 2003, Post, Mark K wrote:

 I'm a little confused.  OpenSSL is licensed pretty much the same as Apache.
 What's the GPL issue with that style of license?


 Mark Post

 -Original Message-
 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 05, 2003 8:27 AM
 To: [EMAIL PROTECTED]
 Subject: GNU TLS vs. OpenSSL


 Does anyone know what the situation is regarding the use of GNU TLS
 vs. OpenSSL?  Last time I checked (several years ago), GNU TLS was
 still in beta; also, it was incompatible with OpenSSL.

 Is GNU TLS usable now?  Should Wget try to use it in preference to
 OpenSSL and thus get rid of the GPL exception?


Re: [wget] OT: White House site prevents Iraq material being archived (fwd)

2003-11-03 Thread Aaron S. Hawley

-- Forwarded message --
From: dvanhorn [EMAIL PROTECTED]
To: list for the ballistic helmet heads [EMAIL PROTECTED]
Date: Mon, 03 Nov 2003 14:13:30 -0500
Subject: Re: [ballistichelmet] White House site prevents Iraq material
being archived
Reply-To: list for the ballistic helmet heads
[EMAIL PROTECTED]:4000
User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1) Gecko/20020827
Precedence: list
Return-Path: [EMAIL PROTECTED]

On Thu, 30 Oct 2003, Aaron S. Hawley wrote:

 [for those robots.txt fans]

 White House site prevents Iraq material being archived
 http://www.theage.com.au/articles/2003/10/28/1067233141495.html
 By Sam Varghese
 October 28, 2003

The robots.txt convention for preventing material from being archived is
merely a courtesy and is not legally binding.  The citizens of the US have
a right to the information that the administration hosts on
whitehouse.gov, therefore I've downloaded a complete mirror of the
whitehouse.gov site, complete with all the directories that are denied in
robots.txt.  Our copy of the site will be fully indexed by search engines.

http://ballistichelmet.org/pigstate/

In order to facilitate search engine indexing I've made a file that
includes links to every file within the wh.gov site.  By including the
link to that file here, search engines should pick up the material
shortly.

http://ballistichelmet.org/pigstate/www.whitehouse.gov.ls.html

(This file doesn't exist yet, but will soon.)

NB This copy of the site is not really meant for human viewing since many
of the links will be broken and images won't be shown.  This is really
only for the sake of search engines being able to index the material from
wh.gov.

I would like to fetch a mirror of the site on a daily or weekly basis and
I have plans to provide diff files that show, on a line for line basis,
the changes that have occurred with each new version of wh.gov.  We have
unlimited disk space on our new web host and we could potentially do this
with many more government sites.
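(A rough sketch of how such diffs might be produced; the dated directory
names below are only placeholders:)

# one dated mirror per run
wget --mirror --relative -e robots=off \
     --directory-prefix=wh-$(date +%Y-%m-%d) www.whitehouse.gov

# line-by-line changes between two runs
diff -urN wh-2003-11-03/www.whitehouse.gov \
          wh-2003-11-10/www.whitehouse.gov > wh.gov-changes.diff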

If you are a site administrator and would like to mirror wh.gov, the
command is:

wget --mirror www.whitehouse.gov --relative -e robots=off

I'd like to thank the GNU Project for providing high quality software to
the public, especially GNU Wget.

(The robots.txt convention was never meant to be an access control mechanism.
  It is intended to mark directories that site administrators don't want web
crawling programs to descend into because of the burden it puts on the site.
Since we host a copy of wh.gov, and we don't mind the burden, we can get rid
of robots.txt.)

-d

___
heads mailing list
[EMAIL PROTECTED]:4000
http://ballistichelmet.org/mailman/listinfo/heads


[wget] OT: White House site prevents Iraq material being archived

2003-10-30 Thread Aaron S. Hawley
[for those robots.txt fans]

White House site prevents Iraq material being archived
http://www.theage.com.au/articles/2003/10/28/1067233141495.html
By Sam Varghese
October 28, 2003

The White House website http://www.whitehouse.gov/ effectively prevents
search engines indexing and archiving material on the site related to
Iraq.

The directories on a site which can be searched by the bots sent out by
search engines can be limited by means of a file called robots.txt, which
resides in the root directory of a site.

Adding a directory to robots.txt ensures that nothing in that folder will
ever show up in a search and will never be archived by search sites.

The White House's robots.txt http://www.whitehouse.gov/robots.txt file
lists a huge number of directories all related to Iraq.

The Democrat National Committee blog
http://www.democrats.org/blog/display/00010130.html claims a change
in the robots.txt file took place sometime between April and October this
year.

Earlier this year, the White House changed pages on its website which
claimed that "combat" was over in Iraq; these pages were changed to say
"major combat."

These changes were noticed and proved by readers
http://www.differentstrings.info/archives/002813.html because
Google had archived them before the changes were made.

With the new robots.txt file, any future changes will be extremely
difficult to spot - and even more difficult to prove.


Re: wget downloading a single page when it should recurse

2003-10-17 Thread Aaron S. Hawley
The HTML of those pages contains the meta-tag

<meta name="robots" content="noindex,nofollow" />

and Wget listened, and only downloaded the first page.

Perhaps Wget should give a warning message that the file contained a
meta robots tag, so that people aren't quite so dumbfounded.
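(for the record, if you actually want Wget to disregard the tag, something
like the following should work; as far as i know -e robots=off disables
both the robots.txt check and the meta robots tag:)

wget -nd -nH -r -np -p -k -e robots=off \
     http://us4.php.net/manual/en/print/index.php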

/a

On Fri, 17 Oct 2003, Philip Mateescu wrote:

 Hi,

 I'm having a problem with wget 1.8.2 cygwin and I'm almost ready to
 swear it once worked...

 I'm trying to download the php manual off the web using this command:

 $ wget -nd -nH -r -np -p -k -S http://us4.php.net/manual/en/print/index.php

-- 
Consider supporting GNU Software and the Free Software Foundation
By Buying Stuff - http://www.gnu.org/gear/
  (GNU and FSF are not responsible for this promotion
   nor necessarily agree with the views of the author)


Re: possible bug in exit status codes

2003-09-15 Thread Aaron S. Hawley
I can verify this in the cvs version.
it appears to be isolated to the recursive behavior.

/a

On Mon, 15 Sep 2003, Dawid Michalczyk wrote:

 Hello,

 I'm having problems getting the exit status code to work correctly in
 the following scenario. The exit code should be 1 yet it is 0


DeepVaccum

2003-09-13 Thread Aaron S. Hawley
[saw this on the web..]

HexCat Software DeepVaccum
http://www.hexcat.com/deepvaccum/

DeepVaccum is a donationware web utility based on the GNU wget command
line tool. The program includes a vast number of options to fine-tune your
downloads through both the http and ftp protocols.

DV enables you to download:  whole single pages, entire sites, ftp
catalogs, link lists from a text file, filtered types (e.g. images).

-- 
 Too much use of American power overseas causes the nation to look
  like 'the ugly American'. -- Gov. George W. Bush


Re: suggestion

2003-09-12 Thread Aaron S. Hawley
is -nv (non-verbose) an improvement?

$ wget -nv www.johnjosephbachir.org/
12:50:57 URL:http://www.johnjosephbachir.org/ [3053/3053] - index.html [1]
$ wget -nv www.johnjosephbachir.org/m
http://www.johnjosephbachir.org/m:
12:51:02 ERROR 404: Not Found.

but if you're not satisfied you could use shell redirection and the tail
command:

$ wget -nv www.johnjosephbachir.org/m 2>&1 > /dev/null | tail +2

you could use the return value on error to echo whatever you want:
$ wget -q www.johnjosephbachir.org/m || echo Error
Error


On Fri, 12 Sep 2003, John Joseph Bachir wrote:

 it would be great if there was a flag that could be used with -q that
 would only give output if there was an error.

 i use wget a lot in pcs:

   johnjosephbachir.org/pcs

 thanks!
 john


Re: wget --spider issue

2003-09-10 Thread Aaron S. Hawley
On Wed, 10 Sep 2003, Andreas Belitz wrote:

 Hi,

   i have found a problem regarding wget --spider.

   It works great for any files over http or ftp, but as soon as one of
   these two conditions occur, wget starts downloading the file:

   1. linked files (i'm not 100% sure about this)
   2. download scripts (i.e. http://www.nothing.com/download.php?file=12345)

   i have included one link that starts downloading even if using the
   --spider option:

   
  http://club.aopen.com.tw/downloads/Download.asp?RecNo=3587&Section=5&Product=Motherboards&Model=AX59%20Pro&Type=Manual&DownSize=8388
   (MoBo Bios file);

   so this actually starts downloading:

   $ wget --spider 
  'http://club.aopen.com.tw/downloads/Download.asp?RecNo=3587&Section=5&Product=Motherboards&Model=AX59%20Pro&Type=Manual&DownSize=8388'

actually, what you call download scripts are actually HTTP redirects, and
in this case the redirect is to an FTP server and if you double-check i
think you'll find Wget does not know how to spider in ftp.  end
run-on-sentence.

   If there is no conclusion to this problem using wget, can anyone
 recommend another Link-Verifier? What i want to do is: check the
 existence of some 200k links stored in a database. So far i was trying
 to use /usr/bin/wget --spider \' . $link . \' 2>&1 | tail -2 | head
 -1 in a simple php script.

I do something similar with Wget (using shell scripting instead), and I am
pleased with the outcome.  Since you are calling Wget for each link, and
since Wget does a good job of returning success or failure, you can
actually do this:

wget --spider "$link" || echo "$link" >> badlinks.txt
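a minimal version of that kind of loop, assuming one URL per line in a
file called links.txt (the file name is just an example):

while read link; do
  wget -q --spider "$link" || echo "$link" >> badlinks.txt
done < links.txt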

I can send you my shell scripts if you're interested.
/a


   Thanks for any help!

-- 
Our armies do not come into your cities and lands as conquerors or
enemies, but as liberators.
 - British Lt. Gen. Stanley Maude. Proclamation to the People of the
   Wilayat of Baghdad. March 8, 1917.


Re: wget --spider issue

2003-09-10 Thread Aaron S. Hawley

On Wed, 10 Sep 2003, Andreas Belitz wrote:

 Hi Aaron S. Hawley,

 On Wed, 10. September 2003 you wrote:

 ASH actually, what you call download scripts are actually HTTP redirects, and
 ASH in this case the redirect is to an FTP server and if you double-check i
 ASH think you'll find Wget does not know how to spider in ftp.  end
 ASH run-on-sentence.

 Ok. This seems to be the reason. Thanks. Is there any way to make wget
 spider ftp adresses?

I sent a patch to this list over the winter.  it's included with the shell
scripts i spoke of and have attached to this message.

 ASH I can send you my shell scripts if you're interested.
 ASH /a

 That would be great!

gnurls-0.1.tar.gz
Description: Binary data


Re: Help needed! How to pass XML message to webserver

2003-09-09 Thread Aaron S. Hawley
Wget doesn't currently have HTTP file upload capabilities, but if this XML
message can be sent as CGI POST parameters then Wget could probably do it.
You'll need to figure out how exactly the XML message is sent over HTTP.
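for what it's worth, the CVS version of Wget has --post-data/--post-file
options; if the server accepts the XML document as a raw POST body, a rough
sketch (the file name, header value, and URL are only guesses) would be:

wget --post-file=message.xml \
     --header='Content-Type: text/xml' \
     http://server.example.com/receive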

/a

On Mon, 8 Sep 2003, Vasudha Chiluka wrote:

 Hi ,

 I need to pass XML message to a webserver using http.
 Could anybody tell me how I can accomplish this using wget.
 Any help is greatly appreciated..

 Thanks
 Vasudha


Re: [SPAM?:###] RE: wget -r -p -k -l 5 www.protcast.com doesnt pull some images though they are part of the HREF

2003-09-09 Thread Aaron S. Hawley
I, on the other hand, am actually not sure why you're not able to
have Wget find the marked up (not javascript) image.

Cause it worked for me.

% ls -l www.protcast.com/Grafx/menu-contact_\(off\).jpg
-rw---   1 ashawley usr 2377 Jan 10 2003  
www.protcast.com/Grafx/menu-contact_(off).jpg

/a

On Tue, 9 Sep 2003, Post, Mark K wrote:

 No, it won't.  The javascript stuff makes sure of that.


 Mark Post

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, September 09, 2003 4:32 PM
 To: [EMAIL PROTECTED]
 Subject: wget -r -p -k -l 5 www.protcast.com doesnt pull some images
 though they are part of the HREF


 Hi,
 I am having some problems with  downloading  www.protcast.com.
 I used  wget -r -p -k -l 5 www.protcast.com
 In www.protcast.com/Grafx files  menu-contact_(off).jpg get downloaded.
 However menu-contact_(on).jpg does not get downloaded though it lies in the
 same directory as the
 menu-contact_(off).jpg file.

 index.html contains the following HREF

 <A HREF="contact.htm"
 ONMOUSEOVER="msover1('m-contact','Grafx/menu-contact_(on).jpg');"
 ONMOUSEOUT="msout1('m-contact','Grafx/menu-contact_(off).jpg');">
 <IMG SRC="Grafx/menu-contact_(off).jpg" NAME="m-contact" WIDTH=197
 HEIGHT=29 BORDER=0></A>

 so wget should be able to see this image right?.

 Please help/advice.
 bye


Re: How to force wget to download Java Script links?

2003-09-08 Thread Aaron S. Hawley
wget doesn't have a javascript interpreter.

On Mon, 8 Sep 2003, Andrzej Kasperowicz wrote:

 How to force wget to download Java Script links:

 http://znik.wbc.lublin.pl/ChemFan/kalkulatory/javascript:wrzenie():
 17:04:44 ERROR 404: Not Found.
 http://znik.wbc.lublin.pl/ChemFan/kalkulatory/javascript:cisnienia():
 17:04:45 ERROR 404: Not Found.

 Or maybe it can download it, but there is just an error on the web page?

 ak


Re: Reminder: wget has no maintainer

2003-08-14 Thread Aaron S. Hawley
On Tue, 12 Aug 2003, Tony Lewis wrote:

 Daniel Stenberg wrote:

  The GNU project is looking for a new maintainer for wget, as the
  current one wishes to step down.

 I think that means we need someone who:

 1) is proficient in C
 2) knows Internet protocols
 3) is willing to learn the intricacies of wget
 4) has the time to go through months' worth of email and patches
 5) expects to have time to continue to maintain wget

 Anyone here think they fit that bill?

 (Feel free to add to my suggestions about what kind of person we need.)

 2a) All of them. :)
 (Is familiar with application-level network protocols, web standards
 and their usage, correct or otherwise .  For an example see the HTML
 comments thread.  Knows (X)HTML and familiar with other web media
 formats.)
 6) Will move Wget sources to the savannah(.gnu.org)
 7) Has the patience to work with the Wget source which likely burnt out
   the previous maintainers.  smiley
 8) Understands the need to redesign Wget, someday.. soon smiley
 9) Supports free software!  whee!

it seems like there are a lot of hackers, bug-reporters, users and
supporters who could collectively accomplish a lot of the work; we really
just need an all-knowing philosopher-queen or king, not necessarily to do a
whole lot of work, but just to have an active CVS account.  so as technically
threatening as the Wget maintainer application looks, we could currently
most benefit from a trusted soul.

/a


Re: help with wget????

2003-08-14 Thread Aaron S. Hawley
searching the web i found out that cygwin has wget and there's also this:

http://kimihia.org.nz/projects/cygwget/

/a

On Wed, 13 Aug 2003, Shell Gellner wrote:

 Dear Sirs,

 I've downloaded the GNU software but when I try to run the WGET.exe file
 it keeps telling me 'is linked to missing export LIBEAY32.DLL:3212' and
 also ' A device attached to the system is not working'.

  I've scoured the different links trying to find some help with the setup
 but can find nothing.  Can you help??  Is there a source that covers setup of this 
 program
 for beginners like me?

 I'd really appreciate your help.
 Shell

 Guitar Musician
 http://www.guitarmusician.com


RE: -N option

2003-07-30 Thread Aaron S. Hawley
I guess I like Mark's --ignore-length strategy.  and it looks like this
could work with a fix to Wget found in this patch:

Index: src/ftp.c
===
RCS file: /pack/anoncvs/wget/src/ftp.c,v
retrieving revision 1.61
diff -u -c -r1.61 ftp.c
*** src/ftp.c   2003/01/11 20:12:35 1.61
--- src/ftp.c   2003/07/30 17:43:04
***
*** 1360,1369 
  tml++;
  #endif
/* Compare file sizes only for servers that tell us correct
!  values. Assumme sizes being equal for servers that lie
!  about file size.  */
cor_val = (con->rs == ST_UNIX || con->rs == ST_WINNT);
!   eq_size = cor_val ? (local_size == f->size) : 1 ;
  if (f->tstamp <= tml && eq_size)
{
  /* Remote file is older, file sizes can be compared and
--- 1360,1370 
  tml++;
  #endif
/* Compare file sizes only for servers that tell us correct
!  values. Assume sizes being equal for servers that lie
!  about file size, or if givin the ignore length option */
cor_val = (con->rs == ST_UNIX || con->rs == ST_WINNT);
!   eq_size = cor_val
!     && !opt.ignore_length ? (local_size == f->size) : 1;
  if (f->tstamp <= tml && eq_size)
{
  /* Remote file is older, file sizes can be compared and

On Tue, 29 Jul 2003, Post, Mark K wrote:

 Other than the --ignore-length option I mentioned previously, no.  Sorry.

 Mark Post

 -Original Message-
 From: Preston [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, July 29, 2003 7:01 PM
 To: [EMAIL PROTECTED]
 Subject: Re: -N option

 To answer questions asked so far:  We are using wget version 1.8.2
 I have checked the dates on the local file and the remote file and the
 local file date is newer.  The reason I thought it was still clobbering
 despite the newer date on the local was because of the size difference.
 I read that in the online manual here:
  http://www.gnu.org/manual/wget/html_chapter/wget_5.html#SEC22

 At the bottom it says,

 If the local file does not exist, or the sizes of the files do not
 match, Wget will download the remote file no matter what the time-stamps
 say.

 I do want newer files on the remote to replace older files on the local
 server.  Essentially, I want the newest file to remain on the local.
 The problem I am having, however is that if we change/update files on
 the local, if they are of a different size, the remote copy is
 downloaded and clobbers the local no matter what the dates are.  I hope
 this is clear, sorry if I have not explained the problem well.  Let me
 know if you have anymore ideas and if you need me to try again to
 explain.  Thanks for your help.

 Preston
 [EMAIL PROTECTED]

-- 
PINE 4.55 Mailer - www.washington.edu/pine/
source-included, proprietary, gratis, text-based, console email client

RE: wget: ftp through http proxy not working with 1.8.2. It does work with 1.5.3

2003-07-14 Thread Aaron S. Hawley

Wget maintainer:
http://www.geocrawler.com/archives/3/409/2003/3/0/10399285/

-- 
The geocrawler archives for Wget are alive again!

On Mon, 14 Jul 2003, Hans Deragon (QA/LMC) wrote:

 Hi again.

   Some people have reported experiencing the same problem, but nobody
 from the development team has forwarded a comment on this.  Can anybody
 tell us if this is a bug or some config issue?


 Regards,
 Hans Deragon


Re: selecting range to download with wget

2003-07-10 Thread Aaron S. Hawley
how about the -Q, or --quota, option?
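for example (the URL is just a placeholder; note the quota only takes
effect for recursive retrievals or URL lists, not a single file given on
the command line):

wget -r -Q100m http://ftp.example.org/pub/some-directory/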

On Thu, 10 Jul 2003, fehmi ben njima wrote:

 hello
 i am using a usb key as a storage disk in school
 and i want to download files that are bigger than the capacity of the usb disk
 so i want a script or modification to make in the wget source code
 so i can specify the downloading range

 --
 Fehmi Ben njima
 Etudiant Dea Oss à l'universite de technologie de troyes

-- 
GNU Press - Free (non-gratis) books from the Free Software Foundation
www.gnupress.org


Re: Capture HTML Stream

2003-07-09 Thread Aaron S. Hawley
shit, i'd just use lynx or links to do

links -source www.washpost.com

but wget could do

wget -O /dev/stdout www.washpost.com

On Wed, 9 Jul 2003, Jerry Coleman wrote:

 Is there a way to suppress the creation of a .html file, and instead
 redirect the output to stdout?   I want to issue a wget command, and just
 grep for some data that is contained in the resulting html file.

 Maybe I am missing a command line option to do this, but I don't see one.

-- 
GNU Press
www.gnupress.org


Re: Fw: wget with openssl problems

2003-07-09 Thread Aaron S. Hawley
we're all used to J K's personality, now.

On Wed, 9 Jul 2003, Toby Corkindale wrote:

 What's your problem?
 That has to be the least informative email I've seen in a long time.

 tjc
 (apologies for top-posting in reply)

 On Thu, Jul 03, 2003 at 03:20:28PM +0200, J K wrote:
  FUCK YOU!!!
  snip


Re: Capture HTML Stream

2003-07-09 Thread Aaron S. Hawley
try also:

wget -O - www.washpost.com

On Wed, 9 Jul 2003, Gisle Vanem wrote:

 Aaron S. Hawley [EMAIL PROTECTED] said:

  but wget could do
 
  wget -O /dev/stdout www.washpost.com

 On DOS/Windows too? I think not. There must be a better way.

 --gv


Re: Calling cgi from the downloaded page - Simulating a Browser

2003-07-08 Thread Aaron S. Hawley
try:

wget -p

On Tue, 8 Jul 2003 [EMAIL PROTECTED] wrote:


 I am able to download an HTML page.
 That page has several cgi-calls generating images.
 When calling that page with a browser the images are generated and
 stored for further usage in an image cache on the server.

 I was expecting that the download of the HTML page would cause
 the cgi-generation of the images and their storage in cache.

 Why is that not the case when downloading a HTML page to disk?

 When the browser renders the HTML page,__he__ is calling the cgi string.

 Is there a mode in wget to present the downloaded page as it is made
 in a browser?

 Thank you, Maggi


Re: Windows Schedule tool for starting/stopping wget?

2003-07-03 Thread Aaron S. Hawley
no such facility currently exists for wget.  this is a question of job
control and is better directed at your operating system.

On Thu, 3 Jul 2003 [EMAIL PROTECTED] wrote:

 Hi

 I'm calling the wget program via a .bat file on a win2000 PC. Works ok.
 I have to schedule the start/stop of this, so i'm sure wget does not start
 before afternoon, and then stops even though it's not finished, at a specified
 time, due to other sync jobs that have to run.

 I have tried to use the simple Scheduled task in WIN2000, but this will not stop
 the wget process again at a specified time.

 Anyone got a hint here what i have to do/use?

 Best Regards / Venlig Hilsen
 Lars Myrthu Rasmussen


Re: Deleting files locally, that is not present remote any more?

2003-07-03 Thread Aaron S. Hawley
the feature to locally delete mirrored files that were not downloaded from
the server on the most recent wget --mirror has been requested previously.

On Thu, 3 Jul 2003 [EMAIL PROTECTED] wrote:

 Hi

 Just started to test wget on a win2000 PC. I'm using the mirror functionality,
 and it seems to work ok,
 Sometimes files on the remote ftp server are removed deliberately.
 I wonder if it's possible to have wget to remove those files locally also, so it
 actually is a 100% mirror that exists on the local server?

 Best Regards / Venlig Hilsen
 Lars Myrthu Rasmussen


--spider v. Server Support

2003-06-18 Thread Aaron S. Hawley
Here's a test case for the --spider option.  perhaps helpful for
documentation?

using wget on about 17,000 URLs (these are in the FSF/UNESCO Free Software
Directory and are not by any means unique), about 395 of them
generate errors when run with the spider option (--spider) of the wget
command.  --spider uses an HTTP HEAD request rather than a GET.

interestingly, when running again without the spider option (instead just
using --delete-after) only 334 URLs generate errors.  61 of the URLs that
had failed with --spider actually worked, most likely because the server
only supported the HTTP GET request and choked on the HTTP HEAD request.

not necessarily a worthy sample, but 61 out of 17,000 URLs don't support
wget's spidering technique.
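for the curious, the comparison boils down to something like the following
two runs (the URL list file name is made up):

# HEAD requests only
wget --spider -i urls.txt -o spider.log

# full GET requests, discarding the downloaded files afterwards
wget --delete-after -i urls.txt -o get.log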

/a


Re: suggestion

2003-06-17 Thread Aaron S. Hawley
it's available in the CVS version..

information at:
http://www.gnu.org/software/wget/

On Tue, 17 Jun 2003, Roman Dusek wrote:

 Dear Sirs,

 thanks for WGet, it's a great tool. I would very appreciate one more
 option: a possibility to get http page using POST method instead of GET.

 Cheers,
 Roman

-- 
Women do two-thirds of the work for five percent of the world's income.


--spider problems

2003-06-17 Thread Aaron S. Hawley
I use the --spider option a lot and don't have trouble with most sites.
When using the --spider option for the Mozilla website I get a 500 error
response.  Without the --spider option I don't see the problem.  Any
guesses?

$ wget --debug --spider www.mozilla.org
DEBUG output created by Wget 1.9-beta on aix4.3.3.0.

--14:43:09--  http://www.mozilla.org/
   => `index.html'
Resolving www.mozilla.org... done.
Caching www.mozilla.org => 207.200.81.215
Connecting to www.mozilla.org[207.200.81.215]:80... connected.
Created socket 3.
Releasing 20016c58 (new refcount 1).
---request begin---
HEAD / HTTP/1.0
User-Agent: Wget/1.9-beta
Host: www.mozilla.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 500 Server Error
Server: Netscape-Enterprise/3.6
Date: Tue, 17 Jun 2003 18:43:09 GMT
Content-length: 305
Content-type: text/html
Connection: keep-alive


Found www.mozilla.org in host_name_addresses_map (20016c58)
Registered fd 3 for persistent reuse.
Closing fd 3
Releasing 20016c58 (new refcount 1).
Invalidating fd 3 from further reuse.
14:43:10 ERROR 500: Server Error.

$ wget www.mozilla.org
--14:43:42--  http://www.mozilla.org/
   => `index.html'
Resolving www.mozilla.org... done.
Connecting to www.mozilla.org[207.200.81.215]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15,839 [text/html]

100%[====================================>] 15,839        22.65K/s

14:43:43 (22.65 KB/s) - `index.html' saved [15839/15839]


RE: wget feature requests

2003-06-17 Thread Aaron S. Hawley
i submitted a patch in february.

http://www.mail-archive.com/wget%40sunsite.dk/msg04645.html
http://www.geocrawler.com/archives/3/409/2003/2/100/10313375/

On Tue, 17 Jun 2003, Peschko, Edward wrote:

 Just upgraded to 1.8.2 and ok, I think I see the problem here...

 --spider only works with html files.. right? If so, why?

  Ed


-- 
PINE 4.55 Mailer - www.washington.edu/pine/
source-included, proprietary, gratis, text-based, console email client


Re: Feature Request: Fixed wait

2003-06-17 Thread Aaron S. Hawley
how is your request different than --wait ?
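for reference, the existing knobs look like this (the URLs are placeholders):

# pause 10 seconds between successive retrievals
wget --wait=10 -r http://www.example.org/

# cap the linear backoff between retries of the same file at 30 seconds
wget --waitretry=30 http://www.example.org/flaky-file.tar.gz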

On Mon, 16 Jun 2003, Wu-Kung Sun wrote:

 I'd like to request an additional (or modified) option
 that waits for whatever time specified by the user, no
 more no less (instead of the linear backoff of
 --waitretry which is just a slightly less obnoxious
 form of hammering).  Looks like it would only take a
 few lines of code but I can't figure out how to
 compile on Cygwin.

 If someone knows how to compile it on Cygwin, can do
 it with VC++, or some other wget-ish win32 program
 (fairly small and noninteractive), I'd really,
 _*REALLY*_ appreciate it!
 swk


Re: trailing '/' of include-directories removed bug

2003-06-16 Thread Aaron S. Hawley
you're right, the include-directories option operates much the same way
(my guess in the interest of speed) as the rest of the accept/reject
options.

which (others have also noticed) is a little flakey.
/a

On Fri, 13 Jun 2003, wei ye wrote:

 Did you test your patch? I patched it on my source code and it doesn't work.

 There are lot of files under http://biz.yahoo.com/edu/, but
 the patched code only downloaded the index.html.

 [EMAIL PROTECTED] src]$ ./wget -r --domains=biz.yahoo.com -I /edu/
 http://biz.yahoo.com/edu/
 [EMAIL PROTECTED] src]$ ls biz.yahoo.com/
 edu/
 [EMAIL PROTECTED] src]$ ls biz.yahoo.com/edu/
 index.html
 [EMAIL PROTECTED] src]$


 Here is the debug info; note that in the proclist() function, frontcmp(p, s)
 is supposed to return 1, but it returns 0.
 `p' is 'edu/', which keeps the trailing '/' from the parameter, and 's'
 is 'edu' - the directory of the crawled url. Since 's' doesn't start with 'p',
 it failed.

 If we pass the url's 'path' instead of 'dir' to accdir(), it may work.

 Actually, I really recommend changing the '-include-directories' parameter to
 '-include-urls' (so does -exclude..). Then keeping the '/' characters in the
 parameter makes more sense and is easier to use. I used htdig before, which uses
 'exclude_urls: /cgi-bin/' as well in its configuration.


 [EMAIL PROTECTED] src]$ gdb wget
 (gdb) b accdir
 Breakpoint 1 at 0x806cb42: file utils.c, line 714.
 (gdb) run -r  --domains=biz.yahoo.com -I /edu/ http://biz.yahoo.com/edu/
 Starting program: /home/weiye/downloads/wget-1.8.2/src/wget -r
 --domains=biz.yahoo.com - I /edu/ http://biz.yahoo.com/edu/
 --18:55:07--  http://biz.yahoo.com/edu/
= `biz.yahoo.com/edu/index.html'
 Resolving biz.yahoo.com... done.
 Connecting to biz.yahoo.com[66.163.175.141]:80... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]

 [ <=>  ] 6,741  6.43M/s


 18:55:07 (6.43 MB/s) - `biz.yahoo.com/edu/index.html' saved [6741]


 Breakpoint 1, accdir (directory=0x8089df0 edu, flags=ALLABS) at utils.c:714
 714   if (flags & ALLABS && *directory == '/')
 (gdb) n
 716   if (opt.includes)
 (gdb)
 718   if (!proclist (opt.includes, directory, flags))
 (gdb) s
 proclist (strlist=0x807f090, s=0x8089df0 edu, flags=ALLABS) at utils.c:690
 690   for (x = strlist; *x; x++)
 (gdb) n
 691 if (has_wildcards_p (*x))
 (gdb) p *x
 $1 = 0x807f0a0 /edu/
 (gdb) n
 698 char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove
 '/' */
 (gdb)
 699 if (frontcmp (p, s))
 (gdb) p p
 $2 = 0x807f0a1 edu/
 (gdb) p s
 $3 = 0x8089df0 edu
 (gdb) p p
 $4 = 0x807f0a1 edu/
 (gdb) n
 701   }
 (gdb) bt
 #0  proclist (strlist=0x807f090, s=0x8089df0 edu, flags=ALLABS) at
 utils.c:701
 #1  0x806cb76 in accdir (directory=0x8089df0 edu, flags=ALLABS) at
 utils.c:718
 #2  0x8064d8d in download_child_p (upos=0x807e7e0, parent=0x808c800, depth=0,
 start_url_parsed=0x808, blacklist=0x807e100) at recur.c:514
 #3  0x80648b0 in retrieve_tree (start_url=0x807e080
 http://biz.yahoo.com/edu/;)
 at recur.c:348
 #4  0x8062179 in main (argc=6, argv=0x9fbff444) at main.c:822
 #5  0x804a20d in _start ()
 (gdb)

 Thanks very much!!

-- 
Consider supporting GNU Software and the Free Software Foundation
By Buying Stuff - http://www.gnu.org/gear/
  (GNU and FSF are not responsible for this promotion
   nor do they necessarily agree with the views or opinions of the author)


Re: trailing '/' of include-directories removed bug

2003-06-13 Thread Aaron S. Hawley
no, i think your original idea of getting rid of the code that removes the
trailing slash is a better idea.  i think this would fix it but keep the
degenerate case of root directory (whatever that's about):

Index: src/init.c
===
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.54
diff -u -u -r1.54 init.c
--- src/init.c  2002/08/03 20:34:57 1.54
+++ src/init.c  2003/06/13 20:24:16
@@ -753,7 +753,6 @@

   if (*val)
 {
-  /* Strip the trailing slashes from directories.  */
   char **t, **seps;

   seps = sepstring (val);
@@ -761,10 +760,10 @@
{
  int len = strlen (*t);
  /* Skip degenerate case of root directory.  */
- if (len > 1)
+ if (len == 1)
{
- if ((*t)[len - 1] == '/')
-   (*t)[len - 1] = '\0';
+ if ((*t)[0] == '/')
+   (*t)[0] = '\0';
}
}
   *pvec = merge_vecs (*pvec, seps);

On Thu, 12 Jun 2003, wei ye wrote:

 For the situation I only need '/r/', there is no option for I to do that.

 If user need '/r*/', they should specify -I '/r*/' instead.

 Simple patch attached, please consider it. Thanks!!

 [EMAIL PROTECTED] src]$ diff  -u utils.c.orig utils.c
 --- utils.c.origFri May 17 20:05:22 2002
 +++ utils.c Thu Jun 12 20:24:21 2003
 @@ -696,7 +696,9 @@
  else
{
 char *p = *x + ((flags & ALLABS) && (**x == '/')); /* Remove '/' */
 -   if (frontcmp (p, s))
 +   /* if *p="c", pass if s is "c" or "c/..." not "ca" */
 +   int plen = strlen(p);
 +   if ( (strncmp (p, s, plen) == 0) && (s[plen] == '/' || s[plen] == '\0')
 )
   break;
}
return *x;
 [EMAIL PROTECTED] src]$


-- 
I get threatening vacation messages from J K, too.


Re: wget recursion options ?

2003-06-12 Thread Aaron S. Hawley
there doesn't seem to be anything wrong with the page.

are you having trouble with recursive wgets with other (all) pages, or just
this one?
/a


Re: trailing '/' of include-directories removed bug

2003-06-12 Thread Aaron S. Hawley
above the code segment you submitted (line 765 of init.c) the
comment:

/* Strip the trailing slashes from directories.  */

here are the manual notes on this option:

(from Recursive Accept/Reject Options)

`-I list'
`--include-directories=list'
Specify a comma-separated list of directories you wish to follow when
downloading (See section Directory-Based Limits for more details.)
Elements of list may contain wildcards.

 --- and ---

(from Directory-Based Limits)

`-I list'
`--include list'
`include_directories = list'
`-I' option accepts a comma-separated list of directories included in
the retrieval. Any other directories will simply be ignored. The
directories are absolute paths. So, if you wish to download from
`http://host/people/bozo/' following only links to bozo's colleagues in
the `/people' directory and the bogus scripts in `/cgi-bin', you can
specify:

wget -I /people,/cgi-bin http://host/people/bozo/

---

On Wed, 11 Jun 2003, wei ye wrote:

 I'm trying to crawl url with  --include-directories='/r/'
 parameter.

 I expect to crawl '/r/*', but wget gives me '/r*'.

 By reading the code, it turns out that cmd_directory_vector()
 removed the trailing '/' of include-directories '/r/'.

 It's a minor bug, but I hope it could be fix in next version.

 Thanks!

 static int cmd_directory_vector(...) {
  ...
   if (len > 1)
 {
   if ((*t)[len - 1] == '/')
 (*t)[len - 1] = '\0';
 }
  ...

 }

 =
 Wei Ye

-- 
Yahweh commanded Abraham to sacrifice his only son Isaac on the top of a
mountain. When Abraham asked why, Yahweh replied because 'I am God.' When
I heard this story the first time, I promised myself to check out
atheism.  -- Louis Proyect www.marxmail.org/


Re: trailing '/' of include-directories removed bug

2003-06-12 Thread Aaron S. Hawley
oh, i understand your problem.  your request seems reasonable.  i was
trying to see if anyone had an idea why it seemed to be more of a
feature than a bug.

On Thu, 12 Jun 2003, wei ye wrote:


 Please take a look this example:
 $ \rm -rf biz.yahoo.com
 $ ls biz.yahoo.com
 $ wget -r  --domains=biz.yahoo.com -I /r/ 'http://biz.yahoo.com/r/'
 $ ls biz.yahoo.com/
  r/  reports/  research/
 $

 I want only '/r/', but it crawls /r*, which includes /reports/, /research/.

 Is it an expected result or a bug?

 Thanks alot!


 --- Aaron S. Hawley [EMAIL PROTECTED] wrote:
  above the code segment you submitted (line 765 of init.c) the
  comment:
 
  /* Strip the trailing slashes from directories.  */
 
  here are the manual notes on this option:
 
  (from Recursive Accept/Reject Options)
 
  `-I list'
  `--include-directories=list'
  Specify a comma-separated list of directories you wish to follow when
  downloading (See section Directory-Based Limits for more details.)
  Elements of list may contain wildcards.
 
   --- and ---
 
  (from Directory-Based Limits)
 
  `-I list'
  `--include list'
  `include_directories = list'
  `-I' option accepts a comma-separated list of directories included in
  the retrieval. Any other directories will simply be ignored. The
  directories are absolute paths. So, if you wish to download from
  `http://host/people/bozo/' following only links to bozo's colleagues in
  the `/people' directory and the bogus scripts in `/cgi-bin', you can
  specify:
 
  wget -I /people,/cgi-bin http://host/people/bozo/
 
  ---
 
  On Wed, 11 Jun 2003, wei ye wrote:
 
   I'm trying to crawl url with  --include-directories='/r/'
   parameter.
  
   I expect to crawl '/r/*', but wget gives me '/r*'.
  
   By reading the code, it turns out that cmd_directory_vector()
   removed the trailing '/' of include-directories '/r/'.
  
   It's a minor bug, but I hope it could be fix in next version.
  
   Thanks!
  
   static int cmd_directory_vector(...) {
...
  if (len > 1)
   {
 if ((*t)[len - 1] == '/')
   (*t)[len - 1] = '\0';
   }
...
  
   }
  
   =
   Wei Ye

-- 
Fight for Free Digital Speech
www.digitalspeech.org


Re: WGET help needed

2003-06-11 Thread Aaron S. Hawley
http://www.gnu.org/manual/wget/

On Wed, 11 Jun 2003, Support, DemoG wrote:

 hello,

 I need help on this subject:
 Please tell me what the command line is if i want to get all the files and
 subdirectories, with everything they contain, from an ftp site like ftp.mine.com.
 also i have the user and pass, and i will use this in Shell.


Re: method to download this url

2003-06-06 Thread Aaron S. Hawley
Wget Manual: Types of Files
http://www.gnu.org/manual/wget/html_node/wget_19.html
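one rough starting point (untested against that particular site) is to
recurse but reject the image types:

wget -r -np --reject=gif,jpg,jpeg,png http://qmail.faqts.com/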

On Thu, 5 Jun 2003, Payal Rathod wrote:

 Hi all,
 I need some kind of method to download only the questions and answers listed in
 this url. I don't want any pictures, just the questions
 and their answers.
 The url is http://qmail.faqts.com/
 It is harder than it looks.
 I want to download all the questions and answers. Can anyone suggest any method?
 Thanks a lot.
 With warm regards,
 -Payal
 p.s. kindly mark a cc to me.


Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
On Wed, 4 Jun 2003, Tony Lewis wrote:

 Adding this function to wget seems reasonable to me, but I'd suggest that it
 be off by default and enabled from the command line with something
 like --quirky_comments.

why not just have the default wget behavior follow comments explicitly
(i've lost track whether wget does that or needs to be amended) /and/
have an option that goes /beyond/ quirky comments and is just
--ignore-comments ? :)

/a


Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
i suppose my proposal should have been called --disobey-comments (comments
are already ignored by default).

i'm just saying what's going to happen when someone posts to this list:
"My Web Pages have [insert obscure comment format] for comments and Wget
is considering them to (not) be comments.  Can you change the [insert
Wget comment mode] comment mode to (not) recognize my comments?"

i think the idea of quirky comment modes is cool, but is it the better
solution?
/a

On Wed, 4 Jun 2003, Aaron S. Hawley wrote:

 why not just have the default wget behavior follow comments explicitly
 (i've lost track whether wget does that or needs to be amended) /and/
 have an option that goes /beyond/ quirky comments and is just
 --ignore-comments ? :)

 /a


Re: Comment handling

2003-06-05 Thread Aaron S. Hawley
On Wed, 4 Jun 2003, George Prekas wrote:

 snip

  i think the idea of quirky comments modes are cool, but is it the better
  solution?

 Do you think that the current algorithm shouldn't be improved? Even, a
 little bit to handle the common mistakes?

i think Wget's default behavior should be improved where reasonable.  i
know people had profiled Wget's current behavior and profiled proposals
for more reasonable behavior, but i can't find a web archive of those
posts.

/a


Re: wget vs mms://*.wmv?

2003-06-04 Thread Aaron S. Hawley
another proprietary protocol brought to you by the folks in redmond,
washington.

http://sdp.ppona.com/
http://geocities.com/majormms/

On Sun, 1 Jun 2003, Andrzej Kasperowicz wrote:

 How could I download using wget that:
 mms://mms.itvp.pl/bush_archiwum/bush.wmv

 If wget cannot manage it then what can?

 Cheers!
 Andy

-- 
Our armies do not come into your cities and lands as conquerors or
enemies, but as liberators.
 - British Lt. Gen. Stanley Maude. Proclamation to the People of the
   Wilayat of Baghdad. March 8, 1917.


Re: Comment handling

2003-05-31 Thread Aaron S. Hawley
On Fri, 30 May 2003, George Prekas wrote:

 I have found a bug in Wget version 1.8.2 concerning comment handling ( <!--
 comment --> ). Take a look at the following illegal HTML code:
 <HTML>
 <BODY>
 <a href="test1.html">test1.html</a>
 <!--
 <a href="test2.html">test2.html</a>
 <!--
 </BODY>
 </HTML>

 Now, save the above snippet as test.html and try wget -Fi test.html. You
 will notice that it doesn't recognise the second link.

is it really an invalid comment?  i didn't see the second link when
viewing the file in `lynx` or `links`.

 I have found a solution to the above situation and have properly patched
 html-parse.c and I would like some info on how can I give you the patch.

 Regards,
 George Prekas

 P.S. Sorry about this message, but it appears that the first one did not
 show up in the list.

i think it showed up twice (or maybe i'm getting duplicates).  yeah the
web archives suck.  they should be put on mail.gnu.org


string comparison macro (was Re: using user-agent to identify for robots.txt

2003-05-30 Thread Aaron S. Hawley
while investigating this bug i noticed the following macro defined on line
197 of wget.h:

/* The same as above, except the comparison is case-insensitive. */
#define BOUNDED_EQUAL_NO_CASE(beg, end, string_literal) \
  ((end) - (beg) == sizeof (string_literal) - 1 \
   && !strncasecmp ((beg), (string_literal),\
     sizeof (string_literal) - 1))

shouldn't it be strlen not sizeof?

interestingly, the byte size of a char ptr is equal to the length of the
string "Wget"?  coincidence?

/a

On Wed, 28 May 2003, Christian von Ferber wrote:

 Hi,

 I am mirroring a friendly site that excludes robots in general but
 is supposed to allow my FriendlyMirror using wget.
 For this purpose I asked the webadmin to set up his robots.txt as follows:

 clip


Re: using user-agent to identify for robots.txt

2003-05-30 Thread Aaron S. Hawley
This patch seems to do user-agent checks correctly (it might have been
broken previously) with a correction to a string comparison macro.

The patch also uses the value of the --user-agent option when enforcing
robots.txt rules.

this patch is against CVS, more on that here:
http://www.gnu.org/software/wget/

Index: init.c
===
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.54
diff -u -u -r1.54 init.c
--- init.c  2002/08/03 20:34:57 1.54
+++ init.c  2003/05/29 17:51:50
@@ -271,6 +271,7 @@
   opt.timeout = 900;
 #endif
   opt.use_robots = 1;
+  opt.useragent = xstrdup ("Wget");

   opt.remove_listing = 1;

Index: res.c
===
RCS file: /pack/anoncvs/wget/src/res.c,v
retrieving revision 1.7
diff -u -u -r1.7 res.c
--- res.c   2002/05/18 02:16:24 1.7
+++ res.c   2003/05/29 17:51:50
@@ -115,7 +115,7 @@
   *matches = 1;
   *exact_match = 0;
 }
-  else if (BOUNDED_EQUAL_NO_CASE (agent, agent + length, "wget"))
+  else if (BOUNDED_EQUAL_NO_CASE (agent, agent + length, opt.useragent))
 {
   *matches = 1;
   *exact_match = 1;
@@ -355,7 +355,7 @@
}
   else
{
- DEBUGP (("Ignoring unknown field at line %d", line_count));
+ DEBUGP (("Ignoring unknown field at line %d\n", line_count));
  goto next;
}

Index: wget.h
===
RCS file: /pack/anoncvs/wget/src/wget.h,v
retrieving revision 1.34
diff -u -u -r1.34 wget.h
--- wget.h  2002/05/18 02:16:25 1.34
+++ wget.h  2003/05/29 17:51:50
@@ -189,15 +189,15 @@
 /* Return non-zero if string bounded between BEG and END is equal to
STRING_LITERAL.  The comparison is case-sensitive.  */
 #define BOUNDED_EQUAL(beg, end, string_literal)\
-  ((end) - (beg) == sizeof (string_literal) - 1\
+  ((end) - (beg) == strlen (string_literal) - 1\
    && !memcmp ((beg), (string_literal),\
-  sizeof (string_literal) - 1))
+  strlen (string_literal) - 1))

 /* The same as above, except the comparison is case-insensitive. */
 #define BOUNDED_EQUAL_NO_CASE(beg, end, string_literal)\
-  ((end) - (beg) == sizeof (string_literal) - 1\
+  ((end) - (beg) == strlen (string_literal)\
 !strncasecmp ((beg), (string_literal),   \
-   sizeof (string_literal) - 1))
+   strlen (string_literal)))

 /* Note that this much more elegant definition cannot be used:

On Wed, 28 May 2003, Christian von Ferber wrote:

 Hi,

 I am mirroring a friendly site that excludes robots in general but
 is supposed to allow my FriendlyMirror using wget.
 For this purpose I asked the webadmin to set up his robots.txt as follows:

 User-agent: FriendlyMirror
 Disallow:

 User-agent: *
 Disallow: /

 Starting Wget by

 wget --user-agent FriendlyMirror -m http://Friendly.Site

 Wget indeed identifies as user-agent FriendlyMirror to Friendly.Site
 but considers itself to be user-agent Wget when implementing the rules
 of robots.txt.

 I think it would be nice if Wget could be told to interpret robots.txt
 such that only my FriendlyMirror and not all other robots using wget
 will continue automatic download.

 Any Ideas ?

 Cheers,

 Christian

-- 
#undef MACROS
Re: string comparison macro (was Re: using user-agent to identify for

2003-05-30 Thread Aaron S. Hawley
yeah, i guess that patch is really bad.

http://www.gnu.org/manual/glibc/html_node/String-Length.html

On Thu, 29 May 2003, Larry Jones wrote:

 Aaron S. Hawley writes:
 
  shouldn't it be strlen not sizeof?

 No.  An array is not converted to a pointer when it is the argument of
 sizeof, so sizeof a string literal is the number of bytes in the string
 (including the terminating NUL), not the size of a char *.

 -Larry Jones

 I hate being good. -- Calvin


Re: using user-agent to identify

2003-05-30 Thread Aaron S. Hawley
Maybe this one is better:

Index: src/init.c
===
RCS file: /pack/anoncvs/wget/src/init.c,v
retrieving revision 1.54
diff -u -u -r1.54 init.c
--- src/init.c  2002/08/03 20:34:57 1.54
+++ src/init.c  2003/05/29 19:08:51
@@ -271,6 +271,7 @@
   opt.timeout = 900;
 #endif
   opt.use_robots = 1;
+  opt.useragent = xstrdup ("Wget");

   opt.remove_listing = 1;

Index: src/res.c
===
RCS file: /pack/anoncvs/wget/src/res.c,v
retrieving revision 1.7
diff -u -u -r1.7 res.c
--- src/res.c   2002/05/18 02:16:24 1.7
+++ src/res.c   2003/05/29 19:08:51
@@ -115,7 +115,7 @@
   *matches = 1;
   *exact_match = 0;
 }
-  else if (BOUNDED_EQUAL_NO_CASE (agent, agent + length, "wget"))
+  else if (strncasecmp (agent, opt.useragent, length) == FALSE)
 {
   *matches = 1;
   *exact_match = 1;
@@ -355,7 +355,7 @@
}
   else
{
- DEBUGP (("Ignoring unknown field at line %d", line_count));
+ DEBUGP (("Ignoring unknown field at line %d\n", line_count));
  goto next;
}

On Thu, 29 May 2003, Larry Jones wrote:

 Aaron S. Hawley writes:
 
  yeah, i guess that patch is really bad.

 Yes, it is.  ;-)

-Larry Jones


[wget] maintainer

2003-03-20 Thread Aaron S. Hawley
rms put this up last week

quoted from: http://www.gnu.org/help/help.html

How to Help the GNU Project

This list is ordered roughly with the more urgent items near the top.
Please note that many things on this list link to larger, expanded lists.

* We are looking for new maintainers for these GNU packages (contact
  [EMAIL PROTECTED] if you'd like to volunteer):
  o GNU dumb
  o The Hyperbole GNU Emacs Package
  o GNU UnRTF
  o wget (which still has a maintainer, but he would like to step
down)


Re: Wget a Post Form

2003-03-18 Thread Aaron S. Hawley
my guess is that this probably isn't in the manual.

% wget --version
GNU Wget 1.9-beta

Copyright (C) 1995, 1996, 1997, 1998, 2000, 2001 Free Software Foundation,
Inc.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Originally written by Hrvoje Niksic [EMAIL PROTECTED].

%wget --help
..
   --post-data=STRINGuse the POST method; send STRING as the data.
   --post-file=FILE  use the POST method; send contents of FILE.
..
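a typical invocation would look roughly like this (the field names and URL
are invented for illustration):

wget --post-data='name=john&email=john%40example.com' \
     http://www.example.com/cgi-bin/handler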

On Tue, 18 Mar 2003, John Bennett wrote:

 I'm a really newbie when it comes to using wget and I have looked through
 the website and read through the documentation but was unable to find any
 information on using wget to supply values from a post method html form.
 Is this possible with wget?  How do you send the server the paired name
 and value items with wget?  Are the paired values supplied on the command
 line or are they supplied in a file and then supply the filename on the
 command line?

 Any help would be much appreciated.

 Please cc: me when replying since I am not a member of the list.

 Thanks in advance!
 John Bennett


wget future (was Re: Not 100% rfc 1738 complience for FTP URLs =bug

2003-03-17 Thread Aaron S. Hawley
On Thu, 13 Mar 2003, Max Bowsher wrote:

  David Balazic wrote:
 
  So it is do it yourself , huh ? :-)

 More to the point, *no one* is available who has cvs write access.

what if for the time being the task of keeping track of submissions for
wget was done with its debian package?

http://bugs.debian.org/wget
http://packages.qa.debian.org/wget

that way, at least some of the work of incorporating and releasing and
testing these code submissions can be accomplished, making things perhaps
slightly easier when the wget authors get back.

/end lame idea


Re: wget --spider doesn't work for ftp URLs

2003-03-06 Thread Aaron S. Hawley
a patch was submitted:
http://www.mail-archive.com/wget%40sunsite.dk/msg04645.html

On Thu, 6 Mar 2003, Keith Thompson wrote:

 When invoked with an "ftp://" URL, wget's --spider option is
 silently ignored and the file is downloaded.  This applies to wget
 version 1.8.2.

 To demonstrate:

 wget --spider http://www.gnu.org/index.html
 # doesn't grab the file

 wget --spider ftp://ftp.gnu.org/gnu/wget/wget-1.8.2.tar.gz
 # grabs the file

 Ideally, it should work.  At least, it should complain.



-- 
__
PINE 4.53 Mailer - www.washington.edu/pine


Re: wget 301 redirects broken in 1.8.2

2003-02-24 Thread Aaron S. Hawley
this bug is confirmed in CVS, it looks like there's been a lot of changes
to html-url.c

/a

On Thu, 20 Feb 2003, Jamie Zawinski wrote:

 Try this, watch it lose:

 wget --debug -e robots=off -nv -m -nH -np \
  http://www.dnalounge.com/flyers/

 http://www.dnalounge.com/flyers/ does a 301-redirect to
 http://www.dnalounge.com/flyers/latest.html which then 301-redirects to
 http://www.dnalounge.com/flyers/2003/02/

 wget gets the HTML from that last URL, but it expands relative URLs
 wrong, and instead of trying to retrieve
 http://www.dnalounge.com/flyers/2003/02/06-affliction.html it tries
 http://www.dnalounge.com/flyers/06-affliction.html which doesn't exist.

  flyers/2003/02/index.html: merge("http://www.dnalounge.com/flyers/latest.html", 
  "06-affliction.html") -> http://www.dnalounge.com/flyers/06-affliction.html

 wget 1.8-beta1 did not have this problem: that version works properly.

-- 
__
PINE 4.53 Mailer - www.washington.edu/pine


dev. of wget (was Re: Removing files and directories not present on remote FTP server

2003-02-14 Thread Aaron S. Hawley
On Fri, 14 Feb 2003, Max Bowsher wrote:

 If and when someone decides to implement it. But there is almost certainly
 not going to be another release until after Hrvoje Niksic has returned.

Can someone at FSF do something?  [EMAIL PROTECTED], [EMAIL PROTECTED]
This seems like the silliest reason to temporarily halt development.

end whining,
/a

-- 
__
PINE 4.53 Mailer - www.washington.edu/pine



Re: dev. of wget (was Re: Removing files and directories not present on remote FTP server

2003-02-14 Thread Aaron S. Hawley
On Fri, 14 Feb 2003, Daniel Stenberg wrote:

 Technically and theoreticly, anyone can grab the sources, patch the bugs, add
 features and release a new version. That's what the GPL is there for.

 In practise, however, that would mean stepping on Hrvoje's toes and I don't
 think anyone wants to do that. Be it people in the FSF or elsewhere.

I'm sorry to be such a luser and have asked such a lame question.  Thanks
for your kindness.  And indeed, I never meant to request that there be
stepping on anybody else's toes.  It's just unfortunate there wasn't any
continuity for wget.  Of course, it sounds like the lack of continuity was
a result of a worst-case scenario, and not of malevolence.  I didn't have
any faith the FSF would necessarily come up with something, but I thought
that was obviously one of our only options.

/a

-- 
__
PINE 4.53 Mailer - www.washington.edu/pine



[wget-patch] --spider FTP

2003-02-06 Thread Aaron S. Hawley
I'm aware that there's a desire to re-write the ftp portion of wget, but
here is a patch against CVS that so far allowed me to spider ftp URLs.
it's a dirty hack that simply uses the opt.spider variable to keep from
downloading files by returning RETROK (or maybe it was RETRFINISHED) after
observing whether there was an 505 or 200 from the RETR  command.  also
using opt.spider I attempted to stop any calculations or displaying of
downloads.  thus i didn't really verify whether this is the proper
protocol to spider in ftp, or whether all handles were closed properly.

and AFAICT, it appears to be spidering when --recursive is used.  right
now it will create the directories to write the .listing files (which
can be shut off with --no-directories).

I've been validating URLs in the GNU Free Software Directory's CVS
repository, and so far nothing has been downloaded into the working
directory (i do have --output-document set to /dev/null to make sure).  i
didn't completely verify whether --verbose or --debug are still outputting
legitimate information with --spider.
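a sketch of that sort of invocation (the URL list file name is only an
example):

./wget --spider --no-directories --output-document=/dev/null \
       --input-file=directory-urls.txt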

I think that's everything I know.

wget is great (especially --spider),
/a

ChangeLog and diff to ftp.c follow and are also attached.

2003-02-06  Aaron Hawley [EMAIL PROTECTED]

* ftp.c
(getftp): --spider option should now work with FTP.
(ftp_loop_internal): --spider option will not calculate or
show what was downloaded (nor delete from using --delete-after).
(ftp_loop): --spider will not HTML-ify listing.

Index: ftp.c
===
RCS file: /pack/anoncvs/wget/src/ftp.c,v
retrieving revision 1.61
diff -u -r1.61 ftp.c
--- ftp.c   2003/01/11 20:12:35 1.61
+++ ftp.c   2003/02/07 01:48:37
@@ -818,6 +818,9 @@
   expected_bytes = ftp_expected_bytes (ftp_last_respline);
 } /* cmd & DO_LIST */

+  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
+return RETRFINISHED;
+
   /* Some FTP servers return the total length of file after REST
  command, others just return the remaining size. */
   if (*len && restval && expected_bytes
@@ -828,9 +831,6 @@
 }

   /* If no transmission was required, then everything is OK.  */
-  if (!(cmd & (DO_LIST | DO_RETR)))
-return RETRFINISHED;
-
   if (!pasv_mode_open)  /* we are not using pasive mode so we need
  to accept */
 {
@@ -1153,7 +1153,8 @@
}
   /* Time?  */
   tms = time_str (NULL);
-  tmrate = retr_rate (len - restval, con->dltime, 0);
+  if (!opt.spider)
+    tmrate = retr_rate (len - restval, con->dltime, 0);

   /* If we get out of the switch above without continue'ing, we've
 successfully downloaded a file.  Remember this fact. */
@@ -1164,8 +1165,9 @@
  CLOSE (RBUF_FD (con->rbuf));
  rbuf_uninitialize (con->rbuf);
}
-  logprintf (LOG_VERBOSE, _("%s (%s) - `%s' saved [%ld]\n\n"),
-tms, tmrate, locf, len);
+  if (!opt.spider)
+    logprintf (LOG_VERBOSE, _("%s (%s) - `%s' saved [%ld]\n\n"),
+  tms, tmrate, locf, len);
   if (!opt.verbose && !opt.quiet)
{
  /* Need to hide the password from the URL.  The `if' is here
@@ -1192,7 +1194,7 @@
 by the more specific option --dont-remove-listing, and the code
 to do this deletion is in another function. */
}
-  else
+  else if (!opt.spider)
/* This is not a directory listing file. */
{
  /* Unlike directory listing files, don't pretend normal files weren't
@@ -1718,7 +1720,7 @@

   if (res == RETROK)
{
- if (opt.htmlify)
+ if (opt.htmlify && !opt.spider)
{
  char *filename = (opt.output_document
? xstrdup (opt.output_document)
Index: ChangeLog
===
RCS file: /pack/anoncvs/wget/src/ChangeLog,v
retrieving revision 1.417
diff -u -r1.417 ChangeLog
--- ChangeLog   2003/01/11 20:12:35 1.417
+++ ChangeLog   2003/02/07 01:49:49
@@ -1,3 +1,11 @@
+2003-02-06  Aaron Hawley [EMAIL PROTECTED]
+
+   * ftp.c
+   (getftp): --spider option should now work with FTP.
+   (ftp_loop_internal): --spider option will not calculate or
+   show what was downloaded (nor delete from using --delete-after).
+   (ftp_loop): --spider will not HTML-ify listing.
+
 2003-01-11  Ian Abbott [EMAIL PROTECTED]
 
* ftp.c (ftp_retrieve_glob): Reject insecure filenames as determined
