That work[ed/s]... thanks!  I had actually started looking at robots.txt
as a possible problem and saw that the site blocked robots at the root.
I looked through the manpage for a simple switch to turn the check off,
but didn't see one.  I started thinking about workarounds, like having
squid return a 404 for any file named "robots.txt", which really seemed
like an ugly hack (in terms of both social and functional degradation).

But since this site is designed as a user-download site, it doesn't seem it
would be "desired" to block _human_ downloading.  While wget can be
used for mirroring a site, maybe, if it is launched from a TTY or its
controlling terminal is a TTY (vs. launched from a script), _maybe_ (?)
it shouldn't look at robots.txt, since a human wanting to download
a bunch of files from a given site isn't exactly the same as a robot
or search engine spidering the content of your website.
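As an aside, the robots.txt check itself is easy to reproduce outside of wget.  A minimal Python sketch, using a made-up rule set (not the actual file served by mirrors.kernel.org), showing how a root-level Disallow blocks every path for a robots-respecting client:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: a site blocking all robots at the
# root, which is what a user-download mirror might (arguably wrongly) do.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# Under such a policy, a robots-respecting client refuses every path,
# including a plain directory of RPMs a human might want.
print(rp.can_fetch("Wget", "/suse/i386/9.3/suse/i586/"))
```

The User-agent name and path here are illustrative only; the point is that one blanket Disallow stops recursive retrieval everywhere.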

Regardless of whether wget is invoked by a human or an automated script,
do you think it might be a "desirable" feature for wget to display
the reason why it isn't downloading files -- i.e. a warning explaining
"why" the files were not downloaded, something along the lines of
"automatic recursive retrieval turned off due to restrictions in 'robots.txt'"?

At the very least it might prevent ignorant-user questions.  Perhaps
the simplest check -- whether log output goes to a "tty" (i.e. the same
test that decides between "dot" and "bar" style progress output) --
would make a simple test case.  Slightly more work (a different "if"
condition) would be deciding whether wget was invoked from an
interactive tty or not.
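The test I have in mind is just the standard isatty() check.  A rough Python sketch of the idea (this is my own wording and logic, not wget's actual code; the `robots_warning` name and the stream parameter are invented for illustration):

```python
import sys

def robots_warning(blocked_by_robots, stream=None):
    """Return the proposed warning only when `stream` is an interactive
    terminal -- the same isatty() test that picks "dot" vs. "bar"
    progress output.  Returns None when output is being redirected,
    e.g. when run from a script or cron job."""
    stream = stream or sys.stderr
    if blocked_by_robots and stream.isatty():
        return ("automatic recursive retrieval turned off "
                "due to restrictions in 'robots.txt'")
    return None
```

Passing the stream in explicitly keeps the check testable; wget itself would presumably test its own log descriptor directly.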

Sorry for so many questions/suggestions.  I don't know how busy/swamped
the wget developers are with "real life" work, or how much time is even
available for any of these changes...  I've thought about modifying wget
more than once in the past, but put it off as too much work to get into;
it was easier to use workarounds (like a Windows tool I purchased years
ago, "Teleport-Pro" (http://www.tenmax.com/teleport/pro/features.htm),
that I still get free upgrades on).  The downside is that if I am
downloading Linux files, it's a bit "perverse" to run T-P on my Windows
machine: the download comes in through my Linux server, and then (for my
Linux files) gets saved back onto that same server via Windows networking
talking to the Linux-based "samba/cifs" service.
One of the "items" I've often wondered about (in terms of difficulty to
include) would be multi-threaded (or multi-process) downloading, where
each process would enqueue parsed links onto a common queue, and
download "threads" would consume items to download off of that queue.
There are obvious upsides and downsides to such a feature -- especially
the _potential_ for abuse by overloading a server -- but with judicious
use it could be a powerful addition.
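The queue scheme above can be sketched with a thread-safe queue and a small worker pool.  This is a toy model only -- the `fetch` function is a caller-supplied stub standing in for real HTTP retrieval, and the link-parsing/re-enqueueing step a real crawler needs is omitted:

```python
import queue
import threading

def run_pool(urls, fetch, num_workers=4):
    """Retrieve `urls` with `num_workers` consumer threads sharing one queue.

    In a real crawler each worker would also parse the fetched page and
    push newly discovered links back onto the same queue; here the work
    list is fixed up front to keep the sketch short."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = work.get()
            if url is None:            # sentinel: shut this worker down
                work.task_done()
                return
            data = fetch(url)
            with lock:                 # results list is shared
                results.append((url, data))
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for u in urls:
        work.put(u)
    for _ in threads:
        work.put(None)                 # one sentinel per worker
    work.join()
    for t in threads:
        t.join()
    return results
```

Capping `num_workers` (and ideally adding a per-host limit) is what keeps this from becoming the server-overloading abuse mentioned above.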

However, my involvement in computers came about because I am inherently
lazy.  I.e. -- I could program computers to do tasks, once, and free
myself from doing dull repetitive tasks manually :-).  Along the same
lines, I try not to implement my own version of things if I can find
them elsewhere for a low enough price... (when I bought TP, it might
have been far enough back that it was only $29, vs. the $39 now), and
even in "shareware" mode it still works for small downloads (the number
of files downloaded is limited in shareware mode).

TP does not appear to have the "page-requisites" feature of wget...

There are other features missing from TP (and not in wget either, as far
as I know) that I would like to see... like the ability to "merge" output
from separate downloads.  TP keeps everything in a proprietary database
format, and has no way of deleting hostname directories or cutting
directory levels off of a tree, etc.  These are areas where wget has
better features than TP.  But on sites with many "small" files, the
overall download speed in single-threaded mode can be less than
1/8th-1/10th of my bandwidth, due to the overhead of waiting for server
responses.

Linda


Hrvoje Niksic wrote:

Linda Walsh <[EMAIL PROTECTED]> writes:
But I've tried various combinations to download the "rpms" in the
directory:
wget -r -nH http://mirrors.kernel.org/suse/i386/9.3/suse/i586
wget -r -nH http://mirrors.kernel.org/suse/i386/9.3/suse/i586/.
   (both just download an index.html file)

You need to use `-e robots=off'.
