Re: robots.txt

2002-06-09 Thread Pike

Hi

> Why not just put robots=off in your .wgetrc?

hey hey
the robots.txt didn't just appear on the website; someone
put it there and thought about it. what's in there is there for a good reason.
you might be indexing old, duplicated or invalid data, or your
indexing mechanism might loop on it, or crash the server. who knows.
ask the webmaster or sysadmin before you 'hack' the site.
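
For reference, the setting under discussion is a one-liner; a minimal
sketch, assuming a per-user ~/.wgetrc (the URL is a placeholder):

    # ~/.wgetrc -- tell wget to ignore robots.txt entirely (use with care)
    robots = off

    # equivalent one-off form, without touching .wgetrc:
    #   wget -e robots=off -r http://example.com/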

> As for User Agent, most sites like to see a string with WinXX or IE or
> Explorer in it.

yes, I would call that 'fraud' :-)
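
For the record, sending such a string is a single documented option; a
sketch (the user-agent value and URL are made up):

    # impersonate an old IE-on-Windows browser -- the 'fraud' in question
    wget --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" \
         http://example.com/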

luck,
*-pike

*--
I still maintain the point that designing a monolithic kernel in 1991 is a 
fundamental error. Be thankful you are not my student. You would not get a 
high grade for such a design.

Andrew Tanenbaum to Linus Torvalds




Re: robots.txt

2002-06-09 Thread Jens Rösner

Hi!

>> Why not just put robots=off in your .wgetrc?
> hey hey
> the robots.txt didn't just appear on the website; someone
> put it there and thought about it. what's in there is there for a good reason.
Well, from my own experience, the #1 reason is that webmasters
do not want webgrabbers of any kind to download the site, in order to
force the visitor to interactively browse the site and thus click
advertisement banners.

> The only reason is
> you might be indexing old, duplicated or invalid data,
That is cute: someone who believes that all people on the
internet do what they do to make life easier for everyone.
If you had said "one reason is" or even "one reason might be",
I would not be that cynical, sorry.

> or your indexing mechanism might loop on it, or crash the server. who knows.
I have yet to find a site which forces wget into a loop as you said.
Others on the list can probably estimate the theoretical likelihood of
such events.

> ask the webmaster or sysadmin before you 'hack' the site.
LOL!
Hack! Please provide a serious definition of "to hack" that includes
automatically downloading pages that could be downloaded with any
interactive web browser.
If the robots.txt said that no user-agent may access the page, you would
be right.
But then: how would anyone know of the existence of this page?
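
For reference, such a blanket ban is two lines of standard robots.txt
syntax:

    # robots.txt -- forbid every robot from the whole site
    User-agent: *
    Disallow: /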
[rant]
Then again, maybe the page has a high percentage of CGI, JavaScript and
iFrames and thus only allows IE 6.0.123b to access the site. Then wget
could maybe slow down the server, especially as it is probably a
w-ows box :) But I ask: is this a bad thing?
Whuahaha!
[/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of
robots.txt for mankind.

CU
Jens



Re: interesting bug

2002-06-09 Thread Hack Kampbjørn

[EMAIL PROTECTED] wrote:
>
> I was using wget to suck a website, and found an interesting problem:
> some of the URLs it found contained a question mark, after which it
> responded with "cannot write to '... insert file/URL here?more
> text ...' (invalid argument)".
>
> And - it didn't save any of those URLs to files (on my NTFS/Windows XP
> machine) ...

It may also have said "Illegal filename". Note that not all characters
are allowed in Windows filenames, among them '?'. As '?' is quite common
on data-driven web sites, most Windows binaries have included a patch to
deal with it.

The latest wget release, 1.8.2, now includes such a patch. But the rest
of the illegal characters are not dealt with, nor are other special
Windows features.
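
As an illustration of what such a patch has to do (a sketch only, not
wget's actual code), map the characters Windows forbids in filenames to
a harmless placeholder before saving:

    # illustrative shell sketch: replace the characters Windows forbids
    # in filenames ( \ / : * ? " < > | ) with '@'
    sanitize() {
        printf '%s' "$1" | tr '\\/:*?"<>|' '@@@@@@@@@'
    }
    sanitize 'index.php?page=3'   # -> index.php@page=3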
 
> what can I do in order to spider/crawl these pages and save them to my
> local disk?

Use wget version 1.8.2
 
> Alex

-- 
Med venlig hilsen / Kind regards

Hack Kampbjørn



Re: robots.txt

2002-06-09 Thread rsync


Hi all of you.

Thank you for the lively discussion.
I learned a lot from this thread.
robots=off downloaded the site,
and I can now calmly have a look at the text without
having to be online.

Once again thanks
Mettavihari

 
A saying of the Buddha from http://metta.lk/ 
 
Verily, misers go not to the celestial realms. Fools do not indeed praise liberality. 
The wise man rejoices in giving and thereby becomes happy thereafter. 
Random Dhammapada Verse 177  
 




Ask advice of using Wget as the basis of a web crawler

2002-06-09 Thread Xuehua Shen

Hi there,
 I found that wget is a good tool for downloading HTML
files. Now I want to build a tool similar to a web crawler
and want to use wget as the basis. There are
several open-source web crawlers, such as
websphinx, but I am not sure which one to choose as the
basis. I prefer wget, since many guys like it and the
mailing list is so active that I get a lot of support.
 Any advice?
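
One low-tech starting point, as a sketch: drive wget's own documented
recursion options from a wrapper (URL, depth and delay below are
placeholders):

    # use wget itself as a very small crawler
    wget --recursive --level=3 \
         --wait=1 \
         --directory-prefix=crawl-data \
         http://example.com/

Anything smarter (your own URL frontier, deduplication, per-host
politeness) would mean fetching page by page from your own queue instead
of relying on wget's recursion.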

Xuehua

__
Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup
http://fifaworldcup.yahoo.com