Re: robots.txt
Hi

> Why not just put robots=off in your .wgetrc?

Hey, hey. The robots.txt didn't just appear on the website; someone put it there and thought about it. What's in there has a good reason: you might be indexing old, doubled, or invalid data, or your indexing mechanism might loop on it, or crash the server. Who knows. Ask the webmaster or sysadmin before you 'hack' the site.

As for the User-Agent, most sites like to see a string with WinXX or IE or Explorer in it. Yes, I would call that 'fraud' :-)

luck,
*-pike

*--
I still maintain the point that designing a monolithic kernel in 1991 is a fundamental error. Be thankful you are not my student. You would not get a high grade for such a design.
        -- Andrew Tanenbaum to Linus Torvalds
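For reference, the two knobs discussed here look like this in a per-user .wgetrc (the User-Agent string below is only an illustrative example, not a recommendation):

    # ~/.wgetrc
    # ignore robots.txt and the nofollow meta tag
    robots = off
    # masquerade as a browser (illustrative string only)
    user_agent = Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

The same can be done for a single run from the command line:

    wget -e robots=off -U "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" -r http://example.com/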
Re: robots.txt
Hi!

> > Why not just put robots=off in your .wgetrc?
>
> hey hey the robots.txt didn't just appear in the website; someone's put
> it there and thought about it. what's in there has a good reason.

Well, from my own experience, the #1 reason is that webmasters do not want web grabbers of any kind to download the site, in order to force the visitor to browse the site interactively and thus click advertisement banners.

> you might be indexing old, doubled or invalid data,

That is cute: someone who believes that all people on the internet do what they do to make life easier for everyone. If you had said "one reason is", or even "one reason might be", instead of presenting it as the only reason, I would not be that cynical. Sorry.

> or your indexing mech might loop on it, or crash the server. who knows.

I have yet to find a site which forces wget into a loop as you describe. Others on the list can probably estimate the theoretical likelihood of such events.

> ask the webmaster or sysadmin before you 'hack' the site.

LOL! Hack! Please provide a serious definition of "to hack" that includes automatically downloading pages that could be downloaded with any interactive web browser. If the robots.txt said that no user agent may access the page, you would be right. But then: how would anyone know of the existence of this page?

[rant] Then again, maybe the page has a high percentage of CGI, JavaScript and iframes and thus only allows IE 6.0.123b to access the site. Then wget could maybe slow down the server, especially as it is probably a W-ows box :) But I ask: is this a bad thing? Whuahaha! [/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of robots.txt for mankind.

CU
Jens
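For anyone who has not looked at one: a robots.txt lives at the root of a site and asks crawlers, per user agent, to stay out of the listed paths. A minimal example that declares the whole site off limits to every robot, the case Jens mentions:

    # http://example.com/robots.txt
    User-agent: *
    Disallow: /

Note that the file is purely advisory; nothing enforces it, which is exactly why wget can be told to ignore it.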
Re: interesting bug
[EMAIL PROTECTED] wrote:
> I was using wget to suck a website, and found an interesting problem:
> some of the URLs it found contained a question mark, after which it
> responded with "cannot write to '... insert file/URL here?more text ...'
> (invalid argument)" and did not save any of those URLs to files (on my
> NTFS / Windows XP machine). It may also have said "Illegal filename".

Note that not all characters are allowed in Windows filenames, among them '?'. As '?' is quite common on data-driven websites, most Windows binaries have included a patch to deal with it. The latest wget release, 1.8.2, now includes such a patch. But the rest of the illegal characters are not dealt with, nor are other special Windows features.

> what can I do in order to spider/crawl these pages and save them to my
> local disk?

Use wget version 1.8.2.

Alex

--
Med venlig hilsen / Kind regards
Hack Kampbjørn
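To make the filename issue concrete, here is a minimal C sketch of the kind of escaping such a patch can do. This is not the actual wget code; the '@' + hex-code scheme and the function name are assumptions for illustration only:

    /* sanitize_windows_filename: a hypothetical sketch, NOT wget's real
     * patch.  Replaces each character that Windows forbids in filenames
     * ( \ / : * ? " < > | ) with '@' followed by its two-digit hex code,
     * so "index.php?page=2" becomes "index.php@3Fpage=2". */
    #include <stdio.h>
    #include <string.h>

    static void sanitize_windows_filename(const char *in, char *out,
                                          size_t outsize)
    {
        static const char illegal[] = "\\/:*?\"<>|";
        size_t o = 0;
        const char *p;

        for (p = in; *p != '\0' && o + 4 < outsize; p++) {
            if (strchr(illegal, *p))
                o += sprintf(out + o, "@%02X", (unsigned char) *p);
            else
                out[o++] = *p;
        }
        out[o] = '\0';
    }

    int main(void)
    {
        char buf[256];
        sanitize_windows_filename("index.php?page=2", buf, sizeof buf);
        printf("%s\n", buf);    /* prints: index.php@3Fpage=2 */
        return 0;
    }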
Re: robots.txt
Hi all of you.

Thank you for the lively discussion; I learned a lot from this thread. robots=off downloaded the site, and I can now coolly have a look at the text without having to be online.

Once again, thanks
Mettavihari

On Sun, Jun 09, 2002 at 09:10:48PM +0200, Jens Rösner wrote:
> [...]

A saying of the Buddha from http://metta.lk/
Verily, misers go not to the celestial realms. Fools do not indeed praise liberality. The wise man rejoices in giving and thereby becomes happy thereafter.
Random Dhammapada Verse 177
Ask advice of using Wget as the basis of a web crawler
Hi there,

I found that wget is a good tool for downloading HTML files. Now I want to build a tool similar to a web crawler, and I want to use wget as the basis. There are several open-source web crawlers, such as WebSPHINX, but I am not sure which one to choose as the basis. I prefer wget, since many people like it and the mailing list is so active that I get a lot of support.

Any advice?

Xuehua
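If wget is the basis, its recursive mode already does the core of what a crawler does: fetch a page, extract its links, and follow them. A minimal sketch of such an invocation (host, depth and patterns are placeholders):

    # crawl example.com two levels deep, HTML pages only,
    # waiting one second between requests to spare the server
    wget -r -l 2 --wait=1 -np -A "*.html,*.htm" http://example.com/

For extending the behaviour in code rather than via options, the recursive retrieval logic lives in src/recur.c in the wget source tree, which is a natural place to start reading.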