RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
How about introducing these changes in an effort to force Nutch admins to properly edit the bot identity strings?

1. Add the http.agent.* entries to nutch-site.xml with the value EDITME. The description should clearly state that these values *must* be edited to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the configuration. If any of the http.agent.* entries is still EDITME, the code would log the error and exit.

-kuro

P.S. I'm subscribed to the digest version of the mailing list. If the same or a better idea has already been raised, please ignore this.
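A minimal sketch of the proposed startup check. Nutch itself is Java and the real check would live in the crawler's configuration loading; this Python version only illustrates the logic (the property names come from the proposal, everything else is invented for illustration):

```python
# Refuse to start while any http.agent.* property still holds the
# EDITME placeholder, as proposed above. Illustrative sketch only.
PLACEHOLDER = "EDITME"

def unconfigured_agent_keys(conf):
    """Return the http.agent.* keys whose value is still the placeholder."""
    return sorted(k for k, v in conf.items()
                  if k.startswith("http.agent.") and v == PLACEHOLDER)

# Example: one property was edited, one was not.
conf = {
    "http.agent.name": "EDITME",
    "http.agent.url": "http://example.org/mybot.html",
}
bad = unconfigured_agent_keys(conf)
if bad:
    # The real crawler would log this and exit instead of printing.
    print("Please edit these properties in nutch-site.xml:", bad)
```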
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Michael,

Superb idea! And if those crawls could be distributed through a protocol like BitTorrent, it would spread out the load versus having a single bottleneck somewhere. I haven't thought it through, but here's some information (the PDF is the best place to start):

http://www.bittorrent.com/bittorrentecon.pdf
http://www.bittorrent.org/protocol.html

As you mention, trust is an issue. You'd want to prevent people who were not running Nutch from using the service to exchange non-crawl data. You'd also want some kind of trust list, maintained by the Nutch community and by individual Nutch operators, as to whose crawls you'd trust.

Would you divide up the work by site? Or by a URL hash? Would you exchange URL lists as well as crawls? Anyway, I bet an elegant solution can be crafted.

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 16, 2006 5:52 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Paul Sutter wrote:
> I think that Nutch has to solve the problem: if you leave the problem to the websites, they're more likely to cut you off than they are to implement their own index storage scheme. Besides, they'd get it wrong, have stale data, etc.

agreed

> Maybe what is needed is brainstorming on a shared crawling scheme implemented in Nutch. Maybe something based on a bittorrent-like protocol?

I am not sure if I understand, can you explain a bit?

What comes to my mind is a server (service) acting as an index pointer/referrer. Let's say I have indexed the NYT today; then I would notify this server about it, and also where the index can be retrieved from. So somebody else could first contact this server and check if somebody has recently indexed the NYT. Of course, one would have the problem of whether the index can be trusted.

Michi

> incrediBILL seems to have a pretty good point.
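The "divide by site or by URL hash" question above has a concrete trade-off that a small sketch can show. Everything here is illustrative, not Nutch code: hash-partitioning assigns each unit of work to one of N cooperating crawl nodes, and the choice of key decides where politeness has to be enforced.

```python
import hashlib

def node_for_site(url, n_nodes):
    """By-site split: every URL on one host maps to the same node, so
    per-host politeness (crawl delays) stays local to that node."""
    host = url.split("/")[2]  # naive host extraction for the sketch
    return int(hashlib.sha1(host.encode()).hexdigest(), 16) % n_nodes

def node_for_url(url, n_nodes):
    """By-URL split: spreads load more evenly, but several nodes may
    then hit the same host and must coordinate politeness somehow."""
    return int(hashlib.sha1(url.encode()).hexdigest(), 16) % n_nodes
```

With the by-site scheme, two pages on the same host always land on the same node; with the by-URL scheme they usually do not, which is exactly the trade-off Paul raises.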
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
That does sound fairly brilliant. One thing you'll have to keep in mind is that different plugins index different things, and sometimes the same things in different ways. You'll need to make sure that crawl data is labeled with both the plugins used and the versions of each of those plugins.

Just my 2 cents,
Jake.

-----Original Message-----
From: Paul Sutter [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 16, 2006 2:14 PM
To: nutch-dev@lucene.apache.org
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
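Jake's labeling requirement could look something like the following. This is a hypothetical record format, not anything Nutch defines; the conservative rule sketched here is that a consumer reuses shared crawl data only when the producer's plugin set and versions match its own exactly:

```python
import json

def segment_label(plugins):
    """Describe a shared crawl segment. `plugins` maps a plugin id to
    its version, e.g. {"parse-html": "0.8.0"} (names illustrative)."""
    return json.dumps({"plugins": plugins}, sort_keys=True)

def can_reuse(label, local_plugins):
    """Reuse a shared segment only if the producer's plugins and
    versions match ours exactly -- the safest possible policy."""
    return json.loads(label)["plugins"] == local_plugins
```

A looser policy (e.g. matching plugin names but tolerating patch-version differences) would increase reuse at the cost of occasional inconsistencies.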
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
I guess that's the middle-of-the-road approach, with the two extremes being raw data and a fully standardized approach. I agree that we should make some kind of open web directory or info. I think a decentralized approach will make it more difficult to distribute the data, whereas a centralized one exposes users to a single point of failure. Unless we do the hybrid suggested earlier: centralized pointers, decentralized details.

I also think that somewhere down the road we should allow individual users (as opposed to crawlers) to contribute. In the UK there is an initiative with a centralized database and decentralized crawlers (actually a competition of users and teams, where the centralized server provides the URLs to crawl). It uses a proprietary database. I think there should be some mechanism where contributors can also retrieve the indices.

-----Original Message-----
From: Vanderdray, Jacob [EMAIL PROTECTED]
Date: Fri, 16 Jun 2006 14:36:03
To: nutch-dev@lucene.apache.org
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
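The "centralized pointers, decentralized details" hybrid discussed in this thread can be sketched as a tiny registry that stores only who indexed what and when, while the index data itself stays with the contributors. All names and the freshness rule are invented for illustration:

```python
import time

class IndexRegistry:
    """Centralized pointers: site -> (when it was indexed, where the
    index lives). The index data itself is fetched peer-to-peer."""

    def __init__(self):
        self._entries = {}

    def announce(self, site, location, now=None):
        """A contributor reports a freshly built index for `site`."""
        self._entries[site] = (time.time() if now is None else now, location)

    def lookup(self, site, max_age_seconds, now=None):
        """Return a recent index location for `site`, or None,
        meaning the caller should crawl the site itself."""
        entry = self._entries.get(site)
        if entry is None:
            return None
        ts, location = entry
        current = time.time() if now is None else now
        return location if current - ts <= max_age_seconds else None
```

The registry is the single point of failure the message warns about, but it is cheap to replicate precisely because it holds only pointers, not crawl data.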
Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
I was going to suggest the same approach. Seems simple enough, and it would force the person to edit the config. What is entered in place of EDITME is another story, but maybe some code can enforce some rules on that, too.

Otis

----- Original Message -----
From: Teruhiko Kurosaka [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Friday, June 16, 2006 2:05:41 PM
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Doug Cutting wrote:
> http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html

well, I think incrediBILL has an argument, that people might really start excluding bots from their servers if it's becoming too much.

What might help is if the site would offer an index of itself, which should be smaller than the site itself. I am not sure if a standard exists for something like this. Basically the bot would ask the server if an index exists, where it is located, and what date it is from, and then the bot decides to download the index or otherwise starts crawling the site.

Michi

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com
http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61
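The bot-side decision Michael describes (download the offered index if it is fresher than our last crawl, otherwise crawl) can be sketched as follows. The descriptor format is entirely hypothetical, since as he notes no standard for this exists:

```python
import json
from datetime import datetime

def plan_fetch(descriptor_json, last_crawl):
    """Decide what to do for a site. `descriptor_json` is a hypothetical
    JSON blob the site serves, e.g.
    {"location": "http://site/index.tgz", "date": "2006-06-15"},
    or None if the site offers no index. `last_crawl` is a datetime of
    our last crawl of that site."""
    if descriptor_json is None:
        return ("crawl", None)          # no index offered
    desc = json.loads(descriptor_json)
    indexed = datetime.strptime(desc["date"], "%Y-%m-%d")
    if indexed > last_crawl:
        return ("download", desc["location"])
    return ("crawl", None)              # offered index is stale
```

This captures the bandwidth win: the bot only falls back to crawling when the site offers nothing newer than what the bot already has.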
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
In my company we changed the default, and many others probably did the same. However, we must not ignore the behavior of the irresponsible users of Nutch, and for that reason use of the default must be blocked in code. Just my 2 cents.

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
I think that Nutch has to solve the problem: if you leave the problem to the websites, they're more likely to cut you off than they are to implement their own index storage scheme. Besides, they'd get it wrong, have stale data, etc.

Maybe what is needed is brainstorming on a shared crawling scheme implemented in Nutch. Maybe something based on a bittorrent-like protocol?

incrediBILL seems to have a pretty good point.

-----Original Message-----
From: Michael Wechner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
Heh. Perhaps we should eliminate the default user-agent string? Then he'd have less of a target to aim at... :)

On a more serious note, it seems reasonable to require a customized bot URL at least. But publishing an email contact is questionable these days. Neither Y! nor G do it, precisely because it will just get spammed. (Wait until a spam-bot crawls this blogspot page and starts hammering nutch-agent...)

On Jun 14, 2006, at 1:03 PM, Doug Cutting wrote:
> http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html

--
Matt Kangas / [EMAIL PROTECTED]
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
The 'bot blocker' image server at blogspot is broken, so it's impossible to reply to this blog!

-----Original Message-----
From: Matt Kangas [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 14, 2006 10:38 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?