RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Paul Sutter
Michael,

Superb idea! And if those crawls could be distributed through a protocol
like bittorrent, it would spread out the load versus having a single
bottleneck somewhere. I haven't thought it through, but here's some
information (the PDF is the best place to start).

http://www.bittorrent.com/bittorrentecon.pdf
http://www.bittorrent.org/protocol.html

As you mention, trust is an issue. You'd want to prevent people who were not
running nutch from using the service to exchange non-crawl data. You'd also
want to have some kind of trust list that could be maintained by the nutch
community, and by individual nutches, as to whose crawls you'd trust. 
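To make the idea concrete, here is a toy sketch (all names are hypothetical, not anything in Nutch) of how a community trust list and a per-node list could combine: a node accepts crawls only from peers on the community list, further restricted by its own local list if it maintains one.

```python
def trusted_peers(community_list, local_list=None):
    """Peers whose crawls we would accept: the community-maintained
    list, optionally intersected with a locally maintained list."""
    trusted = set(community_list)
    if local_list is not None:
        trusted &= set(local_list)
    return trusted

community = {"crawler-a.example", "crawler-b.example", "crawler-c.example"}
local = {"crawler-a.example", "crawler-c.example"}
# Only peers on both lists are trusted by this node:
assert trusted_peers(community, local) == {"crawler-a.example",
                                           "crawler-c.example"}
```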

Would you divide up the work by site? Or by a URL hash? Would you exchange
URL lists as well as crawls? 
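For the by-site option, one rough sketch (Python, hypothetical names; only illustrative, since Nutch itself is Java) would be to hash the host so that every URL from one site always lands on the same crawler, which keeps per-site politeness state in one place:

```python
import hashlib

def assign_crawler(url: str, num_crawlers: int) -> int:
    """Assign a URL to a crawler by hashing its host, so that a given
    site is always fetched by the same crawler."""
    # Strip the scheme, then take everything up to the first slash.
    host = url.split("//", 1)[-1].split("/", 1)[0].lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# All URLs from the same host map to the same crawler:
a = assign_crawler("http://www.nytimes.com/2006/06/15/a.html", 8)
b = assign_crawler("http://www.nytimes.com/business/b.html", 8)
assert a == b
```

Hashing by full URL instead would balance load more evenly but scatter one site across many crawlers, which makes politeness harder.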

Anyway, I bet an elegant solution can be crafted.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 16, 2006 5:52 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Paul Sutter wrote:
 I think that Nutch has to solve the problem: if you leave the problem to the
 websites, they're more likely to cut you off than they are to implement
 their own index storage scheme. Besides, they'd get it wrong, have stale
 data, etc.

agreed
 Maybe what is needed is brainstorming on a shared crawling scheme
 implemented in Nutch. Maybe something based on a bittorrent-like protocol?

I am not sure I understand; can you explain a bit?

What comes to my mind is a server (service) acting as an index 
pointer/referrer.

Let's say I have indexed the NYT today. I would then notify this server 
about it, and also tell it where the index can be retrieved from. So 
somebody else could first contact this server and check whether somebody 
has recently indexed the NYT. Of course, one would then face the problem 
of whether the index can be trusted.
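As a toy sketch of such a pointer service (all names hypothetical; a real service would also need the authentication and trust machinery discussed above), a registry could simply remember the freshest announced index per site and answer lookups only while that index is recent:

```python
from datetime import datetime, timezone

class IndexRegistry:
    """Toy index-pointer service: remembers who indexed which site,
    when, and where the resulting index can be fetched from."""

    def __init__(self):
        self._entries = {}  # site -> (indexed_at, index_url)

    def announce(self, site, index_url, indexed_at):
        current = self._entries.get(site)
        # Keep only the freshest announcement for each site.
        if current is None or indexed_at > current[0]:
            self._entries[site] = (indexed_at, index_url)

    def lookup(self, site, max_age_days=1):
        """Return the index URL if a recent index exists, else None."""
        entry = self._entries.get(site)
        if entry is None:
            return None
        indexed_at, index_url = entry
        age = datetime.now(timezone.utc) - indexed_at
        return index_url if age.days < max_age_days else None

registry = IndexRegistry()
registry.announce("nytimes.com", "http://peer.example/indexes/nyt",
                  datetime.now(timezone.utc))
print(registry.lookup("nytimes.com"))  # fresh, so the index URL is returned
```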


Michi
 incrediBILL seems to have a pretty good point.

 -Original Message-
 From: Michael Wechner [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, June 15, 2006 12:30 AM
 To: nutch-dev@lucene.apache.org
 Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

 Doug Cutting wrote:
   

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html
 
 well, I think incrediBILL has a point: people might really start 
 excluding bots from their servers if it becomes too much. What might 
 help is if incrediBILL offered an index of the site, which should be 
 smaller than the site itself. I am not sure whether a standard for 
 something like this already exists. Basically, the bot would ask the 
 server whether an index exists, where it is located, and what date it 
 is from, and then decide either to download the index or to start 
 crawling the site.
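 The download-or-crawl decision described above could be sketched like this (hypothetical names; there is no such protocol in Nutch, this is just the logic made explicit):

```python
from datetime import datetime, timedelta

def choose_action(index_location, index_date, now,
                  max_age=timedelta(days=7)):
    """Decide whether to fetch a site's published index or crawl it.
    index_location/index_date come from asking the server; either may
    be None if the site publishes no index."""
    if index_location is None or index_date is None:
        return ("crawl", None)           # no index on offer
    if now - index_date > max_age:
        return ("crawl", None)           # index exists but is stale
    return ("download", index_location)  # fresh index: skip crawling

now = datetime(2006, 6, 15)
assert choose_action(None, None, now) == ("crawl", None)
assert choose_action("http://site.example/index.tgz",
                     datetime(2006, 6, 14), now) \
       == ("download", "http://site.example/index.tgz")
```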

 Michi

   


-- 
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61



RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Paul Sutter
I think that Nutch has to solve the problem: if you leave the problem to the
websites, they're more likely to cut you off than they are to implement
their own index storage scheme. Besides, they'd get it wrong, have stale
data, etc.

Maybe what is needed is brainstorming on a shared crawling scheme
implemented in Nutch. Maybe something based on a bittorrent-like protocol? 

incrediBILL seems to have a pretty good point.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html


well, I think incrediBILL has a point: people might really start 
excluding bots from their servers if it becomes too much. What might 
help is if incrediBILL offered an index of the site, which should be 
smaller than the site itself. I am not sure whether a standard for 
something like this already exists. Basically, the bot would ask the 
server whether an index exists, where it is located, and what date it 
is from, and then decide either to download the index or to start 
crawling the site.

Michi

-- 
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61