RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Teruhiko Kurosaka
How about introducing these changes in an effort to force the nutch admins
to properly edit the bot identity strings?
1. Add the http.agent.* entries to nutch-site.xml with the value being EDITME.
   The description should clearly state that these values *must* be edited
   to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the configuration.
   If any of the http.agent.* entries are EDITME, the code would log
   the error and exit.

-kuro
p.s. I'm subscribing to the digest version of the ML.  If the same or a better
idea has been raised already, please ignore this.
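The check in step 2 could be as small as the sketch below. The class and method names are invented for illustration; only the EDITME placeholder convention comes from the proposal above.

```java
// Hypothetical sketch of step 2: refuse to crawl while any http.agent.*
// value still carries the EDITME placeholder proposed for nutch-site.xml.
public class AgentConfigCheck {
    private static final String PLACEHOLDER = "EDITME";

    /** True if a configured http.agent.* value was left unedited. */
    static boolean isUnedited(String value) {
        return value == null || PLACEHOLDER.equals(value.trim());
    }

    public static void main(String[] args) {
        // The crawler would read these values from its configuration;
        // two sample values stand in here.
        System.out.println(isUnedited("EDITME"));    // placeholder: log error and exit
        System.out.println(isUnedited("MyCrawler")); // properly edited: proceed
    }
}
```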




RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Paul Sutter
Michael,

Superb idea! And if those crawls could be distributed through a protocol
like bittorrent, it would spread out the load versus having a single
bottleneck somewhere. I haven't thought it through, but here's some
information (the pdf is the best place to start).

http://www.bittorrent.com/bittorrentecon.pdf
http://www.bittorrent.org/protocol.html

As you mention, trust is an issue. You'd want to prevent people who were not
running nutch from using the service to exchange non-crawl data. You'd also
want to have some kind of trust list that could be maintained by the nutch
community, and by individual nutches, as to whose crawls you'd trust. 

Would you divide up the work by site? Or by a URL hash? Would you exchange
URL lists as well as crawls? 
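Dividing the work by URL hash could look something like the following minimal sketch. The modulo-based assignment and the class name are assumptions for illustration, not an existing Nutch facility.

```java
// Hypothetical work division for a shared crawl: assign each key to one of
// N cooperating crawlers by hashing. Hashing the host name keeps a whole
// site with one crawler (division by site); hashing the full URL spreads a
// site across crawlers (division by URL hash).
public class CrawlPartitioner {
    private final int numCrawlers;

    public CrawlPartitioner(int numCrawlers) {
        this.numCrawlers = numCrawlers;
    }

    /** Which crawler owns this key (a host name or a full URL). */
    public int ownerOf(String key) {
        // Math.floorMod avoids negative results from negative hash codes.
        return Math.floorMod(key.hashCode(), numCrawlers);
    }

    public static void main(String[] args) {
        CrawlPartitioner p = new CrawlPartitioner(4);
        // The assignment is deterministic, so every crawler agrees on ownership.
        System.out.println(p.ownerOf("www.example.com") == p.ownerOf("www.example.com"));
    }
}
```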

Anyway, I bet an elegant solution can be crafted.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 16, 2006 5:52 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Paul Sutter wrote:
 I think that Nutch has to solve the problem: if you leave the problem to the
 websites, they're more likely to cut you off than they are to implement
 their own index storage scheme. Besides, they'd get it wrong, have stale
 data, etc.

agreed

 Maybe what is needed is brainstorming on a shared crawling scheme
 implemented in Nutch. Maybe something based on a bittorrent-like protocol?

I am not sure if I understand. Can you explain a bit?

What comes to my mind is a server (service) acting as an index
pointer/referrer.

Let's say I have indexed the NYT today; then I would notify this server
about it, and also about where the index can be retrieved from. So somebody
else could first contact this server and check whether somebody has recently
indexed the NYT. Of course, one would have the problem of whether the index
can be trusted.
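The index-pointer service described above might be sketched as a simple registry: crawlers announce (site, index location, timestamp), and others query it before crawling. All names and the API here are invented for illustration.

```java
// Hypothetical sketch of the "index pointer" server: a lookup keyed by
// site, answering "who indexed this recently, and where is the index?"
import java.util.HashMap;
import java.util.Map;

public class IndexRegistry {
    /** Where a site's index lives and when it was built. */
    public static final class IndexPointer {
        final String location;
        final long indexedAtMillis;
        IndexPointer(String location, long indexedAtMillis) {
            this.location = location;
            this.indexedAtMillis = indexedAtMillis;
        }
    }

    private final Map<String, IndexPointer> pointers = new HashMap<>();

    /** A crawler announces that it has indexed a site. */
    public void announce(String site, String location, long indexedAtMillis) {
        pointers.put(site, new IndexPointer(location, indexedAtMillis));
    }

    /** Returns a pointer if the site was indexed within maxAgeMillis, else null. */
    public IndexPointer lookup(String site, long nowMillis, long maxAgeMillis) {
        IndexPointer p = pointers.get(site);
        if (p == null || nowMillis - p.indexedAtMillis > maxAgeMillis) {
            return null;
        }
        return p;
    }

    public static void main(String[] args) {
        IndexRegistry registry = new IndexRegistry();
        registry.announce("nytimes.com", "http://example.org/indexes/nyt", 1000L);
        // Fresh enough: reuse the existing index instead of re-crawling.
        System.out.println(registry.lookup("nytimes.com", 2000L, 5000L) != null);
        // Too stale: crawl it yourself.
        System.out.println(registry.lookup("nytimes.com", 9000L, 5000L) != null);
    }
}
```

The trust problem Michael raises is untouched here; a real service would need some vetting of who may announce, as Paul suggests below.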


Michi

-- 
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61



RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread Vanderdray, Jacob
That does sound fairly brilliant.  One thing you'll have to keep
in mind is that different plugins index different things and sometimes
the same things in different ways.  You'll need to make sure that crawl
data is labeled with both the plugins used and the versions of each of
the plugins.

Just my 2 cents,
Jake.

-Original Message-
From: Paul Sutter [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 16, 2006 2:14 PM
To: nutch-dev@lucene.apache.org
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?



RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread peter decrem
I guess that's the middle-of-the-road approach, the two extremes being raw
data and a fully standardized approach.

I agree that we should make some kind of open web directory or info. I think
a decentralized approach will make it more difficult to distribute the data,
whereas a centralized one exposes users to a single point of failure. Unless
we do the hybrid suggested earlier: centralized pointers, decentralized
details.

I also think that somewhere down the road we should allow individual users
(as opposed to crawlers) to contribute.

In the UK there is an initiative with a centralized database and
decentralized crawlers (actually a competition of users and teams, where a
centralized server provides the URLs to crawl). It uses a proprietary
database. I think there should be some mechanism whereby contributors can
also retrieve the indices.

-Original Message-
From: Vanderdray, Jacob [EMAIL PROTECTED]
Date: Fri, 16 Jun 2006 14:36:03 
To: nutch-dev@lucene.apache.org
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?





Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-16 Thread ogjunk-nutch
I was going to suggest the same approach.  Seems simple enough and would force 
the person to edit the config.  What is entered in place of EDITME is another 
story, but maybe some code can enforce some rules on that, too.

Otis

- Original Message 
From: Teruhiko Kurosaka [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Friday, June 16, 2006 2:05:41 PM
Subject: RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?








Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Michael Wechner

Doug Cutting wrote:
http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html 



well, I think incrediBILL has an argument: people might really start
excluding bots from their servers if it becomes too much. What might help is
if incrediBILL (i.e. the site itself) would offer an index of the site,
which should be smaller than the site itself. I am not sure whether a
standard for something like this exists. Basically, the bot would ask the
server whether an index exists, where it is located, and what date it is
from, and then the bot would decide to download the index or otherwise start
crawling the site.
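A bot-side probe for such a scheme might look like the sketch below. The well-known path "/site-index.info" is an invented placeholder; as noted above, no standard for advertising a site's index exists.

```java
// Hypothetical "site-offered index" probe: ask the server whether a
// prebuilt index exists and how fresh it is, then decide whether to fetch
// the index or fall back to crawling the site.
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class IndexProbe {
    /** Last-Modified time of the site's advertised index, or -1 if none. */
    static long indexLastModified(String site) throws IOException {
        URL url = new URL(site + "/site-index.info"); // invented well-known location
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return conn.getResponseCode() == 200 ? conn.getLastModified() : -1;
        } finally {
            conn.disconnect();
        }
    }

    /** Download the index only if it exists and is newer than our last crawl. */
    static boolean shouldDownloadIndex(long indexModified, long lastCrawlMillis) {
        return indexModified > lastCrawlMillis;
    }

    public static void main(String[] args) {
        // No network needed to demonstrate the decision rule itself.
        System.out.println(shouldDownloadIndex(2000L, 1000L)); // fresher index: download
        System.out.println(shouldDownloadIndex(-1L, 1000L));   // none advertised: crawl
    }
}
```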

Michi

--
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61



RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Gal Nitzan
In my company we changed the default, and many others probably did the same.
However, we must not ignore the behavior of the irresponsible users of
Nutch, and for that reason use of the default must be blocked in code.

Just my 2 cents.


-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?






RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-15 Thread Paul Sutter
I think that Nutch has to solve the problem: if you leave the problem to the
websites, they're more likely to cut you off than they are to implement
their own index storage scheme. Besides, they'd get it wrong, have stale
data, etc.

Maybe what is needed is brainstorming on a shared crawling scheme
implemented in Nutch. Maybe something based on a bittorrent-like protocol? 

incrediBILL seems to have a pretty good point.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?




Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-14 Thread Matt Kangas
Heh. Perhaps we should eliminate the default user-agent string? Then  
he'd have less of a target to aim at... :)


On a more serious note, it seems reasonable to require a customized  
bot URL at least. But publishing an email contact is questionable  
these days. Neither Y! nor G do it, precisely because it will just  
get spammed. (Wait until a spam-bot crawls this blogspot page and  
starts hammering nutch-agent...)



On Jun 14, 2006, at 1:03 PM, Doug Cutting wrote:

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html


--
Matt Kangas / [EMAIL PROTECTED]





RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-14 Thread Wootton, Alan
The 'bot blocker' image server at blogspot is broken, so it's impossible to
reply to this blog!

-Original Message-
From: Matt Kangas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 14, 2006 10:38 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH
Nutch?

