RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
Easiest is to set the signature to TextProfileSignature and delete duplicates 
from the index, but you will still crawl and waste resources on them. Or are 
you by any chance trying to prevent spider traps from being crawled?
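
For reference, that switch is a one-property change in conf/nutch-site.xml; a minimal sketch, assuming the stock db.signature.class property and the bundled org.apache.nutch.crawl.TextProfileSignature class:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Fuzzy text-profile signature: near-identical pages get
  equal signatures, so the dedup step can drop them from the index.
  </description>
</property>

With equal signatures in place the dedup step can remove the duplicates, though as noted the pages still get fetched first.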

 
 


RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
Hi - this is not going to work. The URLFilter interface operates on single URLs 
only; it is not aware of content, nor of possible metadata (simhash) attached 
to the CrawlDatum. It would be more straightforward to implement Signature and 
calculate the simhash there. Now, Nutch has a DeduplicationJob, but it operates 
on equal signatures as the MapReduce key, and this is not going to work with 
simhashes. I remember there was a trick to get similar hashes into the same key 
buckets by emitting them to multiple buckets from the mapper, so that in the 
reducer you can do a Sørensen similarity on the hashes.
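
A minimal sketch of that Signature route, assuming the org.apache.nutch.crawl.Signature contract from Nutch 1.x; simhash64() is a hypothetical placeholder for your own Charikar implementation:

import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class SimhashSignature extends Signature {
  @Override
  public byte[] calculate(Content content, Parse parse) {
    // parse.getText() is the extracted plain text of the page.
    long hash = simhash64(parse.getText());
    byte[] bytes = new byte[8];
    for (int i = 0; i < 8; i++) {
      bytes[i] = (byte) (hash >>> (8 * (7 - i)));  // big-endian encoding
    }
    return bytes;
  }

  // Placeholder: plug in your own Charikar simhash implementation here.
  private long simhash64(String text) {
    return text == null ? 0L : text.hashCode();
  }
}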

This is really tricky stuff, especially getting the hashes in the same bucket.
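
The trick is essentially LSH-style banding; a hedged sketch of the mapper side, assuming the per-URL simhash is already available as the map input value (the job wiring and the 4x16-bit band split are assumptions, not something Nutch ships):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimhashBandMapper extends Mapper<Text, LongWritable, Text, Text> {
  private static final int BANDS = 4;       // 4 bands of 16 bits over a 64-bit hash
  private static final int BAND_BITS = 16;

  @Override
  protected void map(Text url, LongWritable simhash, Context context)
      throws IOException, InterruptedException {
    long hash = simhash.get();
    // Emit once per band: two hashes that agree on any single band
    // land under the same reducer key and can be compared there.
    for (int band = 0; band < BANDS; band++) {
      long bandValue = (hash >>> (band * BAND_BITS)) & 0xFFFFL;
      context.write(new Text(band + ":" + bandValue),
          new Text(url.toString() + "\t" + Long.toHexString(hash)));
    }
  }
}

The reducer then compares every pair of hashes that share a band key, so two near-duplicates only need to agree on one band to be examined together.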

Are you doing this to remove duplicates from search results? Then it might be 
easier to implement the Sørensen similarity in a custom Lucene collector. 
Because the top docs contain the duplicates, they pass through the same 
collector implementation, which gives you a single point to remove them. The 
problem then is that it won't really work with distributed search, unless you 
hash similar URLs to the same shard, but the cluster then becomes unbalanced 
and difficult to manage, plus IDF and norms become skewed.
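
For the comparison itself, a Sørensen-Dice coefficient over the set bits of two simhashes is only a few lines; a sketch, treating each set bit position as a set member (one plausible reading of doing a Sørensen similarity on the hashes):

public final class SimhashSimilarity {
  private SimhashSimilarity() {}

  // Sørensen-Dice coefficient over the set-bit positions of two 64-bit hashes.
  public static double dice(long a, long b) {
    int common = Long.bitCount(a & b);            // bit positions set in both
    int total = Long.bitCount(a) + Long.bitCount(b);
    return total == 0 ? 1.0 : (2.0 * common) / total;
  }

  public static void main(String[] args) {
    long h1 = 0xCAFEBABEDEADBEEFL;
    long h2 = 0xCAFEBABEDEADBEE0L;                // differs only in low bits
    System.out.println(dice(h1, h2));             // near 1.0 => near-duplicate
  }
}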

Good luck - we have tried many different approaches to this problem, especially 
online deduplication. But offline is also hard because of the reducer-key 
problem described above.

Markus

 
 


Re: URL filter plugins for nutch

2015-02-18 Thread Madan Patil
This is really tricky given the familiarity I have with Nutch. I will
try the Sørensen approach as you suggested.
Thanks for the input, Markus.

Regards,
Madan Patil




Re: about indexing to multiple solr servers

2015-02-18 Thread Lewis John Mcgibbney
Hi Eyeris,

On Wed, Feb 18, 2015 at 12:10 PM, user-digest-h...@nutch.apache.org wrote:


 I have a question and sorry if it is a trivial thing.
 Is there any way to index into multiple Solr servers (at least 2) using Nutch
 1.9?

 I have configured Solr with one master and 2 slaves, but I need 2 masters
 and 2 slaves; the problem is how Nutch can index into more than one Solr.
 If I have only one Solr master and it fails, that is a problem.

 Any advice or post will be appreciated.


There are a couple of open issues with patches you can reference.
https://issues.apache.org/jira/browse/NUTCH-1480
https://issues.apache.org/jira/browse/NUTCH-945
I think actually re-writing some of these patches to apply to trunk 1.10
would be really fantastic.
If you make any progress on this then please comment on those issues.
Thanks
Lewis


URL filter plugins for nutch

2015-02-18 Thread Madan Patil
Hi,

I am working on an assignment where I am supposed to use Nutch to crawl
Antarctic data.
I am writing a plugin which extends URLFilter so as not to crawl duplicate
(exact and near-duplicate) URLs. All the plugins, the default ones and others
on the web, see only one URL: they decide what to do or not to do based on
the content of that one URL.

Could anyone point me to resources which would help me compare the content
of one URL with the ones already crawled?

Thanks in advance.

Regards,
Madan Patil


Re: URL filter plugins for nutch

2015-02-18 Thread Madan Patil
Hi Markus,

I am looking for the ones with similar content.

Regards,
Madan Patil



Re: URL filter plugins for nutch

2015-02-18 Thread Madan Patil
I am not sure if I understand you right, but here is what I am trying to
implement:

I have implemented Charikar's simhash and now want to use it to detect
near-duplicates/duplicates.
I would like to make it a plugin (which implements the URLFilter interface),
and hence filter all those URLs whose content is nearly the same as ones
that have already been fetched. Would this be possible, or am I heading in
the wrong direction?
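
For context, one common formulation of Charikar's simhash, sketched here for illustration rather than as the implementation being discussed (the whitespace tokenizer and FNV-1a token hash are deliberate simplifications):

import java.util.StringTokenizer;

public final class Simhash {
  private Simhash() {}

  // 64-bit Charikar simhash over whitespace-separated tokens.
  public static long simhash64(String text) {
    int[] counts = new int[64];
    StringTokenizer tokens = new StringTokenizer(text);
    while (tokens.hasMoreTokens()) {
      long h = fnv1a64(tokens.nextToken());            // 64-bit hash per token
      for (int i = 0; i < 64; i++) {
        counts[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;  // +1/-1 vote per bit
      }
    }
    long result = 0L;
    for (int i = 0; i < 64; i++) {
      if (counts[i] > 0) result |= 1L << i;            // majority vote per bit
    }
    return result;
  }

  private static long fnv1a64(String s) {
    long hash = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      hash ^= s.charAt(i);
      hash *= 0x100000001b3L;
    }
    return hash;
  }
}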

Thanks for your patience, Markus.


Regards,
Madan Patil



RE: URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
By near-duplicate you mean similar URLs, or URLs with similar content?
 


Re: [MASSMAIL]URL filter plugins for nutch

2015-02-18 Thread Jorge Luis Betancourt González
The idea behind the URL filter plugins is to decide whether the current URL 
(string) should be allowed to be fetched or not. In your particular case I 
think that you could try to read the LinkDB and then decide if you want to 
fetch this particular URL or not; keep in mind that this logic should be 
fast, because it is going to be executed a lot of times (once for each 
URL). 

I don't know of any plugin that does this; typically this is kind of hard to do 
right (if possible at all), but you can check out the LinkDbReader for a way to 
read from the LinkDB to do your check. One more detail: if you only filter by 
the URL, you can find resources on the Web where the content has changed, and 
in that case you would wrongly discard the fetching of the resource. 
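
A hedged sketch of that LinkDB lookup, assuming the Nutch 1.x LinkDbReader API (a constructor taking a Configuration plus the linkdb Path, and a getInlinks(Text) accessor); verify the exact signatures against your Nutch version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class LinkDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // "crawl/linkdb" is a placeholder path to an existing linkdb.
    LinkDbReader reader = new LinkDbReader(conf, new Path("crawl/linkdb"));
    Inlinks inlinks = reader.getInlinks(new Text("http://example.com/page"));
    // A non-null inlink set suggests the URL is already known to the crawl.
    System.out.println(inlinks != null ? "seen before" : "not seen");
    reader.close();
  }
}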

Regards,



NUTCH-762 Generate Multiple Segments

2015-02-18 Thread Meraj A. Khan
Hi Folks,

I am facing the exact same problem that is described in JIRA NUTCH-762,
i.e. the generate/update cycle takes an excessive amount of time while the
actual fetch takes very little time compared to the generate time.

The JIRA issue commits a patch to allow generating multiple segments in a
single generate phase, however I was not able to do so.

How can I generate multiple segments in a single generate phase? I am
using Nutch 1.7 on YARN 2.3.0; any help would be greatly appreciated.
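
For what it's worth, NUTCH-762 added a -maxNumSegments option to the Generator. If your 1.7 build includes the patch, an invocation along these lines should emit several segments in one generate phase (paths and numbers are placeholders; check the usage output of bin/nutch generate for the exact flag):

# generate up to 10 segments of at most 50,000 URLs each in one pass
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 10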

Thanks.


Re: [MASSMAIL]RE: [MASSMAIL]URL filter plugins for nutch

2015-02-18 Thread Jorge Luis Betancourt González
Well, I sent my response before the rest of the thread developed, but my idea 
was not to store the hash in the LinkDB, but to look up the current URL being 
filtered in the LinkDB to see if it was fetched before. The first portion of 
Madan's email stated that he was implementing a URL filter which acts solely 
on the URL (string), hence my kind of partial-ish recommendation. As for 
accessing the content to compute the hash, a URL filter cannot be used, mainly 
because at this stage the URL hasn't been fetched and the filter is not aware 
of any content yet, as you explained in your email. 

Sorry if my out-of-time email caused some confusion :)

Regards,




RE: [MASSMAIL]URL filter plugins for nutch

2015-02-18 Thread Markus Jelsma
Hi Jorge - perhaps I am missing something, but the linkdb cannot hold 
content-derived information such as similarity hashes, nor does it cluster 
similar URLs as you would want when detecting spider traps. 

What do you think?

Markus 
 