RE: Duplicate URLs

Nemani, Raj Thu, 23 Sep 2010 15:35:08 -0700

My Solr index has sources other than the data generated from Nutch crawls.  
What this means is that when I run solrdedup from Nutch, the dedup process will 
happen across the entire Solr index, not just on the documents generated and 
submitted by Nutch. Am I correct?


Is there a way I can have the deduping done on the Nutch side before sending 
the data set to Solr, even if it means I need to generate the Nutch index?  Just 
to reiterate, my dupes are based on the content, not on the URL.

On the other hand, it looks like you have to supply the Nutch index directory to 
the Nutch dedup command, not the segments directory.  The Hadoop log entries are 
below. Could the documentation be wrong?  Note that I have not generated the 
Nutch index.  After merging the segments and inverting the links, I just 
called dedup on my segments directory.  It did not seem to do anything.  Do 
I have to build the Nutch index and then call dedup on the segments 
directory?

2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at 2010-09-23 17:42:39
2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/segments
2010-09-23 17:42:40,792 WARN  mapred.FileInputFormat - Can't open index at file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+2147483647, skipping. (no segments* file found in org.apache.nutch.indexer.fsdirect...@file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134: files: [content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text])
2010-09-23 17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at 2010-09-23 17:42:45, elapsed: 00:00:05
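
The warning in the log reports that no segments* file was found and lists the 
entries of a crawl segment instead. A minimal Python sketch of that kind of 
check is below. It is only an illustration, not Nutch's actual code: the helper 
names are hypothetical, the demo directories hold empty placeholder files, and 
the index file names (segments_1, segments.gen) are illustrative.

```python
import os
import tempfile


def looks_like_lucene_index(path):
    """Return True if the directory holds an entry whose name starts with 'segments'."""
    return any(name.startswith("segments") for name in os.listdir(path))


def make_dir_with(names):
    """Demo helper (hypothetical): create a temp directory with empty placeholder files."""
    root = tempfile.mkdtemp()
    for name in names:
        open(os.path.join(root, name), "w").close()
    return root


# Entries exactly as listed in the warning: a crawl segment, not an index.
segment_dir = make_dir_with(
    ["content", "crawl_fetch", "crawl_generate", "crawl_parse", "parse_data", "parse_text"]
)
# Illustrative index directory; the file names are assumptions for the demo.
index_dir = make_dir_with(["segments_1", "segments.gen"])

print(looks_like_lucene_index(segment_dir))  # False
print(looks_like_lucene_index(index_dir))    # True
```

Under these assumptions a crawl segment directory fails the check, which is 
consistent with the input being skipped and the job finishing without doing 
anything.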

Thanks for all your help
Raj



-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, September 23, 2010 4:52 PM
To: [email protected]
Subject: RE: Duplicate URLs

bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>

 

You could also handle deduplication in your Solr configuration. It exposes more 
options and lets you mark duplicates (documents with identical signatures) or 
overwrite them (deduplicate).
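
The difference between marking and overwriting can be sketched in a few lines 
of Python. This is a toy in-memory model, not Solr's implementation: the 
function names and the overwrite_dupes flag are hypothetical, and MD5 is 
assumed as the signature.

```python
import hashlib


def signature(text):
    """Exact content signature (MD5 assumed for this sketch)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index_documents(docs, overwrite_dupes):
    """Toy index: docs is a list of (doc_id, content) pairs.

    overwrite_dupes=False: every document is kept and carries its signature,
    so duplicates are only marked (they share a signature value).
    overwrite_dupes=True: a new document replaces any earlier document
    that has the same signature.
    """
    index = []
    for doc_id, content in docs:
        sig = signature(content)
        if overwrite_dupes:
            index = [d for d in index if d["signature"] != sig]
        index.append({"id": doc_id, "signature": sig})
    return index


docs = [("a", "same text"), ("b", "same text"), ("c", "other text")]
marked = index_documents(docs, overwrite_dupes=False)
deduped = index_documents(docs, overwrite_dupes=True)
print([d["id"] for d in marked])   # ['a', 'b', 'c']
print([d["id"] for d in deduped])  # ['b', 'c']
```

In the marking case the duplicates can still be found later by grouping on the 
signature value.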

 

http://wiki.apache.org/solr/Deduplication
 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:48
To: [email protected]; 
Subject: RE: Duplicate URLs

Thanks again.  One final question.  I do not create a Nutch index.  I just push 
the crawl segments to Solr using the following command line.  

bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*

Do I need to create a Nutch index to get dedup going?  I ask because I saw a 
script online that submits the Nutch index directory to the dedup command.  Can 
I just pass in the segments directory (as shown in the document from the link 
you sent) without having to build the Nutch index?

I am going to try both ways in the meantime.

Thanks so much again
Raj


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, September 23, 2010 4:33 PM
To: [email protected]
Subject: RE: Duplicate URLs

Deduplication is a mechanism where a hash (signature) is generated from the 
contents of one or more fields (usually title and/or content). It can be as 
simple as an MD5 hash or a fuzzier match. Nutch can deduplicate by itself using 
its dedup command line option (see the link below). You can also use Nutch to 
deduplicate whatever you pushed to a Solr index, and you can configure Solr to 
deduplicate as well.
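
To make the exact versus fuzzy distinction concrete, here is a small Python 
sketch. It is an assumption-laden illustration, not the signature code used by 
Nutch or Solr: the function names are hypothetical, and the fuzzy variant 
simply lowercases the text and drops punctuation and extra whitespace before 
hashing.

```python
import hashlib
import re


def exact_signature(text):
    """MD5 over the raw text: any byte difference yields a different signature."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def fuzzy_signature(text):
    """MD5 over a normalized form: lowercase alphanumeric words joined by single spaces."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()


page_a = "Avian Flu:  WHO recommended action."
page_b = "avian flu who recommended action"

print(exact_signature(page_a) == exact_signature(page_b))  # False
print(fuzzy_signature(page_a) == fuzzy_signature(page_b))  # True
```

Two documents count as duplicates when their signatures match, so the choice of 
signature decides how tolerant the deduplication is.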

 

http://wiki.apache.org/nutch/CommandLineOptions

 


 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:26
To: [email protected]; 
Subject: RE: Duplicate URLs

Markus,

Thanks so much.
Is there any link outlining the steps to take that you can forward, or could 
you just explain if you can?  I appreciate your help.  I will keep looking 
online in the meantime.

Thanks
Raj


-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Thursday, September 23, 2010 4:20 PM
To: [email protected]
Subject: RE: Duplicate URLs

Use deduplication. 
 
-----Original message-----
From: Nemani, Raj <[email protected]>
Sent: Thu 23-09-2010 22:12
To: [email protected]; 
Subject: Duplicate URLs

All,



I just wanted to see if there is a way we can tell Nutch to treat the
following URLs as the same.  





http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm



http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm





As you know, you can set up web servers such that both of the URLs above
resolve to the same endpoint.  In other words, the two URLs are actually
the *same* even though they are physically different.  Is there any way I can
tell Nutch to treat these URLs as the same?

I cannot use filtering to ignore one or the other (either with
DOMAINNAME or without) because I need to allow both patterns to allow
genuine URLs.
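
One way to picture "treat these URLs as the same" is to rewrite every URL to a
canonical host before comparing. The Python below is only a sketch of that
idea: the alias table and the function name are hypothetical, and in Nutch a
rewrite like this would more likely be expressed as a URL normalizer rule
(for example in regex-normalize.xml) than as custom code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: short host name -> fully qualified host name.
HOST_ALIASES = {"sitename": "sitename.domainname.com"}


def canonical_url(url):
    """Rewrite a URL so that known host aliases map to one canonical host.

    Host names are lowercased; any user:password part is dropped (sketch only).
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


short = "http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm"
full = "http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm"

print(canonical_url(short) == canonical_url(full))  # True
```

Both patterns stay allowed; only their representation is unified, so no
filtering is involved.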



Thanks

Raj
