On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
> My solr index has sources other than the data generated from Nutch crawls.
> What this means is that when I do solrDedup from Nutch, the dedup process
> will happen across the entire solr index, not just on the documents
> generated and submitted by Nutch. Am I correct?
Correct.

> Is there a way I can have the deduping done on the Nutch side before
> sending the data set to Solr, even if it means I need to generate the
> Nutch index? Just to reiterate, my dupes are based on the content, not
> on the URL.

I'm not sure. You'll need a Nutch index to deduplicate first. But it's the
index that will be deduplicated, not the parsed segments. Sending stuff to
Solr then would not be very helpful.

> On the other hand, it looks like you have to supply the Nutch index
> directory to the Nutch dedup command, not the segments directory. Here
> are the Hadoop log entries. Could the documentation be wrong? Note that
> I have not generated the Nutch index. After merging the segments and
> inverting the links, I just called dedup on my segments directory. It
> did not seem to do anything. Do I have to build the Nutch index and then
> call dedup on the segments directory?

The Nutch dedup command requires a parameter pointing to an index; you'll
need an index in Nutch to dedup.

> 2010-09-23 17:42:39,673 INFO indexer.DeleteDuplicates - Dedup: starting at 2010-09-23 17:42:39
> 2010-09-23 17:42:39,698 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/segments
> 2010-09-23 17:42:40,792 WARN mapred.FileInputFormat - Can't open index at
> file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+2147483647, skipping.
> (no segments* file found in
> org.apache.nutch.indexer.fsdirect...@file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:
> files: [content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text])
> 2010-09-23 17:42:45,200 INFO indexer.DeleteDuplicates - Dedup: finished at 2010-09-23 17:42:45, elapsed: 00:00:05

What's the segments* doing there? It shouldn't be.
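[Editor's note: the build-an-index-then-dedup workflow described above can be sketched as the following command sequence. This is a sketch based on the standard Nutch 1.x tutorial layout (a `crawl/` directory with `crawldb`, `linkdb`, `segments`); the `crawl/indexes` path is an assumed example, not something named in the thread.]

```shell
# Build a Nutch (Lucene) index from the crawl data first --
# dedup operates on the index, not on the parsed segments.
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

# Now deduplicate the index; duplicate documents are deleted from it.
bin/nutch dedup crawl/indexes
```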
> Thanks for all your help
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:52 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> bin/nutch solrdedup
> Usage: SolrDeleteDuplicates <solr url>
>
> You could also handle deduplication in your Solr configuration. It exposes
> more options and lets you mark duplicates (documents with identical
> signatures) or overwrite them (deduplicate).
>
> http://wiki.apache.org/solr/Deduplication
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:48
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Thanks again. One final question. I do not create a Nutch index. I just
> push the crawl segments to Solr using the following command line:
>
> bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*
>
> Do I need to create a Nutch index to get the dedup going? I saw an online
> script that submits the Nutch index directory to the dedup command. Can I
> just pass in the segments directory (as shown in the document from the
> link you sent) without having to build the Nutch index?
>
> I am going to try both ways in the meantime.
>
> Thanks so much again
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:33 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Deduplication is a mechanism where a hash is generated based on the
> contents of some field (title and/or content, as usual). It can be as
> simple as an MD5 hash or a more fuzzy match. Nutch can deduplicate itself
> by using that command line option. You can also use Nutch to deduplicate
> whatever you pushed to a Solr index, and you can configure Solr to
> deduplicate as well.
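[Editor's note: the signature mechanism described above (hash some field's contents, then treat documents with equal hashes as duplicates) can be sketched in a few lines of Python. The field names and documents here are made-up examples; Nutch's real signature implementations are Java classes such as MD5Signature and TextProfileSignature.]

```python
import hashlib

def signature(doc, fields=("title", "content")):
    """Concatenate the chosen fields and MD5-hash them, like a simple
    content-based signature."""
    data = "\n".join(doc.get(f, "") for f in fields)
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first document seen for each signature, drop the rest."""
    seen = {}
    for doc in docs:
        seen.setdefault(signature(doc), doc)
    return list(seen.values())

docs = [
    {"url": "http://a.example/page",  "title": "Avian flu", "content": "WHO recommended action"},
    {"url": "http://b.example/copy",  "title": "Avian flu", "content": "WHO recommended action"},
    {"url": "http://c.example/other", "title": "Other",     "content": "Different text"},
]
unique = deduplicate(docs)
# The second document has identical content under a different URL,
# so it is dropped; two documents survive.
```

Note that this is exactly why dedup here catches content duplicates rather than URL duplicates: the URL never enters the hash.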
> http://wiki.apache.org/nutch/CommandLineOptions
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:26
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Markus,
>
> Thanks so much.
> Any link that outlines the steps to take that you can forward, or just
> explain if you can. I appreciate your help. I will keep looking online
> in the meantime.
>
> Thanks
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:20 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Use deduplication.
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:12
> To: [email protected]
> Subject: Duplicate URLs
>
> All,
>
> I just wanted to see if there is a way we can tell Nutch to treat the
> following URLs as the same:
>
> http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
>
> http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm
>
> As you know, you can set up web servers such that both of the URLs above
> resolve to the same endpoint. In other words, the two URLs are actually
> *the same* even though they are physically different. Is there any way I
> can tell Nutch to treat these URLs as the same?
>
> I cannot use filtering to ignore one or the other (either with
> DOMAINNAME or without) because I need to allow both patterns, to allow
> genuine URLs through.
>
> Thanks
>
> Raj

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
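[Editor's note: one way to approach the two-URLs-one-page question in the quoted message, not suggested in the thread itself: Nutch's regex URL normalizer (the urlnormalizer-regex plugin, configured in conf/regex-normalize.xml) can rewrite the bare hostname to the canonical fully-qualified form before URLs are stored, so both variants collapse into a single crawl entry. SITENAME and DOMAINNAME are the poster's placeholders; a sketch, assuming the plugin is enabled in plugin.includes:]

```xml
<!-- conf/regex-normalize.xml: rewrite the bare-hostname form to the
     canonical fully-qualified form so both URL variants normalize
     to one URL before fetching and indexing. -->
<regex-normalize>
  <regex>
    <pattern>^http://SITENAME/</pattern>
    <substitution>http://SITENAME.DOMAINNAME.com/</substitution>
  </regex>
</regex-normalize>
```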

