On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
> My solr index has sources other than the data generated from Nutch crawls.
> What this means is that when I do solrDedup from Nutch, the dedup process
> will happen across the entire solr index, not just on the documents
> generated and submitted by Nutch. Am I correct?
Correct.

> Is there a way I can have the deduping done on the Nutch side before
> sending the data set to Solr, even if it means I need to generate the
> Nutch index? Just to reiterate, my dupes are based on the content, not
> on the URL.

I'm not sure. You'll need a Nutch index to deduplicate first. But it's the
index that will be deduplicated, not the parsed segments. Sending stuff to
Solr then would not be very helpful.

> On the other hand, it looks like you have to supply the Nutch index
> directory to the Nutch dedup command, not the segments directory. Here
> are the Hadoop log entries. Could the documentation be wrong? Note that
> I have not generated the Nutch index. After merging the segments and
> inverting the links, I just called dedup on my segments directory. It
> did not seem to do anything. Do I have to build the Nutch index and then
> call dedup on the segments directory?

The Nutch dedup command requires a parameter pointing to an index; you'll
need an index in Nutch to dedup.

> 2010-09-23 17:42:39,673 INFO indexer.DeleteDuplicates - Dedup: starting at 2010-09-23 17:42:39
> 2010-09-23 17:42:39,698 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/segments
> 2010-09-23 17:42:40,792 WARN mapred.FileInputFormat - Can't open index at
> file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+2147483647, skipping.
> (no segments* file found in
> org.apache.nutch.indexer.fsdirect...@file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:
> files: [content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text])
> 2010-09-23 17:42:45,200 INFO indexer.DeleteDuplicates - Dedup: finished at 2010-09-23 17:42:45, elapsed: 00:00:05

What's the segments* doing there? It shouldn't be.
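[Editor's note: the build-an-index-then-dedup workflow described above can be sketched as the following command sequence. This is a sketch based on the standard Nutch 1.x tutorial layout (a `crawl/` directory with `crawldb`, `linkdb`, `segments`); the `crawl/indexes` path is an assumed example, not something named in the thread.]

```shell
# Build a Nutch (Lucene) index from the crawl data first --
# dedup operates on the index, not on the parsed segments.
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

# Now deduplicate the index; duplicate documents are deleted from it.
bin/nutch dedup crawl/indexes
```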
> Thanks for all your help
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:52 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> bin/nutch solrdedup
> Usage: SolrDeleteDuplicates <solr url>
>
> You could also handle deduplication in your Solr configuration. It exposes
> more options and lets you mark duplicates (documents with identical
> signatures) or overwrite them (deduplicate).
>
> http://wiki.apache.org/solr/Deduplication
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:48
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Thanks again. One final question. I do not create a Nutch index. I just
> push the crawl segments to Solr using the following command line:
>
> bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*
>
> Do I need to create a Nutch index to get the dedup going? I saw an online
> script that submits the Nutch index directory to the dedup command. Can I
> just pass in the segments directory (as shown in the document from the
> link you sent) without having to build the Nutch index?
>
> I am going to try both ways in the meantime.
>
> Thanks so much again
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:33 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Deduplication is a mechanism where a hash is generated based on the
> contents of some field (title and/or content, as usual). It can be as
> simple as an MD5 hash or a more fuzzy match. Nutch can deduplicate itself
> by using that command line option. You can also use Nutch to deduplicate
> whatever you pushed to a Solr index, and you can configure Solr to
> deduplicate as well.
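[Editor's note: the signature mechanism described above (hash some field's contents, then treat documents with equal hashes as duplicates) can be sketched in a few lines of Python. The field names and documents here are made-up examples; Nutch's real signature implementations are Java classes such as MD5Signature and TextProfileSignature.]

```python
import hashlib

def signature(doc, fields=("title", "content")):
    """Concatenate the chosen fields and MD5-hash them, like a simple
    content-based signature."""
    data = "\n".join(doc.get(f, "") for f in fields)
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first document seen for each signature, drop the rest."""
    seen = {}
    for doc in docs:
        seen.setdefault(signature(doc), doc)
    return list(seen.values())

docs = [
    {"url": "http://a.example/page",  "title": "Avian flu", "content": "WHO recommended action"},
    {"url": "http://b.example/copy",  "title": "Avian flu", "content": "WHO recommended action"},
    {"url": "http://c.example/other", "title": "Other",     "content": "Different text"},
]
unique = deduplicate(docs)
# The second document has identical content under a different URL,
# so it is dropped; two documents survive.
```

Note that this is exactly why dedup here catches content duplicates rather than URL duplicates: the URL never enters the hash.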
> http://wiki.apache.org/nutch/CommandLineOptions
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:26
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Markus,
>
> Thanks so much.
> Any link that outlines the steps to take that you can forward, or just
> explain if you can. I appreciate your help. I will keep looking online
> in the meantime.
>
> Thanks
> Raj
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Thursday, September 23, 2010 4:20 PM
> To: [email protected]
> Subject: RE: Duplicate URLs
>
> Use deduplication.
>
> -----Original message-----
> From: Nemani, Raj <[email protected]>
> Sent: Thu 23-09-2010 22:12
> To: [email protected]
> Subject: Duplicate URLs
>
> All,
>
> I just wanted to see if there is a way we can tell Nutch to treat the
> following URLs as the same:
>
> http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm
>
> http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm
>
> As you know, you can set up web servers such that both of the URLs above
> resolve to the same endpoint. In other words, the two URLs are actually
> *the same* even though they are physically different. Is there any way I
> can tell Nutch to treat these URLs as the same?
>
> I cannot use filtering to ignore one or the other (either with
> DOMAINNAME or without) because I need to allow both patterns, to allow
> genuine URLs through.
>
> Thanks
>
> Raj

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
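[Editor's note: one way to approach the two-URLs-one-page question in the quoted message, not suggested in the thread itself: Nutch's regex URL normalizer (the urlnormalizer-regex plugin, configured in conf/regex-normalize.xml) can rewrite the bare hostname to the canonical fully-qualified form before URLs are stored, so both variants collapse into a single crawl entry. SITENAME and DOMAINNAME are the poster's placeholders; a sketch, assuming the plugin is enabled in plugin.includes:]

```xml
<!-- conf/regex-normalize.xml: rewrite the bare-hostname form to the
     canonical fully-qualified form so both URL variants normalize
     to one URL before fetching and indexing. -->
<regex-normalize>
  <regex>
    <pattern>^http://SITENAME/</pattern>
    <substitution>http://SITENAME.DOMAINNAME.com/</substitution>
  </regex>
</regex-normalize>
```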

