Re: Duplicate URLs

2010-09-24 Thread Markus Jelsma
On Friday 24 September 2010 00:33:54 Nemani, Raj wrote: My Solr index has sources other than the data generated from Nutch crawls. This means that when I run solrdedup from Nutch, the dedup process will happen across the entire Solr index, not just on the documents generated and

Re: Nutch 1.2 solrdedup and OutOfMemoryError

2010-09-24 Thread Markus Jelsma
I'm not surprised that your memory is eaten when fetching almost 10 million documents! The deduplication code is a bit tough to read, but it looks like it's hardcoded to fetch all records and split them between maps. If you've got one map, it'll fetch all records and so eat up your memory. I'm

RE: Duplicate URLs

2010-09-24 Thread Nemani, Raj
Thank you so much! Based on the conversation you are having in another thread that deals with OutOfMemory exceptions during SolrDedup, I may have to investigate deduping on the Solr side. My index is 3.2 million documents and constantly growing at a considerable rate. Thanks again, Raj

Re: CPU %100

2010-09-24 Thread Alexey Serba
You can try the jstack tool. On Wed, Sep 22, 2010 at 12:39 PM, Yavuz Selim YILMAZ yvzslmyilm...@gmail.com wrote: If I'm not mistaken, it was added in 1.2, which I already use. Any idea? -- Yavuz Selim YILMAZ 2010/9/17 Ken Krugler kkrugler_li...@transpac.com On Sep 17, 2010, at 5:58am, Yavuz

RE: Nutch 1.2 solrdedup and OutOfMemoryError

2010-09-24 Thread brad
Thanks for the info. I'll give the Solr deduplication a try. It looks like it's not as thorough as the regular dedup process (URL, content, highest score, shortest URL), but I think it will work. Brad -Original Message- From: Markus Jelsma [mailto:markus.jel...@buyways.nl] Sent:

RE: Nutch 1.2 solrdedup and OutOfMemoryError

2010-09-24 Thread Nemani, Raj
Well, I think you can specify a list of fields in solrconfig.xml as part of the dedup configuration to control how Solr determines whether two documents are identical. It should be pretty flexible. Of course, correct me if I misunderstood your comment. -Original Message- From: brad

RE: Duplicate URLs

2010-09-24 Thread Nemani, Raj
So I went with Solr deduping in the end, configuring Solr for deduplication in solrconfig.xml. Here is what I ended up doing. I noticed that the digest field generated by Nutch for the two URLs I mentioned is the same. So I used that as the field and created a new signature field in schema.xml. Here

[RESULT] [VOTE] Apache Nutch 1.2 Release Candidate #4

2010-09-24 Thread Mattmann, Chris A (388J)
Hi Folks, This VOTE has passed (thanks Dennis and Andrzej!). Here are the tallies: +1 PMC (binding) Chris Mattmann Dennis Kubes Andrzej Bialecki I'll push the release out to the mirrors, and update the website docs. Woo-hoo 1.2 is on its way out the door! Cheers, Chris