On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
My solr index has sources other than the data generated from Nutch crawls.
What this means is that when I do solrDedup from Nutch, the dedup process
will happen across the entire solr Index, not just on the documents
generated and
I'm not surprised that your memory is eaten up when fetching almost 10 million
documents! The deduplication code is a bit tough to read, but it looks like
it's hardcoded to fetch all records and split them between maps. If you've got
one map, it'll fetch all records and so eat up your memory.
I'm
Thank you so much! Based on the conversation you are having in another
thread that deals with OutOfMemory exceptions during SolrDedup, I may
have to investigate deduping on the solr side. My index is 3.2 million
documents and constantly growing at a considerable rate.
Thanks again
Raj
You can try the jstack tool
On Wed, Sep 22, 2010 at 12:39 PM, Yavuz Selim YILMAZ
yvzslmyilm...@gmail.com wrote:
If I'm not mistaken it is added to 1.2 which I already use.
Any idea?
--
Yavuz Selim YILMAZ
2010/9/17 Ken Krugler kkrugler_li...@transpac.com
On Sep 17, 2010, at 5:58am, Yavuz
Thanks for the info. I'll give the Solr deduplication a try. It looks like
it's not as thorough as the regular dedup process (URL, content, highest
score, shortest URL), but I think it will work.
Brad
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@buyways.nl]
Sent:
Well, I think you can specify a list of fields in SolrConfig.xml during
dedup configuration to control how Solr determines if two documents are
identical. It should be pretty flexible. Correct me of course if I
misunderstood your comment.
-Original Message-
From: brad
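The field-list idea described above can be sketched roughly as follows. This is only an illustration of signature-based deduplication in general, not Solr's or Nutch's actual code: the helper names `signature` and `dedupe`, the use of MD5, and the sample documents are all assumptions made for the example. A signature is computed from a chosen list of fields, and documents that share a signature are treated as duplicates, with a later document replacing an earlier one.

```python
import hashlib


def signature(doc, fields):
    """Hash the values of the chosen fields into one hex digest (hypothetical helper)."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    return h.hexdigest()


def dedupe(docs, fields):
    """Keep one document per signature; a later duplicate replaces an earlier one."""
    kept = {}
    for doc in docs:
        kept[signature(doc, fields)] = doc
    return list(kept.values())


# Made-up sample data: two URLs whose content digest is identical.
docs = [
    {"url": "http://example.com/a", "digest": "d1"},
    {"url": "http://example.com/a/", "digest": "d1"},
    {"url": "http://example.com/b", "digest": "d2"},
]

unique = dedupe(docs, ["digest"])
```

Choosing `["digest"]` as the field list collapses the two URLs with identical content into one entry, while choosing `["url"]` would keep all three, which is why the choice of fields controls what counts as "identical".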
So I used Solr deduping in the end, by configuring Solr for deduping
in SolrConfig.xml. Here is what I ended up doing. I noticed that the
digest field generated by Nutch for the two URLs I mentioned is the same.
So I used that as the field and created a new Signature field in
schema.xml. Here
Hi Folks,
This VOTE has passed (thanks Dennis and Andrzej!).
Here are the tallies:
+1
PMC (binding)
Chris Mattmann
Dennis Kubes
Andrzej Bialecki
I'll push the release out to the mirrors, and update the website docs.
Woo-hoo 1.2 is on its way out the door!
Cheers,
Chris