Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Dotan Cohen
On Wed, Jul 31, 2013 at 4:56 AM, Bill Bell billnb...@gmail.com wrote: On Jul 30, 2013, at 12:34 PM, Dotan Cohen dotanco...@gmail.com wrote: On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact, when adding facet.mincount=20 (I

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Mikhail Khludnev
fwiw, this code won't capture uncommitted duplicates. On Wed, Jul 31, 2013 at 9:41 AM, Dotan Cohen dotanco...@gmail.com wrote: On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky j...@basetechnology.com wrote: The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... any

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Jack Krupansky
Good to note! But... any search will not detect dupe IDs for uncommitted documents. -- Jack Krupansky -Original Message- From: Mikhail Khludnev Sent: Wednesday, July 31, 2013 6:11 AM To: solr-user Subject: Re: How might one search for dupe IDs other than faceting on the ID field

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Aloke Ghoshal
Does adding facet.mincount=2 help? On Tue, Jul 30, 2013 at 11:46 PM, Dotan Cohen dotanco...@gmail.com wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving
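For reference, the facet query with the suggested facet.mincount parameter can be assembled as in the sketch below. The host, port, and core name in SOLR_SELECT_URL are assumptions for illustration only, as is the helper name build_dupe_facet_url; the query parameters are the ones quoted in the thread. Note that this only builds the URL: per Dotan's follow-up in this thread, adding facet.mincount did not avoid the OutOfMemoryError.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute your own host, port, and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_dupe_facet_url(base_url: str = SOLR_SELECT_URL,
                         field: str = "id",
                         mincount: int = 2) -> str:
    """Build a facet query that lists only values occurring >= mincount times."""
    params = [
        ("q", "*:*"),                       # match every document
        ("facet", "true"),                  # turn faceting on
        ("facet.field", field),             # facet on the ID field
        ("facet.mincount", str(mincount)),  # hide values seen fewer times
        ("rows", "0"),                      # no documents, only facet counts
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_dupe_facet_url())
```

urlencode percent-encodes the `*:*` query and joins the parameters with `&`, which is the separator missing from the query as it appears in the archived message.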

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:16 PM, Dotan Cohen wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: snip Might there be a

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Are you talking about the document's ID field? If so, you can't have duplicates... the latter document would overwrite the earlier. If not, sorry for asking irrelevant questions. :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes. -- Dotan Cohen http://gibberish.co.il

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Are you talking about the document's ID field? If so, you can't have duplicates... the latter document would overwrite the earlier. If not, sorry for asking irrelevant questions. :) In Solr 4.1 we

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Since this is a one-time problem, have you thought of just dumping all the IDs and looking for dupes using sort and awk or something similar to that? Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18
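The dump-and-scan idea could look roughly like the sketch below. It is not the poster's script: the function name duplicates_in_sorted is hypothetical, and it assumes the IDs have already been exported one per line and sorted (for example with the external sort tool mentioned above), so that equal IDs sit next to each other. Because it only compares neighbouring lines, memory use stays flat regardless of how many IDs are scanned.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def duplicates_in_sorted(ids: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (id, count) for every ID that occurs more than once.

    Assumes equal IDs are adjacent, i.e. the input is already sorted.
    """
    stripped = (line.strip() for line in ids)  # drop trailing newlines
    for doc_id, group in groupby(stripped):    # runs of identical IDs
        count = sum(1 for _ in group)
        if count > 1:
            yield doc_id, count


if __name__ == "__main__":
    # Small in-memory stand-in for a sorted dump of IDs.
    sample = ["a1\n", "a1\n", "b2\n", "c3\n", "c3\n", "c3\n"]
    for doc_id, count in duplicates_in_sorted(sample):
        print(doc_id, count)
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.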

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Since this is a one-time problem, have you thought of just dumping all the IDs and looking for dupes using sort and awk or something similar to that? All 100,000,000 of them :) That would take even

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:49 PM, Dotan Cohen wrote: Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the QTime, is there no way to judge the amount of memory required for a particular query to run?

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Mikhail Khludnev
Dotan, Could you please provide more lines of the stack trace? I have no idea why it got worse in 4.3. I know that 4.3 can use facets backed by DocValues, which are modest on the heap. But from what I saw (though I may be wrong), that is disabled for numeric facets. Hence, I can suggest reindexing id as

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Jack Krupansky
Message- From: Jack Krupansky Sent: Tuesday, July 30, 2013 4:14 PM To: solr-user@lucene.apache.org Subject: Re: How might one search for dupe IDs other than faceting on the ID field? The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... any particular reason you did

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Bill Bell
This seems like a fairly large issue. Can you create a Jira issue ? Bill Bell Sent from mobile On Jul 30, 2013, at 12:34 PM, Dotan Cohen dotanco...@gmail.com wrote: On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact,

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey s...@elyograg.org wrote: On 7/30/2013 12:49 PM, Dotan Cohen wrote: Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Dotan, Could you please provide more lines of the stack trace? Sure, thanks: <response><lst name="error"><str name="msg">java.lang.OutOfMemoryError: Java heap space</str><str name="trace">java.lang.RuntimeException:

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky j...@basetechnology.com wrote: The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... any particular reason you did not use it? See: http://wiki.apache.org/solr/Deduplication and