Re: deduplication of suggester results are not enough

2020-03-26 Thread Michal Hlavac
Hi Roland,

I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at
index time.
I will publish it on GitHub in a few days and will write to this thread when it's done.

m.

On štvrtok 26. marca 2020 16:01:57 CET Szűcs Roland wrote:
> Hi All,
> 
> I have been following the suggester-related discussions for quite a while.
> Everybody agrees that it is not the expected behaviour for a suggester,
> where the terms are the entities and not the documents, to return the same
> string representation several times.
> 
> One suggestion was to do the deduplication on the client side of Solr. That
> is very easy in most client solutions, as any set-based data structure
> solves it.
> 
> *But one important problem is not solved by deduplication: suggest.count*.
> 
> If I have 15 matches from the suggester and suggest.count=10, and the first
> 9 matches are the same, I will get back only 2 after deduplication and the
> remaining 5 unique terms will never be shown.
> 
> What is the solution for this?
> 
> Cheers,
> Roland
> 
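For the client-side, set-based deduplication mentioned in this thread, a minimal Java sketch (it assumes the suggestions have already been fetched as plain strings, e.g. via SolrJ, and that a larger suggest.count was requested so deduplication still leaves enough unique terms):

import java.util.LinkedHashSet;
import java.util.List;
import java.util.stream.Collectors;

public class SuggestionDedup {
    /** Keeps the first occurrence of each suggestion, preserving the original order. */
    public static List<String> dedupe(List<String> suggestions, int limit) {
        return new LinkedHashSet<>(suggestions).stream()
                .limit(limit)
                .collect(Collectors.toList());
    }
}

This only works around the suggest.count issue by over-fetching; as Roland notes, duplicates that fill the requested count can still hide unique terms.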


Re: Deduplication

2015-05-20 Thread Bram Van Dam
 Write a custom update processor and include it in your update chain.
 You will then have the ability to do anything you want with the entire
 input document before it hits the code to actually do the indexing.

This sounded like the perfect option ... until I read Jack's comment:


 My understanding was that the distributed update processor is near the end
 of the chain, so that running of user update processors occurs before the
 distribution step, but is that distribution to the leader, or distribution
 from leader to replicas for a shard?

That would pose some potential problems.

Would a custom update processor make the solution cloud-safe?

Thx,

 - Bram



Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote:
 Hi Bram,
 what do you mean with :
   I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values  .
 
 This is not de-duplication, but simple document filtering based on a
 constraint.
 In case you want de-duplication (which it seemed like from the very first
 part of your mail), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after, I
essentially want a unique constraint on an arbitrary field. Without
overwrite semantics, because I want Solr to tell me if a duplicate is
sent to Solr.

I was thinking that the de-duplication feature could accomplish this
somehow.


 - Bram


Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What the Solr de-duplication offers you is to calculate, for each input
document, a hash (based on a set of fields).
You can then select two options:
 - index everything; documents with the same signature will be equal
 - avoid the overwriting of duplicates.

How the similarity hash is calculated is something you can play with and
customise if needed.

Clarified that, do you think it can fit in some way, or are you definitely
not talking about dedupe?
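A rough sketch of the solrconfig.xml chain behind those options (same shape as the De-Duplication wiki example; the field list is illustrative, and overwriteDupes toggles between the two behaviours above):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- false: index everything, duplicates share a signature; true: later duplicates overwrite -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The signatureField (here "signature") also has to be defined as a field in the schema.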

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

 On 19/05/15 14:47, Alessandro Benedetti wrote:
  Hi Bram,
  what do you mean with :
I
  would like it to provide the unique value myself, without having the
  deduplicator create a hash of field values  .
 
  This is not de-duplication, but simple document filtering based on a
  constraint.
  In the case you want de-duplication ( which seemed from your very first
  part of the mail) here you can find a lot of info :

 Not sure whether de-duplication is the right word for what I'm after, I
 essentially want a unique constraint on an arbitrary field. Without
 overwrite semantics, because I want Solr to tell me if a duplicate is
 sent to Solr.

 I was thinking that the de-duplication feature could accomplish this
 somehow.


  - Bram




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote:

  Write a custom update processor and include it in your update chain.
  You will then have the ability to do anything you want with the entire
  input document before it hits the code to actually do the indexing.

 This sounded like the perfect option ... until I read Jack's comment:

 
  My understanding was that the distributed update processor is near the
 end
  of the chain, so that running of user update processors occurs before the
  distribution step, but is that distribution to the leader, or
 distribution
  from leader to replicas for a shard?

 That would pose some potential problems.

 Would a custom update processor make the solution cloud-safe?


Starting with Solr 5.1, you have the ability to specify an update processor
on the fly in the request, and you can even control whether it is to be
executed before any distribution happens or before the document is actually
indexed on the replica.

e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to
have processor xyz run first and then MyCustomUpdateProc and then the
default update processor chain (which will also distribute the doc to the
leader or from the leader to a replica). This also means that such
processors will not be executed on the replicas at all.

You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and
MyCustomUpdateProc run on each replica (including the leader) right
before the doc is indexed (i.e. just before RunUpdateProcessor).

Unfortunately, due to an oversight, this feature hasn't been documented
well, which is something I'll fix. See
https://issues.apache.org/jira/browse/SOLR-6892 for more details.
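As an illustration of such a request (the collection name and document are placeholders, and it assumes update processors named xyz and MyCustomUpdateProc are defined in solrconfig.xml):

curl 'http://localhost:8983/solr/mycollection/update?processor=xyz,MyCustomUpdateProc&commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"doc1"}]'

The same request with post-processor=xyz,MyCustomUpdateProc instead would run them on each replica just before indexing, as described above.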



 Thx,

  - Bram




-- 
Regards,
Shalin Shekhar Mangar.


Re: Deduplication

2015-05-19 Thread Alessandro Benedetti
Hi Bram,
what do you mean with: "I would like it to provide the unique value myself,
without having the deduplicator create a hash of field values"?

This is not de-duplication, but simple document filtering based on a
constraint.
In case you want de-duplication (which it seemed like from the very first
part of your mail), here you can find a lot of info:

https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know for more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

 Hi folks,

 I'm looking for a way to have Solr reject documents if a certain field
 value is duplicated (reject, not overwrite). There doesn't seem to be
 any kind of unique option in schema fields.

 The de-duplication feature seems to make this (somewhat) possible, but I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values.

 Am I missing an obvious (or less obvious) way of accomplishing this?

 Thanks,

  - Bram




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Deduplication

2015-05-19 Thread Jack Krupansky
Shawn, I was going to say the same thing, but... then I was thinking about
SolrCloud and the fact that update processors are invoked before the
document is sent to its target node, so there wouldn't be a reliable way to
tell whether the input document field value exists on the target rather than
the current node.

Or does the update processing only occur on the leader node after being
forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end
of the chain, so that running of user update processors occurs before the
distribution step, but is that distribution to the leader, or distribution
from leader to replicas for a shard?


-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 5/19/2015 3:02 AM, Bram Van Dam wrote:
  I'm looking for a way to have Solr reject documents if a certain field
  value is duplicated (reject, not overwrite). There doesn't seem to be
  any kind of unique option in schema fields.
 
  The de-duplication feature seems to make this (somewhat) possible, but I
  would like it to provide the unique value myself, without having the
  deduplicator create a hash of field values.
 
  Am I missing an obvious (or less obvious) way of accomplishing this?

 Write a custom update processor and include it in your update chain.
 You will then have the ability to do anything you want with the entire
 input document before it hits the code to actually do the indexing.

 A script update processor included with Solr allows you to write your
 processor in a language other than Java, such as JavaScript.


 https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

 Here's how to discard a document in an update processor written in Java:


 http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

 The javadoc that I linked above describes the ability to return false
 in other languages to discard the document.

 Thanks,
 Shawn




Re: Deduplication

2015-05-19 Thread Shawn Heisey
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
 I'm looking for a way to have Solr reject documents if a certain field
 value is duplicated (reject, not overwrite). There doesn't seem to be
 any kind of unique option in schema fields.
 
 The de-duplication feature seems to make this (somewhat) possible, but I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values.
 
 Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain.
You will then have the ability to do anything you want with the entire
input document before it hits the code to actually do the indexing.

A script update processor included with Solr allows you to write your
processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:

http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return false
in other languages to discard the document.

Thanks,
Shawn
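A rough Java sketch of the "discard the document" idea from the links above (class and field names are illustrative, and the duplicate check is left as a stub; a real SolrCloud-safe version would have to check the right shard, which is the concern Jack raises elsewhere in this thread):

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class RejectDuplicatesUpdateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object value = doc.getFieldValue("my_unique_field"); // illustrative field name
        if (value != null && alreadyIndexed(value)) {
          // Not calling super.processAdd(cmd) silently drops the document;
          // throwing an exception here would instead reject the whole request.
          return;
        }
        super.processAdd(cmd);
      }
    };
  }

  // Stub: a real implementation would look the value up in the index.
  private boolean alreadyIndexed(Object value) {
    return false;
  }
}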



RE: Deduplication in SolrCloud

2012-07-27 Thread Markus Jelsma
This issue doesn't really describe your problem but a more general problem of 
distributed deduplication:
https://issues.apache.org/jira/browse/SOLR-3473
 
 
-Original message-
 From:Daniel Brügge daniel.brue...@googlemail.com
 Sent: Fri 27-Jul-2012 17:38
 To: solr-user@lucene.apache.org
 Subject: Deduplication in SolrCloud
 
 Hi,
 
 in my old Solr Setup I have used the deduplication feature in the update
 chain
 with couple of fields.
 
 <updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">signature</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">uuid,type,url,content_hash</str>
     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 
 This worked fine. When I now use this in my 2-shard SolrCloud setup and
 insert 150.000 documents, I am always getting an error:
 
 *INFO: end_commit_flush*
 *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
 *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
 unable to create new native thread*
 * at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
 *
 * at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
 *
 
 I am inserting the documents via CSV import and curl command and split them
 also into 50k chunks.
 
 Without the dedupe chain, the import finishes after 40secs.
 
 The curl command writes to one of my shards.
 
 
 Do you have an idea why this happens? Should I reduce the fields to one? I
 have read that not using the id as
 dedupe fields could be an issue?
 
 
 I have searched for deduplication with SolrCloud and I am wondering if it
 is already working correctly? see e.g.
 http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
 
 Thanks & regards
 
 Daniel
 


Re: Deduplication in SolrCloud

2012-07-27 Thread Lance Norskog
Should the old Signature code be removed? Given that the goal is to
have everyone use SolrCloud, maybe this kind of landmine should be
removed?

On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 This issue doesn't really describe your problem but a more general problem of 
 distributed deduplication:
 https://issues.apache.org/jira/browse/SOLR-3473


 -Original message-
 From:Daniel Brügge daniel.brue...@googlemail.com
 Sent: Fri 27-Jul-2012 17:38
 To: solr-user@lucene.apache.org
 Subject: Deduplication in SolrCloud

 Hi,

 in my old Solr Setup I have used the deduplication feature in the update
 chain
 with couple of fields.

 <updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">signature</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">uuid,type,url,content_hash</str>
     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 This worked fine. When I now use this in my 2 shards SolrCloud setup when
 inserting 150.000 documents,
 I am always getting an error:

 *INFO: end_commit_flush*
 *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
 *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
 unable to create new native thread*
 * at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
 *
 * at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
 *

 I am inserting the documents via CSV import and curl command and split them
 also into 50k chunks.

 Without the dedupe chain, the import finishes after 40secs.

 The curl command writes to one of my shards.


 Do you have an idea why this happens? Should I reduce the fields to one? I
 have read that not using the id as
 dedupe fields could be an issue?


 I have searched for deduplication with SolrCloud and I am wondering if it
 is already working correctly? see e.g.
 http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

 Thanks & regards

 Daniel




-- 
Lance Norskog
goks...@gmail.com


Re: Deduplication questions

2011-04-11 Thread Chris Hostetter

: Q1. Is is possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.

: Q2. Method calculate() is using concatenated fields from str
: name=fieldsname,features,cat/str
: Is there any mechanism I could build  field dependant signatures?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (that have sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but
: would work)

I don't know whether what you describe is really intentional or not, but it 
should work.


-Hoss


Re: Deduplication

2010-05-19 Thread Ahmet Arslan

 Basically, for some use cases I would like to show duplicates; for others I
 want them ignored.
 
 If I have overwriteDupes=false and I just create the dedup hash, how can I
 query for only unique hash values... i.e. something like a SQL GROUP BY?

TermsComponent maybe? 

or faceting? 
q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0

if you append facet.mincount=1 to the above URL you can see your duplications


  


Re: Deduplication

2010-05-19 Thread Ahmet Arslan
 TermsComponent maybe? 
 
 or faceting?
 q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0
 
 if you append facet.mincount=1 to above url you can
 see your duplications
 

After re-reading your message: sometimes you want to show duplicates, sometimes 
you don't want them. I have never used FieldCollapsing by myself but heard 
about it many times.

http://wiki.apache.org/solr/FieldCollapsing
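As a sketch of the kind of query field collapsing enables (written here in the result-grouping syntax that later became part of Solr; the field name is assumed to be the signatureField used in the facet example above):

q=*:*&group=true&group.field=signatureField&group.limit=1

Each group then returns only one document per signature value, i.e. duplicates are collapsed at query time instead of being rejected at index time.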


  


Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Field collapsing has been used by many in their production
environment. Over the last few months the stability of the patch grew as
quite some bugs were fixed. The only big feature currently missing is
caching of the collapsing algorithm. I'm currently working on that and
I will put it in a new patch in the coming days. So yes, the
patch is very near to being production ready.

Martijn

2009/11/26 KaktuChakarabati jimmoe...@gmail.com:

 Hey Otis,
 Yep, I realized this myself after playing some with the dedupe feature
 yesterday.
 So it does look like Field collapsing is what I need pretty much.
 Any idea on how close it is to being production-ready?

 Thanks,
 -Chak

 Otis Gospodnetic wrote:

 Hi,

 As far as I know, the point of deduplication in Solr (
 http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
 document before indexing it in order to avoid duplicates in the index in
 the first place.

 What you are describing is closer to field collapsing patch in SOLR-236.

  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4


 Hey,
 I've been trying to find some documentation on using this feature in 1.4
 but
 Wiki page is alittle sparse..
 In specific, here's what i'm trying to do:

 I have a field, say 'duplicate_group_id' that i'll populate based on some
 offline documents deduplication process I have.

 All I want is for solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when i search for documents later, all
 documents with same original 'duplicate_group_id' value will be rolled up
 (e.g i'll just get the first one that came back  according to relevancy).

 I enabled the deduplication processor and put it into updater, but i'm
 not
 seeing any difference in returned results (i.e results with same
 duplicate_id are returned separately..)

 is there anything i need to supply in query-time for this to take effect?
 what should be the behaviour? is there any working example of this?

 Anything will be helpful..

 Thanks,
 Chak
 --
 View this message in context:
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Deduplication in 1.4

2009-11-26 Thread Otis Gospodnetic
Hi Martijn,

 
- Original Message 

 From: Martijn v Groningen martijn.is.h...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, November 26, 2009 3:19:40 AM
 Subject: Re: Deduplication in 1.4
 
 Field collapsing has been used by many in their production
 environment. 

Got any pointers to public sites you know use it?  I know of a high traffic 
site that used an early version, and it caused performance problems.  Is 
double-tripping still required?

 The last few months the stability of the patch grew as
 quiet some bugs were fixed. The only big feature missing currently is
 caching of the collapsing algorithm. I'm currently working on that and

Is it also full distributed-search-ready?

 I will put it in a new patch in the coming next days.  So yes the
 patch is very near being production ready.

Thanks,
Otis

 Martijn
 
 2009/11/26 KaktuChakarabati :
 
  Hey Otis,
  Yep, I realized this myself after playing some with the dedupe feature
  yesterday.
  So it does look like Field collapsing is what I need pretty much.
  Any idea on how close it is to being production-ready?
 
  Thanks,
  -Chak
 
  Otis Gospodnetic wrote:
 
  Hi,
 
  As far as I know, the point of deduplication in Solr (
  http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
  document before indexing it in order to avoid duplicates in the index in
  the first place.
 
  What you are describing is closer to field collapsing patch in SOLR-236.
 
   Otis
  --
  Sematext is hiring -- http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
  - Original Message 
  From: KaktuChakarabati 
  To: solr-user@lucene.apache.org
  Sent: Tue, November 24, 2009 5:29:00 PM
  Subject: Deduplication in 1.4
 
 
  Hey,
  I've been trying to find some documentation on using this feature in 1.4
  but
  Wiki page is alittle sparse..
  In specific, here's what i'm trying to do:
 
  I have a field, say 'duplicate_group_id' that i'll populate based on some
  offline documents deduplication process I have.
 
  All I want is for solr to compute a 'duplicate_signature' field based on
  this one at update time, so that when i search for documents later, all
  documents with same original 'duplicate_group_id' value will be rolled up
  (e.g i'll just get the first one that came back  according to relevancy).
 
  I enabled the deduplication processor and put it into updater, but i'm
  not
  seeing any difference in returned results (i.e results with same
  duplicate_id are returned separately..)
 
  is there anything i need to supply in query-time for this to take effect?
  what should be the behaviour? is there any working example of this?
 
  Anything will be helpful..
 
  Thanks,
  Chak
  --
  View this message in context:
  http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Two sites that use field-collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean with double-tripping? The sites mentioned
do not have performance problems that are caused by field collapsing.

Field-collapsing currently only supports quasi distributed
field-collapsing (as I have described on the Solr wiki). Currently I
don't know a distributed field-collapsing algorithm that works
properly and does not influence the search time in such a way that the
search becomes slow.

Martijn

2009/11/26 Otis Gospodnetic otis_gospodne...@yahoo.com:
 Hi Martijn,


 - Original Message 

 From: Martijn v Groningen martijn.is.h...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, November 26, 2009 3:19:40 AM
 Subject: Re: Deduplication in 1.4

 Field collapsing has been used by many in their production
 environment.

 Got any pointers to public sites you know use it?  I know of a high traffic 
 site that used an early version, and it caused performance problems.  Is 
 double-tripping still required?

 The last few months the stability of the patch grew as
 quiet some bugs were fixed. The only big feature missing currently is
 caching of the collapsing algorithm. I'm currently working on that and

 Is it also full distributed-search-ready?

 I will put it in a new patch in the coming next days.  So yes the
 patch is very near being production ready.

 Thanks,
 Otis

 Martijn

 2009/11/26 KaktuChakarabati :
 
  Hey Otis,
  Yep, I realized this myself after playing some with the dedupe feature
  yesterday.
  So it does look like Field collapsing is what I need pretty much.
  Any idea on how close it is to being production-ready?
 
  Thanks,
  -Chak
 
  Otis Gospodnetic wrote:
 
  Hi,
 
  As far as I know, the point of deduplication in Solr (
  http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
  document before indexing it in order to avoid duplicates in the index in
  the first place.
 
  What you are describing is closer to field collapsing patch in SOLR-236.
 
   Otis
  --
  Sematext is hiring -- http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
  - Original Message 
  From: KaktuChakarabati
  To: solr-user@lucene.apache.org
  Sent: Tue, November 24, 2009 5:29:00 PM
  Subject: Deduplication in 1.4
 
 
  Hey,
  I've been trying to find some documentation on using this feature in 1.4
  but
  Wiki page is alittle sparse..
  In specific, here's what i'm trying to do:
 
  I have a field, say 'duplicate_group_id' that i'll populate based on some
  offline documents deduplication process I have.
 
  All I want is for solr to compute a 'duplicate_signature' field based on
  this one at update time, so that when i search for documents later, all
  documents with same original 'duplicate_group_id' value will be rolled up
  (e.g i'll just get the first one that came back  according to relevancy).
 
  I enabled the deduplication processor and put it into updater, but i'm
  not
  seeing any difference in returned results (i.e results with same
  duplicate_id are returned separately..)
 
  is there anything i need to supply in query-time for this to take effect?
  what should be the behaviour? is there any working example of this?
 
  Anything will be helpful..
 
  Thanks,
  Chak
  --
  View this message in context:
  http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  View this message in context:
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 




Re: Deduplication in 1.4

2009-11-25 Thread KaktuChakarabati

Hey Otis,
Yep, I realized this myself after playing some with the dedupe feature
yesterday.
So it does look like Field collapsing is what I need pretty much.
Any idea on how close it is to being production-ready?

Thanks,
-Chak

Otis Gospodnetic wrote:
 
 Hi,
 
 As far as I know, the point of deduplication in Solr (
 http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
 document before indexing it in order to avoid duplicates in the index in
 the first place.
 
 What you are describing is closer to field collapsing patch in SOLR-236.
 
  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
 - Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4
 
 
 Hey,
 I've been trying to find some documentation on using this feature in 1.4
 but
 Wiki page is alittle sparse..
 In specific, here's what i'm trying to do:
 
 I have a field, say 'duplicate_group_id' that i'll populate based on some
 offline documents deduplication process I have.
 
 All I want is for solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when i search for documents later, all
 documents with same original 'duplicate_group_id' value will be rolled up
 (e.g i'll just get the first one that came back  according to relevancy).
 
 I enabled the deduplication processor and put it into updater, but i'm
 not
 seeing any difference in returned results (i.e results with same
 duplicate_id are returned separately..)
 
 is there anything i need to supply in query-time for this to take effect?
 what should be the behaviour? is there any working example of this?
 
 Anything will be helpful..
 
 Thanks,
 Chak
 -- 
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication in 1.4

2009-11-24 Thread Otis Gospodnetic
Hi,

As far as I know, the point of deduplication in Solr ( 
http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document 
before indexing it in order to avoid duplicates in the index in the first place.

What you are describing is closer to field collapsing patch in SOLR-236.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4
 
 
 Hey,
 I've been trying to find some documentation on using this feature in 1.4, but
 the Wiki page is a little sparse..
 Specifically, here's what I'm trying to do:
 
 I have a field, say 'duplicate_group_id', that I'll populate based on some
 offline document deduplication process I have.
 
 All I want is for Solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when I search for documents later, all
 documents with the same original 'duplicate_group_id' value will be rolled up
 (e.g. I'll just get the first one that came back according to relevancy).
 
 I enabled the deduplication processor and put it into the updater, but I'm not
 seeing any difference in the returned results (i.e. results with the same
 duplicate_id are returned separately..)
 
 Is there anything I need to supply at query time for this to take effect?
 What should be the behaviour? Is there any working example of this?
 
 Anything will be helpful..
 
 Thanks,
 Chak
 -- 
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-10 Thread Grant Ingersoll
I've seen similar errors when large background merges happen while  
looping in a result set.  See http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/




On Jan 9, 2009, at 12:50 PM, Mark Miller wrote:

Your basically writing segments more often now, and somehow avoiding  
a longer merge I think. Also, likely, deduplication is probably  
adding enough extra data to your index to hit a sweet spot where a  
merge is too long. Or something to that effect - MySql is especially  
sensitive to timeouts when doing a select * on a huge db in my  
testing. I didnt understand your answer on the autocommit - I take  
it you are using it? Or no?


All a guess, but it def points to a merge taking a bit long and  
causing a timeout. I think you can relax the MySql timeout settings  
if that is it.


I'd like to get to the bottom of this as well, so any other info you  
can provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the begining (when the error was appearing) i had  
<ramBufferSizeMB>32</ramBufferSizeMB>

and no maxBufferedDocs set

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think taht setting maxBufferedDocs to 50 I am forcing more disk  
writting
than I would like... but at least it works fine (but a bit  
slower,opiously).


I keep saying that the most weird thing is that I don't have that  
problem

using solr1.3, just with the nightly...

Even that it's good that it works well now, would be great if  
someone can

give me an explanation why this is happening


Shalin Shekhar Mangar wrote:


On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
marc.sturl...@gmail.comwrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in  
solrconfig.xml. I

can't exactly undersand why but it just worked.
Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do  
the same?





What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got  
closed by

the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and  
what did

you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.




--
Regards,
Shalin Shekhar Mangar.










--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey there,
I have been stuck on this problem for the last 3 days and have no idea how to sort it.

I am using the nightly from a week ago, MySQL and this driver and url:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use the deduplication patch with indexes of 200.000 docs with no problem.
When I try a full-import with a db of 1.500.000 it stops indexing at doc
number 15.000 approx, showing me the error posted above.
Once I get the exception, I restart Tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import. I have tried:

url=jdbc:mysql://localhost/my_db?autoReconnect=true to sort it in case the
connection was closed due to the long time until the next doc was indexed, but
nothing changed... I keep having this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
I can't imagine why dedupe would have anything to do with this, other 
than what was said: it is perhaps taking a bit longer to get a document 
to the db, and it times out (maybe a long signature calculation?). Have 
you tried changing your MySQL settings to allow for a longer timeout? 
(Sorry, I'm not too up to date on what you have tried.)
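As a concrete illustration of relaxing those timeouts (these are standard MySQL server variables; the values are only examples and should be tuned for the actual import):

  SET GLOBAL net_write_timeout = 600;
  SET GLOBAL wait_timeout = 28800;

Connector/J also exposes a netTimeoutForStreamingResults property that can be set on the JDBC URL; check the documentation for the driver version in use.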


Also, are you using autocommit during the import? If so, you might try 
turning it off for the full import.


- Mark

Marc Sturlese wrote:

Hey there,
I am stack in this problem sine 3 days ago and no idea how to sort it.

I am using the nighlty from a week ago, mysql and this driver and url:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use deduplication patch with indexs of 200.000 docs and no problem.
When I try a full-import with a db of 1.500.000 it stops indexing at doc
number 15.000 aprox showing me the error posted above.
Once I get the exception, i restart tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import, i have tryed:

url=jdbc:mysql://localhost/my_db?autoReconnect=true to sort it in case the
connection was closed due to long time until next doc was indexed, but
nothing changed... I keep having this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 


java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: 

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

hey there,
I didn't have autoCommit set to true, but I have it sorted! The error stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly understand why, but it just worked.
Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same?

Thanks


Marc Sturlese wrote:
 
 Hey there,
 I was using the Deduplication patch with Solr 1.3 release and everything
 was working perfectly. Now I upgraded to a nigthly build (20th december)
 to be able to use new facet algorithm and other stuff and DeDuplication is
 not working any more. I have followed exactly the same steps to apply the
 patch to the source code. I am geting this error:
 
 WARNING: Error reading data 
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception: 
 
 ** BEGIN NESTED EXCEPTION ** 
 
 java.io.EOFException
 
 STACKTRACE:
 
 java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: Exception while closing result set
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception: 
 
 ** BEGIN NESTED EXCEPTION ** 
 
 java.io.EOFException
 
 STACKTRACE:
 
 java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Shalin Shekhar Mangar
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese marc.sturl...@gmail.comwrote:


 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml. I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?



 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Shalin,

In the beginning (when the error was appearing) I had 
<ramBufferSizeMB>32</ramBufferSizeMB>
and no maxBufferedDocs set.

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that by setting maxBufferedDocs to 50 I am forcing more disk writing
than I would like... but at least it works fine (though a bit slower, obviously).

I keep saying that the weirdest thing is that I don't have that problem
using Solr 1.3, just with the nightly...

Even though it's good that it works well now, it would be great if someone
could give me an explanation of why this is happening.
 


Shalin Shekhar Mangar wrote:
 
 On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error
 stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml. I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


 What I find strange is this line in the exception:
 Last packet sent to the server was 202481 ms ago.
 
 Something took very very long to complete and the connection got closed by
 the time the next row was fetched from the opened resultset.
 
 Just curious, what was the previous value of maxBufferedDocs and what did
 you change it to?
 
 

 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21376235.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
You're basically writing segments more often now, and somehow avoiding a 
longer merge, I think. Also, likely, deduplication is probably adding 
enough extra data to your index to hit a sweet spot where a merge is too 
long. Or something to that effect - MySQL is especially sensitive to 
timeouts when doing a select * on a huge db in my testing. I didn't 
understand your answer on the autocommit - I take it you are using it? 
Or no?


All a guess, but it def points to a merge taking a bit long and causing 
a timeout. I think you can relax the MySql timeout settings if that is it.


I'd like to get to the bottom of this as well, so any other info you can 
provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the begining (when the error was appearing) i had 
<ramBufferSizeMB>32</ramBufferSizeMB>

and no maxBufferedDocs set

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think taht setting maxBufferedDocs to 50 I am forcing more disk writting
than I would like... but at least it works fine (but a bit slower,opiously).

I keep saying that the most weird thing is that I don't have that problem
using solr1.3, just with the nightly...

Even that it's good that it works well now, would be great if someone can
give me an explanation why this is happening
 



Shalin Shekhar Mangar wrote:
  

On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
marc.sturl...@gmail.comwrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly undersand why but it just worked.
Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


  

What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

--
Regards,
Shalin Shekhar Mangar.





  




Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Mark,
Sorry I was not enough especific, I wanted to mean that I have and I always
had autoCommit=false.
I will do some more traces and test. Will post if I have any new important
thing to mention.

Thanks.


Marc Sturlese wrote:
 
 Hey Shalin,
 
 In the begining (when the error was appearing) i had 
 <ramBufferSizeMB>32</ramBufferSizeMB>
 and no maxBufferedDocs set
 
 Now I have:
 <ramBufferSizeMB>32</ramBufferSizeMB>
 <maxBufferedDocs>50</maxBufferedDocs>
 
 I think taht setting maxBufferedDocs to 50 I am forcing more disk writting
 than I would like... but at least it works fine (but a bit
 slower,opiously).
 
 I keep saying that the most weird thing is that I don't have that problem
 using solr1.3, just with the nightly...
 
 Even that it's good that it works well now, would be great if someone can
 give me an explanation why this is happening
  
 
 
 Shalin Shekhar Mangar wrote:
 
 On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error
 stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml.
 I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the
 same?


 What I find strange is this line in the exception:
 Last packet sent to the server was 202481 ms ago.
 
 Something took very very long to complete and the connection got closed
 by
 the time the next row was fetched from the opened resultset.
 
 Just curious, what was the previous value of maxBufferedDocs and what did
 you change it to?
 
 

 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21378069.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Thanks, I will have a look at my JdbcDataSource. Anyway it's weird because
using the 1.3 release I don't have that problem...

Shalin Shekhar Mangar wrote:
 
 Yes, initially I figured that we are accidentally re-using a closed data
 source. But Noble has pinned it right. I guess you can try looking into
 your
 JDBC driver's documentation for a setting which increases the connection
 alive-ness.
 
 On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् 
 noble.p...@gmail.com wrote:
 
 I guess the indexing of a doc is taking too long (maybe because of
 the de-dup patch) and the resultset gets closed automatically (timed
 out).
 --Noble

 On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Doing this fix I get the same error :(
  
  I am going to try to set up the latest nightly build... let's see if I have
  better luck.
  
  The thing is it stops indexing at doc number 150.000 approx... and gives me
  that MySQL exception error... Without the DeDuplication patch I can index 2
  million docs without problems...
  
  I am pretty lost with this... :(
 
 
  Shalin Shekhar Mangar wrote:
 
  Yes I meant the 05/01/2008 build. The fix is a one line change
 
  Add the following as the last line of DataConfig.Entity.clearCache()
  dataSrc = null;
 
 
 
  On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
  marc.sturl...@gmail.comwrote:
 
 
  Shalin, you mean I should test the 05/01/2008 nightly? Maybe it works with
  this one? If the fix you did is not really big, can you tell me where in the
  source it is and what it is for? (I have been debugging and tracing the
  dataimporthandler source a lot and I would like to know what the improvement
  is about, if it is not a problem...)
 
  Thanks!
 
 
  Shalin Shekhar Mangar wrote:
  
   Marc, I've just committed a fix which may have caused the bug. Can
 you
  use
   svn trunk (or the next nightly build) and confirm?
  
   On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
   noble.p...@gmail.com wrote:
  
   looks like a bug w/ DIH with the recent fixes.
   --Noble
  
   On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
  marc.sturl...@gmail.com
   wrote:
   
Hey there,
I was using the Deduplication patch with Solr 1.3 release and
   everything
   was
working perfectly. Now I upgraded to a nigthly build (20th
 december)
  to
   be
able to use new facet algorithm and other stuff and
 DeDuplication
 is
   not
working any more. I have followed exactly the same steps to
 apply
  the
   patch
to the source code. I am geting this error:
   
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link
 failure
  due
   to
underlying exception:
   
** BEGIN NESTED EXCEPTION **
   
java.io.EOFException
   
STACKTRACE:
   
java.io.EOFException
   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
   at
  com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
   at
   com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
   at
  com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
   at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
   at
   
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
   at
   
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
   
   
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
   at
  com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
   at 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese


Yeah, looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandler classes... that's
why I thought the problem was not there (but can't say it for sure...)

I was thinking that the problem has something to do with the
UpdateRequestProcessorChain, but I don't know how this part of the source
works...

I am really interested in updating to the nightly build, as I think the new
facet algorithm and SolrDeletionPolicy are really great stuff!

Marc, I've just committed a fix which may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm? 
You mean the last nightly build?

Thanks
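
On the UpdateRequestProcessorChain point: the SOLR-799 deduplication patch
plugs a signature-computing processor into the update chain in solrconfig.xml,
ahead of the normal log/run processors. A minimal sketch, assuming the factory
and option names of the feature as it was later committed (the patch-era class
names may differ slightly, and the field list is only a placeholder):

    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <!-- field that receives the computed hash -->
        <str name="signatureField">signature</str>
        <!-- newer documents with the same signature overwrite older ones -->
        <bool name="overwriteDupes">true</bool>
        <!-- placeholder list of fields the hash is computed from -->
        <str name="fields">title,url,body</str>
        <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

The chain only rewrites each document before it is handed to
RunUpdateProcessorFactory and never touches the DataImportHandler classes,
which fits the timeout theory raised elsewhere in the thread: each document
takes longer to index, so the streaming MySQL result set eventually times out.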


Noble Paul നോബിള്‍ नोब्ळ् wrote:
 
 looks like a bug w/ DIH with the recent fixes.
 --Noble
 
 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:

 Hey there,
 I was using the Deduplication patch with Solr 1.3 release and everything
 was
 working perfectly. Now I upgraded to a nigthly build (20th december) to
 be
 able to use new facet algorithm and other stuff and DeDuplication is not
 working any more. I have followed exactly the same steps to apply the
 patch
 to the source code. I am geting this error:

 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Hey there,
 I was using the Deduplication patch with Solr 1.3 release and everything was
 working perfectly. Now I upgraded to a nigthly build (20th december) to be
 able to use new facet algorithm and other stuff and DeDuplication is not
 working any more. I have followed exactly the same steps to apply the patch
 to the source code. I am geting this error:

 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: Exception while closing result set
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
at com.mysql.jdbc.ResultSet.realClose(ResultSet.java:6488)
at 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess the indexing of a doc is taking too long (maybe because of
the de-dup patch) and the resultset gets closed automatically (timed
out).
--Noble

On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Donig this fix I get the same error :(

 I am going to try to set up the last nigthly build... let's see if I have
 better luck.

 The thing is it stop indexing at the doc num 150.000 aprox... and give me
 that mysql exception error... Without DeDuplication patch I can index 2
 milion docs without problems...

 I am pretty lost with this... :(


 Shalin Shekhar Mangar wrote:

 Yes I meant the 05/01/2008 build. The fix is a one line change

 Add the following as the last line of DataConfig.Entity.clearCache()
 dataSrc = null;



 On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:


 Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
 works? If the fix you did is not really big can u tell me where in the
 source is and what is it for? (I have been debuging and tracing a lot the
 dataimporthandler source and I I would like to know what the imporovement
 is
 about if it is not a problem...)

 Thanks!


 Shalin Shekhar Mangar wrote:
 
  Marc, I've just committed a fix which may have caused the bug. Can you
 use
  svn trunk (or the next nightly build) and confirm?
 
  On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
  noble.p...@gmail.com wrote:
 
  looks like a bug w/ DIH with the recent fixes.
  --Noble
 
  On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
 marc.sturl...@gmail.com
  wrote:
  
   Hey there,
   I was using the Deduplication patch with Solr 1.3 release and
  everything
  was
   working perfectly. Now I upgraded to a nigthly build (20th december)
 to
  be
   able to use new facet algorithm and other stuff and DeDuplication is
  not
   working any more. I have followed exactly the same steps to apply
 the
  patch
   to the source code. I am geting this error:
  
   WARNING: Error reading data
   com.mysql.jdbc.CommunicationsException: Communications link failure
 due
  to
   underlying exception:
  
   ** BEGIN NESTED EXCEPTION **
  
   java.io.EOFException
  
   STACKTRACE:
  
   java.io.EOFException
  at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
  at
 com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at
 com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
  at
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  
  
   ** END NESTED EXCEPTION **
   Last packet sent to the server was 202481 ms ago.
  at
 com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at
 com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Shalin, you mean I should test the 05/01/2008 nightly? Maybe it works with
this one? If the fix you did is not really big, can you tell me where in the
source it is and what it is for? (I have been debugging and tracing the
dataimporthandler source a lot, and I would like to know what the improvement
is about, if that's not a problem...)

Thanks!


Shalin Shekhar Mangar wrote:
 
 Marc, I've just committed a fix which may have caused the bug. Can you use
 svn trunk (or the next nightly build) and confirm?
 
 On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
 noble.p...@gmail.com wrote:
 
 looks like a bug w/ DIH with the recent fixes.
 --Noble

 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Hey there,
  I was using the Deduplication patch with Solr 1.3 release and
 everything
 was
  working perfectly. Now I upgraded to a nigthly build (20th december) to
 be
  able to use new facet algorithm and other stuff and DeDuplication is
 not
  working any more. I have followed exactly the same steps to apply the
 patch
  to the source code. I am geting this error:
 
  WARNING: Error reading data
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
  ** END NESTED EXCEPTION **
  Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  Jan 5, 2009 10:06:16 AM
 org.apache.solr.handler.dataimport.JdbcDataSource
  logError
  WARNING: Exception while closing result set
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Marc, I've just committed a fix for what may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 looks like a bug w/ DIH with the recent fixes.
 --Noble

 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Hey there,
  I was using the Deduplication patch with Solr 1.3 release and everything
 was
  working perfectly. Now I upgraded to a nigthly build (20th december) to
 be
  able to use new facet algorithm and other stuff and DeDuplication is not
  working any more. I have followed exactly the same steps to apply the
 patch
  to the source code. I am geting this error:
 
  WARNING: Error reading data
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
  ** END NESTED EXCEPTION **
  Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
  logError
  WARNING: Exception while closing result set
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, initially I figured that we were accidentally re-using a closed data
source. But Noble has pinned it right. I guess you can try looking into your
JDBC driver's documentation for a setting that keeps the connection alive
longer.
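
For MySQL Connector/J the relevant knobs are either URL parameters or
server-side variables such as wait_timeout / net_write_timeout. A minimal
data-config.xml sketch with placeholder host, database, query and credentials;
autoReconnect is shown only as an illustration of that kind of URL parameter,
not as a verified fix for this timeout:

    <dataConfig>
      <!-- placeholder connection details -->
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost:3306/mydb?autoReconnect=true"
                  user="solr"
                  password="secret"
                  batchSize="-1"/>
      <document>
        <entity name="item" query="SELECT id, title FROM item">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
        </entity>
      </document>
    </dataConfig>

With batchSize="-1" the MySQL driver streams the result set, and the server may
drop the connection if the client stops reading rows for longer than its net
timeouts, which lines up with the "Last packet sent to the server was 202481 ms
ago" line in the stack traces.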

On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 I guess the indexing of a doc is taking too long (may be because of
 the de-dup patch) and the resultset gets closed automaticallly (timed
 out)
 --Noble

 On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Donig this fix I get the same error :(
 
  I am going to try to set up the last nigthly build... let's see if I have
  better luck.
 
  The thing is it stop indexing at the doc num 150.000 aprox... and give me
  that mysql exception error... Without DeDuplication patch I can index 2
  milion docs without problems...
 
  I am pretty lost with this... :(
 
 
  Shalin Shekhar Mangar wrote:
 
  Yes I meant the 05/01/2008 build. The fix is a one line change
 
  Add the following as the last line of DataConfig.Entity.clearCache()
  dataSrc = null;
 
 
 
  On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
  marc.sturl...@gmail.comwrote:
 
 
  Shalin you mean I should test the 05/01/2008 nighlty? maybe with this
 one
  works? If the fix you did is not really big can u tell me where in the
  source is and what is it for? (I have been debuging and tracing a lot
 the
  dataimporthandler source and I I would like to know what the
 imporovement
  is
  about if it is not a problem...)
 
  Thanks!
 
 
  Shalin Shekhar Mangar wrote:
  
   Marc, I've just committed a fix which may have caused the bug. Can
 you
  use
   svn trunk (or the next nightly build) and confirm?
  
   On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
   noble.p...@gmail.com wrote:
  
   looks like a bug w/ DIH with the recent fixes.
   --Noble
  
   On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
  marc.sturl...@gmail.com
   wrote:
   
Hey there,
I was using the Deduplication patch with Solr 1.3 release and
   everything
   was
working perfectly. Now I upgraded to a nigthly build (20th
 december)
  to
   be
able to use new facet algorithm and other stuff and DeDuplication
 is
   not
working any more. I have followed exactly the same steps to apply
  the
   patch
to the source code. I am geting this error:
   
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link
 failure
  due
   to
underlying exception:
   
** BEGIN NESTED EXCEPTION **
   
java.io.EOFException
   
STACKTRACE:
   
java.io.EOFException
   at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
   at
  com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
   at
   com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
   at
  com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
   at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
   at
   
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
   at
   
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
   at
   
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
   at
   
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
   at
   
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
   
   
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
   at
  com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
   at
 com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
   at
   com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
   at
  com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, I meant the 05/01/2008 build. The fix is a one-line change.

Add the following as the last line of DataConfig.Entity.clearCache():
dataSrc = null;



On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese marc.sturl...@gmail.comwrote:


 Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
 works? If the fix you did is not really big can u tell me where in the
 source is and what is it for? (I have been debuging and tracing a lot the
 dataimporthandler source and I I would like to know what the imporovement
 is
 about if it is not a problem...)

 Thanks!


 Shalin Shekhar Mangar wrote:
 
  Marc, I've just committed a fix which may have caused the bug. Can you
 use
  svn trunk (or the next nightly build) and confirm?
 
  On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
  noble.p...@gmail.com wrote:
 
  looks like a bug w/ DIH with the recent fixes.
  --Noble
 
  On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
  wrote:
  
   Hey there,
   I was using the Deduplication patch with Solr 1.3 release and
  everything
  was
   working perfectly. Now I upgraded to a nigthly build (20th december)
 to
  be
   able to use new facet algorithm and other stuff and DeDuplication is
  not
   working any more. I have followed exactly the same steps to apply the
  patch
   to the source code. I am geting this error:
  
   WARNING: Error reading data
   com.mysql.jdbc.CommunicationsException: Communications link failure
 due
  to
   underlying exception:
  
   ** BEGIN NESTED EXCEPTION **
  
   java.io.EOFException
  
   STACKTRACE:
  
   java.io.EOFException
  at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
  at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
  at
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  
  
   ** END NESTED EXCEPTION **
   Last packet sent to the server was 202481 ms ago.
  at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
  at
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
  at
  
 
 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Doing this fix I get the same error :(

I am going to try to set up the last nightly build... let's see if I have
better luck.

The thing is, it stops indexing at around doc number 150,000... and gives me
that mysql exception error... Without the DeDuplication patch I can index 2
million docs without problems...

I am pretty lost with this... :(


Shalin Shekhar Mangar wrote:
 
 Yes I meant the 05/01/2008 build. The fix is a one line change
 
 Add the following as the last line of DataConfig.Entity.clearCache()
 dataSrc = null;
 
 
 
 On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
 works? If the fix you did is not really big can u tell me where in the
 source is and what is it for? (I have been debuging and tracing a lot the
 dataimporthandler source and I I would like to know what the imporovement
 is
 about if it is not a problem...)

 Thanks!


 Shalin Shekhar Mangar wrote:
 
  Marc, I've just committed a fix which may have caused the bug. Can you
 use
  svn trunk (or the next nightly build) and confirm?
 
  On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
  noble.p...@gmail.com wrote:
 
  looks like a bug w/ DIH with the recent fixes.
  --Noble
 
  On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
 marc.sturl...@gmail.com
  wrote:
  
   Hey there,
   I was using the Deduplication patch with Solr 1.3 release and
  everything
  was
   working perfectly. Now I upgraded to a nigthly build (20th december)
 to
  be
   able to use new facet algorithm and other stuff and DeDuplication is
  not
   working any more. I have followed exactly the same steps to apply
 the
  patch
   to the source code. I am geting this error:
  
   WARNING: Error reading data
   com.mysql.jdbc.CommunicationsException: Communications link failure
 due
  to
   underlying exception:
  
   ** BEGIN NESTED EXCEPTION **
  
   java.io.EOFException
  
   STACKTRACE:
  
   java.io.EOFException
  at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
  at
 com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at
 com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
  at
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
  at
  
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
  at
  
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  
  
   ** END NESTED EXCEPTION **
   Last packet sent to the server was 202481 ms ago.
  at
 com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
  at
  com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
  at
 com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
  at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
  at
  
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
  at
  
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
  at
  
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
  at
  
 
 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Yeah, looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandler classes... that's
why I thought the problem was not there (but can't say it for sure...)

I was thinking that the problem has something to do with the
UpdateRequestProcessorChain, but I don't know how this part of the source
works...

Any advice on how I could sort it out? I am really interested in updating to
the nightly build, as I think the new facet algorithm and SolrDeletionPolicy
are really great stuff!

Thanks


Noble Paul നോബിള്‍ नोब्ळ् wrote:
 
 looks like a bug w/ DIH with the recent fixes.
 --Noble
 
 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:

 Hey there,
 I was using the Deduplication patch with Solr 1.3 release and everything
 was
 working perfectly. Now I upgraded to a nigthly build (20th december) to
 be
 able to use new facet algorithm and other stuff and DeDuplication is not
 working any more. I have followed exactly the same steps to apply the
 patch
 to the source code. I am geting this error:

 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: Exception while closing result set
 com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
 underlying