Re: deduplication of suggester results is not enough
Hi Roland, I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at index time. I will publish it on GitHub in a few days and will write to this thread when it is done. m.

On Thursday, 26 March 2020 16:01:57 CET, Szűcs Roland wrote:
> Hi All,
>
> I have been following the suggester-related discussions for quite a while. Everybody agrees that it is not the expected behaviour for a suggester, where the terms are the entities rather than the documents, to return the same string representation several times.
>
> One suggestion was to do the deduplication on the client side of Solr. That is very easy in most client solutions, as any set-based data structure solves it.
>
> But one important problem is not solved by deduplication: suggest.count.
>
> If I have 15 matches from the suggester and suggest.count=10, and the first 9 matches are the same, I will get back only 2 suggestions after deduplication and the remaining 5 unique terms will never be shown.
>
> What is the solution for this?
>
> Cheers,
> Roland
deduplication of suggester results is not enough
Hi All,

I have been following the suggester-related discussions for quite a while. Everybody agrees that it is not the expected behaviour for a suggester, where the terms are the entities rather than the documents, to return the same string representation several times.

One suggestion was to do the deduplication on the client side of Solr. That is very easy in most client solutions, as any set-based data structure solves it.

But one important problem is not solved by deduplication: suggest.count.

If I have 15 matches from the suggester and suggest.count=10, and the first 9 matches are the same, I will get back only 2 suggestions after deduplication and the remaining 5 unique terms will never be shown.

What is the solution for this?

Cheers,
Roland
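[Editor's note] A common client-side workaround for the problem described above (it does not change how suggest.count is applied on the server) is to over-fetch: request noticeably more suggestions than you intend to show, deduplicate, then truncate. Below is a minimal, hypothetical Java sketch of only the client-side part; it assumes the raw suggestion strings were already fetched with an inflated suggest.count and is not a Solr API.

    // Hypothetical client-side helper: deduplicate over-fetched suggester results,
    // then truncate to the number of suggestions the UI actually wants to display.
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;

    public class SuggestionDeduper {
        public static List<String> dedupe(List<String> rawSuggestions, int wanted) {
            // LinkedHashSet drops duplicates while keeping the first-seen order
            LinkedHashSet<String> unique = new LinkedHashSet<>(rawSuggestions);
            List<String> result = new ArrayList<>();
            for (String term : unique) {
                if (result.size() >= wanted) {
                    break;
                }
                result.add(term);
            }
            return result;
        }
    }

Over-fetching only helps when the duplication factor is bounded; if a single term can fill the whole over-fetched window, unique terms can still be lost, which is exactly the limitation Roland points out.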
Atomic update deletes deduplication signature
Hello,

I am having trouble when doing atomic updates in combination with SignatureUpdateProcessorFactory (on Solr 7.2).

Normal commits of new documents work as expected and generate a valid signature:

curl "$URL/update?commit=true" -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "TEST_ID1", "description": "description", "country": "country"}}}' && curl "$URL/select?q=id:TEST_ID1"

"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"TEST_ID1",
    "description":["description"],
    "country":["country"],
    "_signature":"e577e465b9099ba8",   <-- valid signature
    "_version_":1608322850016460800}]
}}

However, when updating a field (that is not used for generating the signature) the signature is replaced by "":

curl "$URL/update?commit=true" -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "TEST_ID1", "country": {"set": "country2"}}}}' && curl "$URL/select?q=id:TEST_ID1"

"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"TEST_ID1",
    "description":["description"],
    "country":["country2"],
    "_signature":"",   <-- broken signature
    "_version_":1608322857485467648}]
}}

This looks a lot like the second problem mentioned in an old Solr JIRA issue ([1]). Unfortunately, there is no relevant response in the discussion there. Any ideas how to fix this?

Thank you,
Thomas

solrconfig.xml:
[...]
  <bool name="enabled">true</bool>
  <str name="signatureField">_signature</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">description</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>

[1] https://issues.apache.org/jira/browse/SOLR-4016
RE: Solr Cloud: query elevation + deduplication?
Hi,

I would not use the ID (uniqueKey) as the signature field; query elevation would never work properly with such a setup: change a document's content and it'll get a new ID. If I remember correctly this factory still deletes duplicates if signatureField is not the uniqueKey.

Regarding SOLR-3473, nobody seems to be working on that.

Regards,
Markus

-Original message-
> From:Ronja Koistinen <ronja.koisti...@helsinki.fi>
> Sent: Monday 5th March 2018 15:32
> To: solr-user@lucene.apache.org
> Subject: Solr Cloud: query elevation + deduplication?
>
> Hello,
>
> I am running Solr Cloud 6.6.2 and trying to get query elevation and
> deduplication (with SignatureUpdateProcessor) working at the same time.
>
> The documentation for deduplication
> (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not
> specify whether the signatureField needs to be the uniqueKey field configured
> in my schema.xml. Currently I have my uniqueKey set to the field
> containing the url of my documents.
>
> The query elevation seems to reference documents by the uniqueKey in the
> "id" attributes listed in elevate.xml, so having the uniqueKey be the
> url would be beneficial to my process of maintaining the query elevation
> list.
>
> Also, what is the status of this issue I found?
> https://issues.apache.org/jira/browse/SOLR-3473
>
> --
> Ronja Koistinen
> University of Helsinki
Solr Cloud: query elevation + deduplication?
Hello,

I am running Solr Cloud 6.6.2 and trying to get query elevation and deduplication (with SignatureUpdateProcessor) working at the same time.

The documentation for deduplication (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not specify whether the signatureField needs to be the uniqueKey field configured in my schema.xml. Currently I have my uniqueKey set to the field containing the url of my documents.

The query elevation seems to reference documents by the uniqueKey in the "id" attributes listed in elevate.xml, so having the uniqueKey be the url would be beneficial to my process of maintaining the query elevation list.

Also, what is the status of this issue I found? https://issues.apache.org/jira/browse/SOLR-3473

--
Ronja Koistinen
University of Helsinki
Re: Deduplication
Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard? That would pose some potential problems. Would a custom update processor make the solution cloud-safe? Thx, - Bram
Re: Deduplication
On 19/05/15 14:47, Alessandro Benedetti wrote:
> Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which is what the very first part of your mail seemed to describe) here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after. I essentially want a unique constraint on an arbitrary field, without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

 - Bram
Re: Deduplication
What the Solr de-duplication offers you is to calculate a hash for each input document (based on a set of fields). You can then select two options:
- index everything (documents with the same signature are simply considered equal)
- avoid the overwriting of duplicates

How the similarity hash is calculated is something you can play with and customise if needed. Having clarified that, do you think it can fit in some way, or are you definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:
> On 19/05/15 14:47, Alessandro Benedetti wrote:
>> Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which is what the very first part of your mail seemed to describe) here you can find a lot of info:
>
> Not sure whether de-duplication is the right word for what I'm after. I essentially want a unique constraint on an arbitrary field, without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.
>
>  - Bram

--
--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
Re: Deduplication
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote: Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard? That would pose some potential problems. Would a custom update processor make the solution cloud-safe? Starting with Solr 5.1, you have the ability to specify an update processor on the fly to requests and you can even control whether it is to be executed before any distribution happens or before it is actually indexed on the replica. e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to have processor xyz run first and then MyCustomUpdateProc and then the default update processor chain (which will also distribute the doc to the leader or from the leader to a replica). This also means that such processors will not be executed on the replicas at all. You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and MyCustomUpdateProc to run on each replica (including the leader) right before the doc is indexed (i.e. just before RunUpdateProcessor) Unfortunately, due to an oversight, this feature hasn't been documented well which is something I'll fix. See https://issues.apache.org/jira/browse/SOLR-6892 for more details. Thx, - Bram -- Regards, Shalin Shekhar Mangar.
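[Editor's note] For illustration, here is a hedged SolrJ sketch of what Shalin describes: attaching update processors per request via the processor / post-processor parameters. The processor name is made up for the example, and exact behaviour depends on SOLR-6892 and the Solr version in use.

    // Sketch (untested) of per-request update processors as described above.
    // "MyCustomUpdateProc" is a hypothetical processor registered in solrconfig.xml.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class PerRequestProcessorExample {
        public static void index(SolrClient client) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Run only on the node that receives the request, before any distribution:
            req.setParam("processor", "MyCustomUpdateProc");
            // Or run on every replica (including the leader) just before RunUpdateProcessor:
            // req.setParam("post-processor", "MyCustomUpdateProc");
            req.process(client);
            client.commit();
        }
    }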
Deduplication
Hi folks,

I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values.

Am I missing an obvious (or less obvious) way of accomplishing this?

Thanks,
 - Bram
Re: Deduplication
Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you do want de-duplication (which is what the very first part of your mail seemed to describe) you can find a lot of info here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know for more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam bram.van...@intix.eu:
> Hi folks,
>
> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values.
>
> Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Thanks,
>  - Bram

--
--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
Re: Deduplication
Shawn, I was going to say the same thing, but... then I was thinking about SolrCloud and the fact that update processors are invoked before the document is sent to its target node, so there wouldn't be a reliable way to tell whether the input document's field value exists on the target node rather than the current node. Or does the update processing only occur on the leader node after being forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard?

-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey apa...@elyograg.org wrote:
> On 5/19/2015 3:02 AM, Bram Van Dam wrote:
>> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values. Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing. A script update processor included with Solr allows you to write your processor in a language other than Java, such as JavaScript.
>
> https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>
> Here's how to discard a document in an update processor written in Java:
> http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor
>
> The javadoc that I linked above describes the ability to return false in other languages to discard the document.
>
> Thanks,
> Shawn
Re: Deduplication
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values. Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing. A script update processor included with Solr allows you to write your processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:
http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return false in other languages to discard the document.

Thanks,
Shawn
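[Editor's note] To make Shawn's suggestion concrete, here is a rough, hedged Java sketch of a duplicate-rejecting update processor. The field name "uniq_s" is hypothetical, the existence check only sees documents visible to the current searcher (committed or soft-committed), it is subject to races between concurrent indexing threads, and it ignores SolrCloud distribution (see the follow-ups in this thread). Instead of throwing, the processor could simply return without calling super.processAdd() to silently discard the document, as the linked Stack Overflow answer does.

    // Rough single-node sketch: reject a document when another document already has
    // the same value in the hypothetical field "uniq_s". Not SolrCloud-aware.
    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class RejectDuplicateProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object value = doc.getFieldValue("uniq_s");
                    if (value != null) {
                        // Only documents visible to the current searcher are checked.
                        int docId = req.getSearcher().getFirstMatch(new Term("uniq_s", value.toString()));
                        if (docId >= 0) {
                            throw new SolrException(SolrException.ErrorCode.CONFLICT,
                                "Duplicate value for uniq_s: " + value);
                        }
                    }
                    super.processAdd(cmd); // no duplicate found, continue the chain
                }
            };
        }
    }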
any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?
For example, consider a new, larger department formed by merging three departments. A few employees worked for two or three of the departments before the merge, which means the attributes of one person might be listed in different departments' databases. An additional problem is that one person can have different first names or nicknames. The attributes of a person include first name, last name, email, home phone, cell phone, ssn, address, etc. Because some of these values may be empty, there is no unique primary key. Hence, we need an intelligent solution for the classification, and a way to put weights on different matching rules.

Any tips for handling such fast, runtime deduplication tasks for big data (about 100 million records)? Any open-source project working on this?
Re: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?
See: https://cwiki.apache.org/confluence/display/solr/De-Duplication -- Jack Krupansky -Original Message- From: Mobius ReX Sent: Monday, March 17, 2014 1:59 PM To: solr-user@lucene.apache.org Subject: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene? For example, given a new big department merged from three departments. A few employees worked for two or three departments before merging. That means, the attributes of one person might be listed under different departments' databases. One additional problem is that one person can have different first names or nick names. These attributes of a person include first name, last name, email, home phone, cell phone, ssn, address, etc ... Because some values of the above could be empty, there is no unique primary key. Hence, we need an intelligent solution for the classification, and to put weights for different matching rules. Any tips to handle such runtime fast deduplication tasks for big data (about 100 million records)? Any open-source project working on this?
Re: Newbie question on Deduplication overWriteDupes flag
: How do I achieve, add if not there, fail if duplicate is found. I though

You can use the optimistic concurrency features to do this, by including a _version_=-1 field value in the document. This will instruct Solr that the update should only be processed if the document does not already exist...

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

-Hoss
http://www.lucidworks.com/
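[Editor's note] For illustration, a minimal SolrJ sketch of the pattern Hoss describes. Field names and the error handling are assumptions; depending on the client and version the conflict may surface as a subclass of SolrException or wrapped differently.

    // Add a document only if no document with this id exists yet, using _version_=-1.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class AddIfAbsent {
        public static boolean addIfAbsent(SolrClient client, String id) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("title_s", "example title");   // example payload field
            doc.addField("_version_", -1L);              // fail if the id already exists
            try {
                client.add(doc);
                client.commit();
                return true;
            } catch (SolrException e) {
                // 409 Conflict is expected when a document with this id already exists
                if (e.code() == SolrException.ErrorCode.CONFLICT.code) {
                    return false;
                }
                throw e;
            }
        }
    }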
Re: Newbie question on Deduplication overWriteDupes flag
A follow up question on this (as it is kind of new functionality). What happens if several documents are submitted and one of them fails due to that? Do they get rolled back or only one? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Feb 6, 2014 at 11:17 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : How do I achieve, add if not there, fail if duplicate is found. I though You can use the optimistic concurrency features to do this, by including a _version_=-1 field value in the document. this will instruct solr that the update should only be processed if the document does not already exist... https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents -Hoss http://www.lucidworks.com/
Newbie question on Deduplication overWriteDupes flag
I had a configuration where I had overwriteDupes=false. Result: I got duplicate documents in the index. When I changed to overwriteDupes=true, the duplicate documents started overwriting the older documents.

How do I achieve: add if not there, fail if a duplicate is found? I thought that overwriteDupes=false would do that.
Solr Deduplication use of overWriteDupes flag
Hello,

I had a configuration where I had overwriteDupes=false. I added a few duplicate documents. Result: I got duplicate documents in the index. When I changed to overwriteDupes=true, the duplicate documents started overwriting the older documents.

Question 1: How do I achieve [add if not there, fail if duplicate is found], i.e. mimic the behaviour of a DB which fails when trying to insert a record that violates some unique constraint? I thought that overwriteDupes=false would do that, but apparently not.

Question 2: Is there some documentation around overwriteDupes? I have checked the existing wiki; there is very little explanation of the flag there.

Thanks,
-Amit
Custom update handler with deduplication
Currently I have the following update request processor chain to prevent indexing very similar text items into a core dedicated to storing the queries that our users put into the web interface of our system:

<!-- Delete similar duplicated documents at index time, using some fuzzy text similarity techniques -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">textsuggest,textng</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Right now we are trying to implement a custom update request handler to keep track of how many times any given query hits our Solr server; in plain terms, we want to keep a field that counts how many times we have tried to insert the same query. We are using Solr 3.6, so how can we use (from the code of our custom update handler) the deduplication request processor to check whether the query we are trying to insert/update already exists?

Greetings!
Re: Custom update handler with deduplication
Firstly, I see that you have overwriteDupes=false in your configuration. This means that a signature will be generated but the similar documents will still be added to the index.

Now to your main question about counting duplicate attempts: one simple way is to have another UpdateRequestProcessor after the SignatureUpdateProcessor which keeps a map of signature to count. You can even keep this counter inside the Solr document itself, first reading the old counter value by querying the signatureField and then writing the new value into the new document. Be careful about race conditions if you're reading from the index, because indexing can happen in multiple threads.

On Mon, Dec 16, 2013 at 9:01 AM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:
> Currently I have the following update request processor chain to prevent indexing very similar text items into a core dedicated to storing the queries that our users put into the web interface of our system:
>
> <!-- Delete similar duplicated documents at index time, using some fuzzy text similarity techniques -->
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">textsuggest,textng</str>
>     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Right now we are trying to implement a custom update request handler to keep track of how many times any given query hits our Solr server; in plain terms, we want to keep a field that counts how many times we have tried to insert the same query. We are using Solr 3.6, so how can we use (from the code of our custom update handler) the deduplication request processor to check whether the query we are trying to insert/update already exists?
>
> Greetings!

--
Regards,
Shalin Shekhar Mangar.
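[Editor's note] As a hedged illustration of the second option Shalin mentions (keeping the counter in the document), here is a rough Java sketch. It assumes the SignatureUpdateProcessor has already populated the "signature" field on the incoming document, uses a hypothetical stored field "hit_count", only sees committed documents, and ignores the race conditions Shalin warns about.

    // Rough sketch: read the previous hit_count for this signature from the index and
    // write an incremented value into the new document. Field names are examples.
    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class QueryHitCountProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object sig = doc.getFieldValue("signature");
                    long count = 1;
                    if (sig != null) {
                        SolrIndexSearcher searcher = req.getSearcher();
                        int docId = searcher.getFirstMatch(new Term("signature", sig.toString()));
                        if (docId >= 0) {
                            String previous = searcher.doc(docId).get("hit_count");
                            if (previous != null) {
                                count = Long.parseLong(previous) + 1;
                            }
                        }
                    }
                    doc.setField("hit_count", count);
                    super.processAdd(cmd);
                }
            };
        }
    }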
Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
I use Solr 4.2.1 as SolrCloud. I crawl huge amounts of data with Nutch and index them with SolrCloud. I wonder about Solr's deduplication mechanism: what exactly does it do, and does it result in slow indexing, or is it beneficial for my situation?
RE: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
Distributed deduplication does not work right now: https://issues.apache.org/jira/browse/SOLR-3473

We've chosen not to use update processors for deduplication anymore and rely on several custom mapreduce jobs in Nutch and some custom collectors in Solr to do some on-demand online deduplication. If SOLR-3473 is fixed you can get very decent deduplication.

-Original message-
From:Furkan KAMACI furkankam...@gmail.com
Sent: Thu 02-May-2013 22:30
To: solr-user@lucene.apache.org
Subject: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

> I use Solr 4.2.1 as SolrCloud. I crawl huge amounts of data with Nutch and index them with SolrCloud. I wonder about Solr's deduplication mechanism: what exactly does it do, and does it result in slow indexing, or is it beneficial for my situation?
Deduplication in SolrCloud
Hi,

in my old Solr setup I used the deduplication feature in the update chain with a couple of fields:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">uuid,type,url,content_hash</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

This worked fine. When I now use this in my 2-shard SolrCloud setup and insert 150,000 documents, I always get an error:

INFO: end_commit_flush
Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)

I am inserting the documents via CSV import and the curl command and split them into 50k chunks. Without the dedupe chain, the import finishes after 40 seconds. The curl command writes to one of my shards.

Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as the dedupe field could be an issue?

I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly; see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

Thanks & regards
Daniel
RE: Deduplication in SolrCloud
This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Daniel Brügge daniel.brue...@googlemail.com Sent: Fri 27-Jul-2012 17:38 To: solr-user@lucene.apache.org Subject: Deduplication in SolrCloud Hi, in my old Solr Setup I have used the deduplication feature in the update chain with couple of fields. updateRequestProcessorChain name=dedupe processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldsuuid,type,url,content_hash/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain This worked fine. When I now use this in my 2 shards SolrCloud setup when inserting 150.000 documents, I am always getting an error: *INFO: end_commit_flush* *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log* *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread* * at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456) * * at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284) * I am inserting the documents via CSV import and curl command and split them also into 50k chunks. Without the dedupe chain, the import finishes after 40secs. The curl command writes to one of my shards. Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as dedupe fields could be an issue? I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly? see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html Thanks regards Daniel
Re: Deduplication in SolrCloud
Should the old Signature code be removed? Given that the goal is to have everyone use SolrCloud, maybe this kind of landmine should be removed? On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma markus.jel...@openindex.io wrote: This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Daniel Brügge daniel.brue...@googlemail.com Sent: Fri 27-Jul-2012 17:38 To: solr-user@lucene.apache.org Subject: Deduplication in SolrCloud Hi, in my old Solr Setup I have used the deduplication feature in the update chain with couple of fields. updateRequestProcessorChain name=dedupe processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldsuuid,type,url,content_hash/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain This worked fine. When I now use this in my 2 shards SolrCloud setup when inserting 150.000 documents, I am always getting an error: *INFO: end_commit_flush* *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log* *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread* * at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456) * * at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284) * I am inserting the documents via CSV import and curl command and split them also into 50k chunks. Without the dedupe chain, the import finishes after 40secs. The curl command writes to one of my shards. Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as dedupe fields could be an issue? I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly? see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html Thanks regards Daniel -- Lance Norskog goks...@gmail.com
Deduplication in MLT
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same? *Pranav Prakash* temet nosce
RE: SolrCloud deduplication
Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
Hi again, It seemed to work fine but in the end duplicates are not overwritten. We first run the SignatureProcessor and then the DistributedProcessor. If we do it the other way around the digest field receives multiple values and throws errors. Is there anything else we can do or another patch to try? Thanks Markus -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Mon 21-May-2012 15:58 To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com Subject: RE: SolrCloud deduplication Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Mark Miller markrmil...@gmail.com Sent: Mon 21-May-2012 18:11 To: solr-user@lucene.apache.org Subject: Re: SolrCloud deduplication Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command. I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command. - Mark On May 21, 2012, at 10:39 AM, Markus Jelsma wrote: Hi again, It seemed to work fine but in the end duplicates are not overwritten. We first run the SignatureProcessor and then the DistributedProcessor. If we do it the other way around the digest field receives multiple values and throws errors. Is there anything else we can do or another patch to try? Thanks Markus -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Mon 21-May-2012 15:58 To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com Subject: RE: SolrCloud deduplication Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
SolrCloud deduplication
Hi,

Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it were a multi-valued field. If the field is not multi-valued you'll get this typical error. Changing the order of URPs in the chain does not solve the problem.

Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor, and does it need to be updated to work with SolrCloud?

Thanks,
Markus
Re: SolrCloud deduplication
Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
Hi, Interesting! I'm watching the issues and will test as soon as they are committed. Thanks! -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
: Interesting! I'm watching the issues and will test as soon as they are committed.

FWIW: it's a chicken-and-egg problem -- if you could test out the patch in SOLR-2822 with your real world use case / configs, and comment on its effectiveness, that would go a long way towards my confidence in it.

-Hoss
RE: SolrCloud deduplication
you're right. I'll test the patch as soon as possible. Thanks! -Original message- From:Chris Hostetter hossman_luc...@fucit.org Sent: Fri 18-May-2012 18:20 To: solr-user@lucene.apache.org Subject: RE: SolrCloud deduplication : Interesting! I'm watching the issues and will test as soon as they are committed. FWIW: it's a chicken and egg problem -- if you could test out the patch in SOLR-2822 with your real world use case / configs, and comment on it's effectiveness, that would go a long way towards my confidence in it. -Hoss
Re: null pointer error with solr deduplication
A better error would be nicer. In the past, when I have had docs with the same id on multiple shards, I never saw an NPE problem. A lot has changed since then though. I guess, to me, checking if the id is stored sticks out a bit more. Roughly based on the stacktrace, it looks to me like it's not finding an id value and that is causing the NPE. If it's a legit problem we should probably make a JIRA issue about improving the error message you end up getting. -- - Mark http://www.lucidimagination.com On Sat, Apr 21, 2012 at 5:21 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. 
Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: null pointer error with solr deduplication
Thanks for the response. Yes, I agree with you that I have to check for the uniqueness of doc ids but our requirement is such that we need to send it to solr and I know that solr discards duplicate documents and it does not work fine when we manually create the unique id. But I just wanted to report the error since in this scenario (i guess the components for deduplication are pretty new), it would probably help the devs to make the behavior more deterministic towards duplicate documents. On Sat, Apr 21, 2012 at 2:21 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. 
Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: null pointer error with solr deduplication
Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
null pointer error with solr deduplication
Hello,

I have been trying out deduplication in Solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the signature created from a few other fields in a document, and the idea seems to work like a charm in a single Solr instance. But when I have multiple cores and try to do a distributed search (http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id) I get the error pasted below. While a normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both of the cores (dedupe, dedupe2). Any insights would be highly appreciated.

Thanks

20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
        at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
        at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Re: Similar documents and advantages / disadvantages of MLT / Deduplication
: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.

Do you actually want all 1000 docs in your index, or do you want to prevent 4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of identifying the 5 similar docs, then use that at index time.

If you want to keep 4 of the 5 out of the index, then use the Deduplication features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to mark docs as similar ... do you want only one of those docs to appear in all of your results, or do you want all of them in the results but with an indication that there are other similar docs? If the former: take a look at Grouping and group on your signature field. If the latter: use the MLT component to find similar docs based on the signature field (ie: mlt.fl=signature_t).

https://wiki.apache.org/solr/FieldCollapsing

-Hoss
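[Editor's note] As a hedged illustration of the first option (one representative per near-duplicate group), here is a small SolrJ sketch that groups on the signature field. It assumes the signature field is indexed as a single token (e.g. a string-style field) so that grouping on it is meaningful; the field name matches the configuration quoted in this thread and the rest is an example.

    // Collapse search results so only one document per signature group is returned.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupBySignatureExample {
        public static QueryResponse search(SolrClient client, String userQuery) throws Exception {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("group", true);                 // enable result grouping
            q.set("group.field", "signature_t");  // the field written by the dedupe chain
            q.set("group.limit", 1);              // one representative doc per group
            return client.query(q);
        }
    }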
Similar documents and advantages / disadvantages of MLT / Deduplication
Hello folks, I have questions about MLT and Deduplication and what would be the best choice in my case.

Case: I index 1000 docs, 5 of them are 95% the same (for example: copy-pasted blog articles from different sources, with slight changes (author name, etc.)). But they have differences. *Now I would like to see 1 doc in my result set and the other 4 should be marked as similar.*

With *MLT*:

  <str name="mlt.fl">text</str>
  <int name="mlt.minwl">5</int>
  <int name="mlt.maxwl">50</int>
  <int name="mlt.maxqt">3</int>
  <int name="mlt.maxntp">5000</int>
  <bool name="mlt.boost">true</bool>
  <str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this 1 doc, unfortunately too much.

*Deduplication*: I index these docs now with a signature, using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures? I only want to see the 5 similar docs, nothing else. Which of these two cases is relevant to me? Can I tune the MLT for my requirement? Or should I use Dedupe? Thanks and Regards Vadim
Re: A good signature class for deduplication
: I want to deduplicate documents from search results. What should be the : parameters on which I should decide an efficient SignatureClass? Also, what : are the SignatureClasses available?

The signature classes available are the ones mentioned on the wiki... https://wiki.apache.org/solr/Deduplication ...which one you should choose, and which fields you feed it, depend entirely on your goal -- if you want to deduplicate anytime both the user_fname and user_lname fields are exactly the same, then use those fields with either the MD5Signature or the Lookup3Signature (lookup3 is faster, but some people want MD5 because they want to use the computed MD5 for other things). If you want to detect when some much longer body field containing a lot of full text is *nearly* identical, then you should consider the TextProfileSignature -- how exactly it works and how you tune it I don't know off the top of my head. -Hoss
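A sketch of what the exact-match case could look like in solrconfig.xml; the field names (user_fname, user_lname, signature) are just the example names from above and should be adjusted to your schema:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">user_fname,user_lname</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

For the near-duplicate case, point fields at the long body field and swap the signatureClass for solr.processor.TextProfileSignature.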
Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case
Solr 3.3 has a Grouping feature. Is it practically the same as deduplication? Here is my use case for duplicate removal - We have many documents with similar (up to 99%) content. Upon some search queries, almost all of them come up in the first page of results. Of all these documents, essentially one is the original and the others are duplicates. We are able to identify the original on the basis of a number of factors - who uploaded it, when, how many viral shares. It is also possible that the duplicates are uploaded earlier (and hence exist in the search index) while the original is uploaded later (and gets added later to the index). AFAIK, Deduplication targets index time. Is there a way to specify the original, which should be returned, and the duplicates, which should be kept from coming up? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case
Deduplication uses Lucene's IndexWriter.updateDocument with the signature term. I don't think it's possible, as a default feature, to choose which document to index; the original would always have to be the last one indexed. From the IndexWriter.updateDocument javadoc: "Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add)." With grouping you have all your documents indexed, so it gives you more flexibility. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-3-Grouping-vs-DeDuplication-and-Deduplication-Use-Case-tp3294711p3295023.html Sent from the Solr - User mailing list archive at Nabble.com.
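If choosing the "original" at query time is acceptable, a hedged sketch of that flexibility: keep every document, group on the signature, and sort within each group by whatever marks the original. The signature field name and the viral_shares sort field are hypothetical, and group.sort assumes a Solr version whose grouping supports it:

http://localhost:8983/solr/select?q=foo&group=true&group.field=signature&group.limit=1&group.sort=viral_shares+desc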
Re: How to combine Deduplication and Elevation
: Hi I have a question. How to combine the Deduplication and Elevation : implementations in Solr. Currently, I have managed to implement only one of them at a time.

Can you elaborate a bit more on what exactly you've tried and what problem you are facing? The SignatureUpdateProcessorFactory (which is used for Deduplication) and the QueryElevation component should work just fine together -- in fact: one is used at index time and the other at query time, so there shouldn't be any interaction at all... http://wiki.apache.org/solr/Deduplication http://wiki.apache.org/solr/QueryElevationComponent -Hoss
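For reference, a minimal sketch of the two halves wired together, following the standard Solr example configs; treat it as a starting point rather than a drop-in configuration:

<!-- index time: run the dedupe chain on every update -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<!-- query time: elevation -->
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler">
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>

The dedupe chain itself is the usual SignatureUpdateProcessorFactory chain shown elsewhere in these threads; the two pieces never touch the same request.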
How to combine Deduplication and Elevation
Hi I have a question. How to combine the Deduplication and Elevation implementations in Solr. Currently , I managed to implement either one only. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-combine-Deduplication-and-Elevation-tp2819621p2819621.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication questions
: Q1. Is it possible to pass *analyzed* content to the : : public abstract class Signature {

No, analysis happens as the documents are being written to the Lucene index, well after the UpdateProcessors have had a chance to interact with the values.

: Q2. Method calculate() is using concatenated fields from <str name="fields">name,features,cat</str> : Is there any mechanism I could use to build field-dependent signatures?

At the moment the Signature API is fairly minimal, but it could definitely be improved by adding more methods (with sensible defaults in the base class) that would give the impl more control over the resulting signature ... we just need people to propose good suggestions with example use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but : would work)

I don't know whether what you describe is really intended or not, but it should work -Hoss
Re: Question about http://wiki.apache.org/solr/Deduplication
Thanks Hoss, externalizing this part is exactly the path we are exploring now, not only for this reason. We already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for mortal collection sizes of 200 Mio, but on the other side, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves and fronting clients accept updates, but only to forward them to the WAL). The Solr master tries to catch up with the update stream, indexing in async fashion, and finally the Solr slaves chase the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize and sort the update stream and reindex at the end. Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be a nice-to-have (for HA, update scalability...). The charming thing about a WAL is that updates never wait/disappear; with too much traffic we only get slightly higher update latency, but updates definitely get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr etc...) should be supported in this case, a sort of smallish Hadoop feature subset for Solr clusters, but nothing oversized. Cheers, eks

On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : Is it possible in solr to have multivalued id? Or I need to make my : own mv_ID for this? Any ideas how to achieve this efficiently? This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication by changing the low-level update (implemented as a delete then add) so that the key used to delete the older documents is based on the signature field instead of the id field. In order to do what you are describing, you would need to query the index for matching signatures, then add the resulting ids to your document before doing that update. You could possibly do this in a custom UpdateProcessor, but you'd have to do something tricky to ensure you didn't overlook docs that had been added but not yet committed when checking for dups. I don't have a good suggestion for how to do this internally in Solr -- it seems like the type of bulk processing logic that would be better suited for an external process before you ever start indexing (much like link analysis for back references) -Hoss
Re: Question about http://wiki.apache.org/solr/Deduplication
: Is it possible in solr to have multivalued id? Or I need to make my : own mv_ID for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication by changing the low-level update (implemented as a delete then add) so that the key used to delete the older documents is based on the signature field instead of the id field. In order to do what you are describing, you would need to query the index for matching signatures, then add the resulting ids to your document before doing that update. You could possibly do this in a custom UpdateProcessor, but you'd have to do something tricky to ensure you didn't overlook docs that had been added but not yet committed when checking for dups. I don't have a good suggestion for how to do this internally in Solr -- it seems like the type of bulk processing logic that would be better suited for an external process before you ever start indexing (much like link analysis for back references) -Hoss
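A sketch of the lookup step described above, assuming the signature value is known before the add (the field name and value are placeholders from the example in the question):

http://localhost:8983/solr/select?q=signature:A&fl=id&rows=1000

The ids returned would then be merged into the incoming document's multivalued id field before sending the add, with the caveat noted above about documents that are added but not yet committed.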
Deduplication questions
Q1. Is it possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) { }
  public abstract String calculate(String content);
}

Q2. Method calculate() is using concatenated fields from <str name="fields">name,features,cat</str>. Is there any mechanism I could use to build field-dependent signatures?

Use case for this: I have two fields: OWNER, TEXT. I need to disable *fuzzy* duplicates for one owner; one clean way would be to make a prefixed signature OWNER/FUZZY_SIGNATURE.

Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but would work)

<updateRequestProcessorChain name="signature_hard">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">exact_signature</str>
    <str name="fields">OWNER</str>
    <str name="signatureClass">ExactSignature</str>
  </processor>
</updateRequestProcessorChain>

hard_signature should be a not-stored and not-indexed field

<updateRequestProcessorChain name="fuzzy_and_mix">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">mixed_signature</str>
    <str name="fields">exact_signature, TEXT</str>
    <str name="signatureClass">MixedSignature</str>
  </processor>
</updateRequestProcessorChain>

<field name="hard_signature" type="string" stored="false" indexed="false" multiValued="false" />
<field name="mixed_signature" type="string" stored="true" indexed="true" multiValued="false" />

Assuming I know how long my exact_signature is, I could calculate the fuzzy part and mix it properly. Possible? Better ideas? Thanks, eks
Question about http://wiki.apache.org/solr/Deduplication
Hi, Use case I am trying to figure out is about preserving IDs without re-indexing on duplicate, rather adding this new ID under list of document id aliases. Example: Input collection: id:1, text:dummy text 1, signature:A id:2, text:dummy text 1, signature:A I add the first document in empty index, text is going to be indexed, ID is going to be 1, so far so good Now the question, if I add second document with id == 2, instead of deleting/indexing this new document, I would like to store id == 2 in multivalued Field id At the end, I would have one document less indexed and both ID are going to be searchable (and stored as well)... Is it possible in solr to have multivalued id? Or I need to make my own mv_ID for this? Any ideas how to achieve this efficiently? My target is not to add new documents if signature matches, but to have IDs indexed and stored? Thanks, eks
SOLR deduplication
Hi - I have the SOLR deduplication configured and working well. Is there any way I can tell which documents have not been added to the index as a result of the deduplication rejecting subsequent identical documents? Many Thanks Jason Brown.
Re: SOLR deduplication
Not right now: https://issues.apache.org/jira/browse/SOLR-1909 Hi - I have the SOLR deduplication configured and working well. Is there any way I can tell which documents have not been added to the index as a result of the deduplication rejecting subsequent identical documents? Many Thanks Jason Brown.
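Until SOLR-1909 lands, one hedged workaround is to index with overwriteDupes=false, so duplicates are kept but share a signature value, and then spot them with a facet query afterwards (the signature field name is whatever your dedupe chain writes to):

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=signature&facet.mincount=2

Every facet value with a count above 1 is a group of documents the dedupe chain would otherwise have collapsed.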
Re: Is deduplication possible during Tika extract?
In my opinion it should work for every update handler. If you're really sure your configuration if fine and it still doesn't work you might have to file an issue. Your configuration looks alright but don't forget you've configured overwriteDupes=false! Hello, here is an excerpt of my solrconfig.xml: requestHandler name=/update/extract class=org.apache.solr.handler.extraction.ExtractingRequestHandler startup=lazy lst name=defaults str name=update.processordedupe/str !-- All the main content goes into text... if you need to return the extracted text or do highlighting, use a stored field. -- str name=fmap.contenttext/str str name=lowernamestrue/str str name=uprefixignored_/str !-- capture link hrefs but ignore div attributes -- str name=captureAttrtrue/str str name=fmap.alinks/str str name=fmap.divignored_/str /lst /requestHandler and updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClassorg.apache.solr.update.processor.TextProfileSignature /str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain deduplication works when I use only /update but not when solr does an extract with Tika! Is deduplication possible during Tika extract? Thanks in advance, Arno
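One more thing worth checking if the chain is silently ignored: on Solr releases after 1.4 the request parameter that selects the chain was renamed from update.processor to update.chain, so on newer versions the defaults block would look roughly like this (a sketch, not a verified config for any particular release):

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>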
Is deduplication possible during Tika extract?
Hello, here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
    <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Deduplication works when I use only /update but not when Solr does an extract with Tika! Is deduplication possible during a Tika extract? Thanks in advance, Arno
Solr Deduplication and Field Collapsing
All, I have set up Nutch to submit the crawl results to a Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I have used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true', because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that do not have the digest field populated. Probably because they all will have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything? In any case, given the scenario I thought I would set 'overwritedupes=false' and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation I was adding the query string group=true&group.field=sig or group=true&group.field=digest to my overall query in the admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the Nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is set up. This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side. Thanks so much in advance for your help. Here is my configuration:

SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
RE: Solr Deduplication and Field Collapsing
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route but correct me if i'm wrong. Cheers, -Original message- From: Nemani, Raj raj.nem...@turner.com Sent: Tue 28-09-2010 23:28 To: solr-user@lucene.apache.org; Subject: Solr Deduplication and Field Collpasing All, I have setup Nutch to submit the crawl results to Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is filed 'digest' that Nutch generates that is same for those documents that are duplicates. While setting up the the dedupe processor in the Solr config file, I have used this 'Digest' field in the following way(see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true because for non-Nutch generated documents the digest field will not be populated and I found that Solr deletes every one of those documents that do not have the digest filed populated. Probably because they all will have the same 'sig' filed value generated based on an 'empty' digest field forcing Solr to delete everything? In any case, given the scenario I though I would set 'overwritedupes=false' and use field collapsing based on digest or sig filed but I could not get filed collapsing to work. Based on the wiki documentation I was adding the query string group=truegroup.filed=sig group=truegroup.filed=digest to my over all query in admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is setup. This is reason for me to try deduplication. I cannot submit SolrDedup command from Nutch because non-Nutch generated documents do not have digest filed populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to do try deduplication on Solr side. Thanks so much in advance for your help. Here is my configuration: SolrConfig.xml updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsig/str bool name=overwriteDupesfalse/bool str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature /str str name=fieldsdigest/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler Schema.xml field name=sig type=string stored=true indexed=true multiValued=true / Thanks so much for your help
Re: Solr Deduplication and Field Collapsing
I have the digest field already in the schema because the index is shared between nutch docs and others. I do not know if the second approach is the quickest in my case. I can set the digest value to something unique for non nutch documets easily (I have an I'd field that I can use to populate the digest field during indxing of new non_nutch documets. I have custom tool that does the indexing of these docs). But I have more than3 millon documents in the index already that I don't want start over with new indexing again if I don't have to. Is there a way I can update the digest field with the value from the corresponding I'd field using solr? Thanks Raj - Original Message - From: Markus Jelsma markus.jel...@buyways.nl To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Tue Sep 28 18:19:17 2010 Subject: RE: Solr Deduplication and Field Collpasing You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route but correct me if i'm wrong. Cheers, -Original message- From: Nemani, Raj raj.nem...@turner.com Sent: Tue 28-09-2010 23:28 To: solr-user@lucene.apache.org; Subject: Solr Deduplication and Field Collpasing All, I have setup Nutch to submit the crawl results to Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is filed 'digest' that Nutch generates that is same for those documents that are duplicates. While setting up the the dedupe processor in the Solr config file, I have used this 'Digest' field in the following way(see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true because for non-Nutch generated documents the digest field will not be populated and I found that Solr deletes every one of those documents that do not have the digest filed populated. Probably because they all will have the same 'sig' filed value generated based on an 'empty' digest field forcing Solr to delete everything? In any case, given the scenario I though I would set 'overwritedupes=false' and use field collapsing based on digest or sig filed but I could not get filed collapsing to work. Based on the wiki documentation I was adding the query string group=truegroup.filed=sig group=truegroup.filed=digest to my over all query in admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is setup. This is reason for me to try deduplication. I cannot submit SolrDedup command from Nutch because non-Nutch generated documents do not have digest filed populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to do try deduplication on Solr side. Thanks so much in advance for your help. 
Here is my configuration: SolrConfig.xml updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsig/str bool name=overwriteDupesfalse/bool str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature /str str name=fieldsdigest/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler Schema.xml
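For the follow-up question about copying the id into the digest field without a full reindex: Solr 1.4 has no stock processor for that, but later releases (4.x and up) ship solr.CloneFieldUpdateProcessorFactory, which could be dropped in front of the signature processor. A hedged sketch, assuming such a version and the field names used in this thread:

<updateRequestProcessorChain name="dedupe">
  <!-- copy id into digest so non-Nutch documents get a unique digest value -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">id</str>
    <str name="dest">digest</str>
  </processor>
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Note the clone runs for every document, so Nutch documents that already carry a digest would pick up a second value unless it is trimmed again (for example with a first-value processor); on 1.4 the custom-UpdateProcessor route described above remains the realistic option.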
Re: Deduplication
Basically for some use cases I would like to show duplicates, for others I want them ignored. If I have overwriteDupes=false and I just create the dedup hash, how can I query for only unique hash values... ie something like a SQL group by. TermsComponent maybe? or faceting? q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 if you append facet.mincount=1 to the above url you can see your duplications
Re: Deduplication
TermsComponent maybe? or faceting? q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 if you append facet.mincount=1 to the above url you can see your duplications After re-reading your message: sometimes you want to show duplicates, sometimes you don't want them. I have never used FieldCollapsing myself but have heard about it many times. http://wiki.apache.org/solr/FieldCollapsing
Deduplication
Basically for some uses cases I would like to show duplicates for other I wanted them ignored. If I have overwriteDupes=false and I just create the dedup hash how can I query for only unique hash values... ie something like a SQL group by. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Deduplication-tp828016p828016.html Sent from the Solr - User mailing list archive at Nabble.com.
Config issue for deduplication
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus

I did: - create a duplicated set of records, only shifted their ID by a fixed number

--- solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">dedupeHash</str>
    <str name="fields">reference,issn</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

--- In schema.xml I added the field

<field name="dedupeHash" type="string" stored="true" indexed="true" multiValued="false" />

-- If I look at the created field dedupeHash it seems to be empty...!?
Re: Config issue for deduplication
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication.

Does "being imported" mean you are using DataImportHandler? If yes you can use this to enable DIH with dedupe:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
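With that defaults block in place, an ordinary import run should pass every document through the dedupe chain; for example (host and handler path as registered above):

http://localhost:8983/solr/dataimport?command=full-import&commit=true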
Re: Config issue for deduplication
Hmm, I can't find anything in solrconfig.xml about DataImportHandler for Vufind. So I suppose, no, the import function does not use this method. Import is done by a script. Maybe I am not associating

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

with the correct requestHandler? I placed it directly after

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

so I kind of have this line twice. Markus

Ahmet Arslan schrieb: I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. Does "being imported" mean you are using DataImportHandler? If yes you can use this to enable DIH with dedupe:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
RE: Config issue for deduplication
What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!?
Re: Config issue for deduplication
I use bool name=overwriteDupestrue/bool and a different field than ID to control duplication. This is about bibliographic data coming from different sources with different IDs which may have the same content... I attached solrconfig.xml if you want to take a look. Thanks a lot! Markus Markus Jelsma schrieb: What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!? ?xml version=1.0 encoding=UTF-8 ? !-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the License); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -- config !-- Set this to 'false' if you want solr to continue working after it has encountered an severe configuration error. In a production environment, you may want solr to keep working even if one handler is mis-configured. You may also set this to false using by setting the system property: -Dsolr.abortOnConfigurationError=false -- abortOnConfigurationError${solr.abortOnConfigurationError:false}/abortOnConfigurationError !-- Used to specify an alternate directory to hold all index data other than the default ./data under the Solr home. If replication is in use, this should match the replication configuration. -- dataDir${solr.solr.home:./solr}/biblio/dataDir indexDefaults !-- Values here affect all index writers and act as a default unless overridden. -- useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor !-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first. 
-- !--maxBufferedDocs1000/maxBufferedDocs-- !-- Tell Lucene when to flush documents to disk. Giving Lucene more memory for indexing means faster indexing at the cost of more RAM If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first. -- ramBufferSizeMB32/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout !-- Expert: Turn on Lucene's auto commit capability. TODO: Add recommendations on why you would want to do this. NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality -- !--luceneAutoCommitfalse/luceneAutoCommit-- !-- Expert: The Merge Policy in Lucene controls how merging is handled by Lucene. The default in 2.3 is the LogByteSizeMergePolicy, previous versions used LogDocMergePolicy. LogByteSizeMergePolicy chooses segments to merge based on their size. The Lucene 2.2 default, LogDocMergePolicy chose when to merge based on number of documents Other implementations of MergePolicy must have a no-argument constructor -- !--mergePolicyorg.apache.lucene.index.LogByteSizeMergePolicy
Re: [resolved] Config issue for deduplication
Got it with the help of Demian Katz, main developper of Vufind: The import script of Vufind was bypassing the duplication parameters while writing directly to the SOLR-Index. By deactivitating direct writing to the index and using the standard way it now works! Thanks to all who gave input! Markus Markus Fischer schrieb: I use bool name=overwriteDupestrue/bool and a different field than ID to control duplication. This is about bibliographic data coming from different sources with different IDs which may have the same content... I attached solrconfig.xml if you want to take a look. Thanks a lot! Markus Markus Jelsma schrieb: What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!?
Re: Solr Cell and Deduplication - Get ID of doc
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request. I am doing this with PHP cURL for extraction and pecl php solr for querying. I am then saving the unique id and dupe hash in a MySQL table which I check against after the doc is indexed in Solr. If it is a dupe I delete the Solr record and discard the file. My problem now is the dupe hash sometimes comes back NULL from Solr although when I check it through Solr Admin it is there. I am working through this now to isolate. I had to set Solr to ALLOW duplicates because I have to somehow know that the file is a dupe and then remove the duplicate files on my filesystem. Based on the extract response I have no way of knowing this if duplicates are disallowed. -Bill On Tue, Mar 2, 2010 at 2:11 AM, Chris Hostetter hossman_luc...@fucit.orgwrote: : To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after adding a doc. Using a unique literal.field value will work -- but only as the value of a secondary field that he can then query on to get the uniqueKeyField value. : : You could create your own unique ID and pass it in with the : : literal.field=value feature. : : By which Lance means you could specify an unique value in a differnet : field from yoru uniqueKey field, and then query on that field:value pair : to get the doc after it's been added -- but that query will only work : until some other version of the doc (with some other value) overwrites it. : so you'd esentially have to query for the field:value to lookup the : uniqueKey. : : it seems like it should definitely be feasible for the : Update RequestHandlers to return the uniqueKeyField values for all the : added docs (regardless of wether the key was included in the request, or : added by an UpdateProcessor -- but i'm not sure how that would fit in with : the SolrJ API. : : would you mind opening a feature request in Jira? : : : : -Hoss : : : : : : -- : Lance Norskog : goks...@gmail.com : -Hoss
Re: Solr Cell and Deduplication - Get ID of doc
: You could create your own unique ID and pass it in with the : literal.field=value feature.

By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will only work until some other version of the doc (with some other value) overwrites it. So you'd essentially have to query for the field:value to look up the uniqueKey. It seems like it should definitely be feasible for the Update RequestHandlers to return the uniqueKeyField values for all the added docs (regardless of whether the key was included in the request, or added by an UpdateProcessor) -- but I'm not sure how that would fit in with the SolrJ API. Would you mind opening a feature request in Jira? -Hoss
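Concretely, the workaround might look like this: pass a tracking value at extract time via the literal.* mechanism and then query it back to find the generated uniqueKey. The tracking_id field name, the value and the file name are made up for illustration:

curl "http://localhost:8983/solr/update/extract?literal.tracking_id=abc123&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@somefile.pdf"

http://localhost:8983/solr/select?q=tracking_id:abc123&fl=id

The second request returns the signature-based id that the dedupe chain assigned, at the cost of a second round trip.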
Re: Solr Cell and Deduplication - Get ID of doc
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler curl 'http://localhost:8983/solr/update/extract?literal.id=doc1commit=true' -F myfi...@tutorial.html This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with the id field (the uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not rely on the ExtractingRequestHandler to create a unique key for you. This command throws away that generated key. On Mon, Mar 1, 2010 at 4:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : You could create your own unique ID and pass it in with the : literal.field=value feature. By which Lance means you could specify an unique value in a differnet field from yoru uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will only work until some other version of the doc (with some other value) overwrites it. so you'd esentially have to query for the field:value to lookup the uniqueKey. it seems like it should definitely be feasible for the Update RequestHandlers to return the uniqueKeyField values for all the added docs (regardless of wether the key was included in the request, or added by an UpdateProcessor -- but i'm not sure how that would fit in with the SolrJ API. would you mind opening a feature request in Jira? -Hoss -- Lance Norskog goks...@gmail.com
Re: Solr Cell and Deduplication - Get ID of doc
: To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after adding a doc. Using a unique literal.field value will work -- but only as the value of a secondary field that he can then query on to get the uniqueKeyField value. : : You could create your own unique ID and pass it in with the : : literal.field=value feature. : : By which Lance means you could specify an unique value in a differnet : field from yoru uniqueKey field, and then query on that field:value pair : to get the doc after it's been added -- but that query will only work : until some other version of the doc (with some other value) overwrites it. : so you'd esentially have to query for the field:value to lookup the : uniqueKey. : : it seems like it should definitely be feasible for the : Update RequestHandlers to return the uniqueKeyField values for all the : added docs (regardless of wether the key was included in the request, or : added by an UpdateProcessor -- but i'm not sure how that would fit in with : the SolrJ API. : : would you mind opening a feature request in Jira? : : : : -Hoss : : : : : : -- : Lance Norskog : goks...@gmail.com : -Hoss
Re: Solr Cell and Deduplication - Get ID of doc
Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unreliable. My only option is to somehow return the id in the XML response. Any guidance is greatly appreciated. -Bill On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote: Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldid/str bool name=overwriteDupestrue/bool str name=fieldsattr_content/str str name=signatureClassorg.apache.solr.update.processor./str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain How do I get the id value post Solr processing. Is there someway to modify the curl response so that id is returned. I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name but I would like to do only one request. curl ' http://localhost:8080/solr/update/extract?uprefix=attr_fmap.content=attr_contentcommit=true' -F myfi...@myfile.pdf Thanks, Bill
Re: Solr Cell and Deduplication - Get ID of doc
You could create your own unique ID and pass it in with the literal.field=value feature. http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle billengle...@gmail.com wrote: Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unreliable. My only option is to somehow return the id in the XML response. Any guidance is greatly appreciated. -Bill On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote: Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldid/str bool name=overwriteDupestrue/bool str name=fieldsattr_content/str str name=signatureClassorg.apache.solr.update.processor./str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain How do I get the id value post Solr processing. Is there someway to modify the curl response so that id is returned. I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name but I would like to do only one request. curl ' http://localhost:8080/solr/update/extract?uprefix=attr_fmap.content=attr_contentcommit=true' -F myfi...@myfile.pdf Thanks, Bill -- Lance Norskog goks...@gmail.com
Solr Cell and Deduplication - Get ID of doc
Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">attr_content</str>
    <str name="signatureClass">org.apache.solr.update.processor.</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How do I get the id value after Solr processing? Is there some way to modify the curl response so that the id is returned? I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name, but I would like to do only one request.

curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true' -F myfi...@myfile.pdf

Thanks, Bill
Re: Deduplication in 1.4
Field collapsing has been used by many in their production environment. The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Martijn 2009/11/26 KaktuChakarabati jimmoe...@gmail.com: Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati jimmoe...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Hi Martijn, - Original Message From: Martijn v Groningen martijn.is.h...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 26, 2009 3:19:40 AM Subject: Re: Deduplication in 1.4 Field collapsing has been used by many in their production environment. Got any pointers to public sites you know use it? I know of a high traffic site that used an early version, and it caused performance problems. Is double-tripping still required? The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and Is it also full distributed-search-ready? I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Thanks, Otis Martijn 2009/11/26 KaktuChakarabati : Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Two sites that use field-collapsing: 1) www.ilocal.nl 2) www.welke.nl I'm not sure what you mean with double-tripping? The sites mentioned do not have performance problems that are caused by field collapsing. Field-collapsing currently only supports quasi distributed field-collapsing (as I have described on the Solr wiki). Currently I don't know a distributed field-collapsing algorithm that works properly and does not influence the search time in such a way that the search becomes slow. Martijn 2009/11/26 Otis Gospodnetic otis_gospodne...@yahoo.com: Hi Martijn, - Original Message From: Martijn v Groningen martijn.is.h...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 26, 2009 3:19:40 AM Subject: Re: Deduplication in 1.4 Field collapsing has been used by many in their production environment. Got any pointers to public sites you know use it? I know of a high traffic site that used an early version, and it caused performance problems. Is double-tripping still required? The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and Is it also full distributed-search-ready? I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Thanks, Otis Martijn 2009/11/26 KaktuChakarabati : Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Hey Otis,

Yep, I realized this myself after playing with the dedupe feature yesterday. So it does look like field collapsing is pretty much what I need. Any idea how close it is to being production-ready?

Thanks,
-Chak
Deduplication in 1.4
Hey,

I've been trying to find some documentation on using this feature in 1.4, but the wiki page is a little sparse.

Specifically, here's what I'm trying to do: I have a field, say 'duplicate_group_id', that I'll populate based on some offline document-deduplication process I have. All I want is for Solr to compute a 'duplicate_signature' field based on this one at update time, so that when I search for documents later, all documents with the same original 'duplicate_group_id' value will be rolled up (i.e. I'll just get the first one that came back according to relevancy).

I enabled the deduplication processor and put it into the update handler, but I'm not seeing any difference in the returned results (results with the same duplicate_group_id are returned separately). Is there anything I need to supply at query time for this to take effect? What should the behaviour be? Is there any working example of this?

Anything will be helpful.

Thanks,
Chak
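For reference, the kind of update chain Chak describes would be declared in solrconfig.xml roughly as below - a sketch modeled on the Deduplication wiki example, using the field names from his message. Note that, as the replies point out, this only computes a signature at index time (and, with overwriteDupes set to true, deletes duplicates at index time); it never rolls results up at query time.

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">duplicate_signature</str>
      <!-- true would delete older docs carrying the same signature at index time -->
      <bool name="overwriteDupes">false</bool>
      <str name="fields">duplicate_group_id</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>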
Re: Deduplication in 1.4
Hi,

As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it, in order to avoid duplicates in the index in the first place. What you are describing is closer to the field collapsing patch in SOLR-236.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: KaktuChakarabati jimmoe...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, November 24, 2009 5:29:00 PM
Subject: Deduplication in 1.4
Conditional deduplication
If I index a bunch of email documents, is there a way to say "show me all email documents, but only one per To: email address", so that if there are a total of 10 distinct To: values in the corpus, I get back 10 email documents?

I'm aware of http://wiki.apache.org/solr/Deduplication but I want to retain the ability to search across all of my email documents most of the time, and only occasionally search for the distinct ones. Essentially I want to do a SELECT DISTINCT to_field FROM documents, where a normal search is a SELECT * FROM documents.

Thanks for any pointers.
Re: Conditional deduplication
See http://wiki.apache.org/solr/FieldCollapsing

On Wed, Sep 30, 2009 at 4:41 PM, Michael solrco...@gmail.com wrote:
> If I index a bunch of email documents, is there a way to say "show me all
> email documents, but only one per To: email address"?
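For readers arriving later: the SOLR-236 field-collapsing work evolved into the result grouping that ships with later Solr releases, and the "one email per To: address" behaviour can be sketched as a single query. The field name to and the host are assumptions.

  curl "http://localhost:8983/solr/select?q=*:*&group=true&group.field=to&group.limit=1"

Each group then returns only its single most relevant document - effectively the SELECT DISTINCT behaviour described above - while a plain query without the group parameters still searches every document.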
Re: stress tests to DIH and deduplication patch
I have already run out of memory after a cron job that indexes as many times as possible during a day. Will activate GC logging to see what it says...

Thanks!
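GC logging on a JVM of that vintage only needs a few startup flags; a minimal sketch for a Tomcat deployment, where the log path is an assumption:

  # e.g. in Tomcat's setenv.sh
  CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/solr-gc.log"
  export CATALINA_OPTS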
stress tests to DIH and deduplication patch
Hey there,

I am doing some stress tests indexing with DIH. I am indexing a MySQL DB with 140 rows approx. I am also using the DeDuplication patch. I am using Tomcat with a JVM limit of -Xms2000M -Xmx2000M.

I have indexed 3 times using the full-import command without restarting Tomcat or reloading the core between the indexations. I have used jmap and jhat to map heap memory at some moments of the indexations. Here I show the beginning of the maps (I don't show the lower part of the histogram because the object instance numbers are completely stable there). I have noticed that the number of Term, TermInfo and TermQuery instances grows from one indexation to the next... is that normal?

FIRST TIME I INDEX... WITH A MILLION INDEXED DOCS APPROX... HERE INDEXING PROCESS IS STILL RUNNING
268290 instances of class org.apache.lucene.index.Term
215943 instances of class org.apache.lucene.index.TermInfo
129649 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
51537 instances of class org.apache.lucene.search.TermQuery
25457 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1120 instances of class org.apache.lucene.index.FieldInfo
919 instances of class org.apache.catalina.loader.ResourceEntry

FIRST TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
552522 instances of class org.apache.lucene.index.Term
505835 instances of class org.apache.lucene.index.TermInfo
128937 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
48645 instances of class org.apache.lucene.search.TermQuery
24065 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1470 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List

SECOND TIME I INDEX WITH 50 INDEXED DOCS... HERE INDEX PROCESS IS STILL RUNNING
264617 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
262496 instances of class org.apache.lucene.index.Term
116078 instances of class org.apache.lucene.index.TermInfo
53383 instances of class org.apache.lucene.search.TermQuery
42274 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
30230 instances of class org.apache.lucene.search.TermQuery$TermWeight
26044 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15115 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
15115 instances of class org.apache.lucene.search.ReqExclScorer
7325 instances of class org.apache.lucene.search.ConjunctionScorer$1
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1279 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry

SECOND TIME I INDEX WITH 120 INDEXED DOCS... HERE INDEX PROCESS IS STILL RUNNING
574603 instances of class org.apache.lucene.index.Term
423558 instances of class org.apache.lucene.index.TermInfo
141394 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
106729 instances of class org.apache.lucene.search.TermQuery
54858 instances of class org.apache.lucene.index.BufferedDeletes$Num
25347 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
11587 instances of class org.apache.lucene.search.TermQuery$TermWeight
5793 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
5793 instances of class org.apache.lucene.search.ReqExclScorer
2922 instances of class org.apache.lucene.search.ConjunctionScorer$1
2170 instances of class org.apache.lucene.index.FieldInfo
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List

SECOND TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
999753 instances of class org.apache.lucene.index.Term
808190 instances of class org.apache.lucene.index.TermInfo
156511 instances of class org.apache.lucene.search.TermQuery
128975 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
104396 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15401 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
14896 instances of class org.apache.lucene.search.TermQuery$TermWeight
7447 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
7447 instances of class org.apache.lucene.search.ReqExclScorer
3025 instances of class
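For completeness, histograms like the ones above come straight from the JDK tools mentioned in the message; a minimal sketch, where the pid and file names are assumptions:

  jmap -histo 12345 | head -40                   # per-class instance counts for the running Tomcat JVM (pid 12345)
  jmap -dump:format=b,file=solr-heap.bin 12345   # full binary heap dump
  jhat -port 7000 solr-heap.bin                  # browse the dump at http://localhost:7000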
Re: stress tests to DIH and deduplication patch
On Wed, Apr 29, 2009 at 7:44 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
> I have noticed that the number of Term, TermInfo and TermQuery instances
> grows from one indexation to the next... is that normal?

Perhaps you should enable GC logging as well. Also, did you actually run out of memory, or are you extrapolating and assuming that it might happen?

--
Regards,
Shalin Shekhar Mangar.
Re: Deduplication patch not working in nightly build
I've seen similar errors when large background merges happen while looping in a result set. See
http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Deduplication patch not working in nightly build
Hey there,

I have been stuck on this problem for 3 days now and have no idea how to sort it. I am using the nightly from a week ago, MySQL, and this driver and url:

driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use the deduplication patch with indexes of 200.000 docs with no problem. When I try a full-import with a db of 1.500.000 it stops indexing at doc number 15.000 approx, showing me the error posted above. Once I get the exception, I restart Tomcat and start a delta-import... this time everything works fine! I need to avoid this error in the full-import. I have tried:

url=jdbc:mysql://localhost/my_db?autoReconnect=true

to sort it, in case the connection was closed because a long time passed until the next doc was indexed, but nothing changed... I keep having this:

Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)

** END NESTED EXCEPTION **

Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)

Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289
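A side note for anyone hitting the same wall: the data source Marc describes lives in DIH's data-config.xml, and with MySQL a common variation is to let the driver stream the big result set instead of buffering it. A rough sketch - batchSize="-1" is DIH's way of requesting the driver's streaming fetch size, and the database name and credentials are placeholders:

  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/my_db?autoReconnect=true"
              batchSize="-1"
              user="solr" password="secret"/>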
Re: Deduplication patch not working in nightly build
I can't imagine why dedupe would have anything to do with this, other than what was said: it perhaps takes a bit longer to get a document to the db, and it times out (maybe a long signature calculation?). Have you tried changing your MySQL settings to allow for a longer timeout? (Sorry, I'm not up to date on what you have tried.) Also, are you using autocommit during the import? If so, you might try turning it off for the full import.

- Mark
Re: Deduplication patch not working in nightly build
Hey there,

I didn't have autoCommit set to true, but I have it sorted! The error stopped appearing after setting the property maxBufferedDocs in solrconfig.xml. I can't exactly understand why, but it just worked. Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same?

Thanks
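For context, both settings live in the indexing section of a 1.x solrconfig.xml; a sketch of the combination Marc ends up with (the values are the ones he reports later in the thread):

  <indexDefaults>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <!-- deprecated in favour of ramBufferSizeMB, but forces smaller, more frequent flushes -->
    <maxBufferedDocs>50</maxBufferedDocs>
  </indexDefaults>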
Re: Deduplication patch not working in nightly build
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
> The error stopped appearing after setting the property maxBufferedDocs in
> solrconfig.xml. I can't exactly understand why, but it just worked.

What I find strange is this line in the exception: "Last packet sent to the server was 202481 ms ago." Something took very, very long to complete, and the connection got closed by the time the next row was fetched from the open resultset.

Just curious, what was the previous value of maxBufferedDocs and what did you change it to?

--
Regards,
Shalin Shekhar Mangar.
Re: Deduplication patch not working in nightly build
Hey Shalin,

In the beginning (when the error was appearing) I had <ramBufferSizeMB>32</ramBufferSizeMB> and no maxBufferedDocs set. Now I have:

<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that by setting maxBufferedDocs to 50 I am forcing more disk writing than I would like... but at least it works fine now (just a bit slower, obviously). I keep saying that the weirdest thing is that I don't have this problem using Solr 1.3, just with the nightly... Even though it's good that it works well now, it would be great if someone could explain why this is happening.
Re: Deduplication patch not working in nightly build
You're basically writing segments more often now, and somehow avoiding a longer merge, I think. Also, deduplication is probably adding enough extra data to your index to hit a sweet spot where a merge takes too long. Or something to that effect - MySQL is especially sensitive to timeouts when doing a select * on a huge db, in my testing.

I didn't understand your answer on the autocommit - I take it you are using it? Or no?

All a guess, but it definitely points to a merge taking a bit long and causing a timeout. I think you can relax the MySQL timeout settings if that is it. I'd like to get to the bottom of this as well, so any other info you can provide would be great.

- Mark
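For readers searching for the same symptom: the timeout knobs alluded to here sit on both sides of the JDBC connection. A rough sketch, not verified against this exact setup - the 600-second values are placeholders, and netTimeoutForStreamingResults requires a reasonably recent Connector/J:

  -- on the MySQL server
  SET GLOBAL net_write_timeout = 600;
  SET GLOBAL net_read_timeout = 600;

and/or, on the driver side, appending a parameter to the DIH JDBC url:

  url="jdbc:mysql://localhost/my_db?netTimeoutForStreamingResults=600"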
Re: Deduplication patch not working in nightly build
Hey Mark,

Sorry I was not specific enough; I meant that I have, and always had, autoCommit=false. I will do some more traces and tests. Will post if I have anything new and important to mention.

Thanks.
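(autoCommit here presumably refers to Solr's auto-commit block in solrconfig.xml rather than JDBC autocommit; a sketch of that block left disabled, with placeholder values:)

  <updateHandler class="solr.DirectUpdateHandler2">
    <!--
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
  </updateHandler>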
Deduplication patch not working in nightly build
Hey there,

I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I have upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:

WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)

** END NESTED EXCEPTION **

Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)

Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
at com.mysql.jdbc.ResultSet.realClose(ResultSet.java:6488)
at com.mysql.jdbc.ResultSet.close(ResultSet.java:736)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.close(JdbcDataSource.java:312
Re: Deduplication patch not working in nightly build
Thanks, I will have a look at my JdbcDataSource. Anyway, it's weird, because using the 1.3 release I don't have that problem...

Shalin Shekhar Mangar wrote:
> Yes, initially I figured that we are accidentally re-using a closed data
> source. But Noble has pinned it right. I guess you can try looking into
> your JDBC driver's documentation for a setting which increases the
> connection alive-ness.
Re: Deduplication patch not working in nightly build
Yeah, looks like it, but... if I don't use the DeDuplication patch everything works perfectly. I can create my indexes using full-import and delta-import without problems. The JdbcDataSource of the nightly is pretty similar to the 1.3 release's... The DeDuplication patch doesn't touch the dataimporthandler classes... that's why I thought the problem was not there (but I can't say it for sure...). I was thinking that the problem has something to do with the UpdateRequestProcessorChain, but I don't know how this part of the source works... I am really interested in updating to the nightly build, as I think the new facet algorithm and SolrDeletionPolicy are really great stuff!

> Marc, I've just committed a fix which may have caused the bug. Can you use
> svn trunk (or the next nightly build) and confirm?

You mean the last nightly build?

Thanks
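For anyone wondering how the UpdateRequestProcessorChain comes into play: the chain is declared in solrconfig.xml and the /dataimport handler is pointed at it by name. A rough sketch - the chain and handler names are assumptions, and the selecting parameter was called update.processor in the 1.3/1.4 era (renamed update.chain in later releases):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>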
Re: Deduplication patch not working in nightly build
Looks like a bug w/ DIH with the recent fixes.
--Noble
Re: Deduplication patch not working in nightly build
I guess the indexing of a doc is taking too long (may be because of the de-dup patch) and the resultset gets closed automaticallly (timed out) --Noble On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Donig this fix I get the same error :( I am going to try to set up the last nigthly build... let's see if I have better luck. The thing is it stop indexing at the doc num 150.000 aprox... and give me that mysql exception error... Without DeDuplication patch I can index 2 milion docs without problems... I am pretty lost with this... :( Shalin Shekhar Mangar wrote: Yes I meant the 05/01/2008 build. The fix is a one line change Add the following as the last line of DataConfig.Entity.clearCache() dataSrc = null; On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese marc.sturl...@gmail.comwrote: Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one works? If the fix you did is not really big can u tell me where in the source is and what is it for? (I have been debuging and tracing a lot the dataimporthandler source and I I would like to know what the imporovement is about if it is not a problem...) Thanks! Shalin Shekhar Mangar wrote: Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm? On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote: looks like a bug w/ DIH with the recent fixes. --Noble On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I was using the Deduplication patch with Solr 1.3 release and everything was working perfectly. Now I upgraded to a nigthly build (20th december) to be able to use new facet algorithm and other stuff and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. 
I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
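For context on Noble's diagnosis: "Last packet sent to the server was 202481 ms ago" is roughly 200 seconds, which fits the picture of the MySQL server giving up on a streaming result set that Solr stopped reading while it was busy indexing (the server-side net_write_timeout defaults to 60 seconds). Below is a minimal data-config.xml dataSource sketch of the kind of mitigation sometimes used for this; the URL, credentials and timeout values are placeholders, and netTimeoutForStreamingResults/autoReconnect are Connector/J options whose availability depends on the driver version.

<!-- data-config.xml (sketch): batchSize="-1" asks DIH to stream rows from MySQL
     instead of buffering the whole result set; the URL parameters raise the
     network timeouts so a slow indexing pass does not get the connection cut. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?netTimeoutForStreamingResults=3600&amp;autoReconnect=true"
            user="solr"
            password="secret"
            batchSize="-1"/>

Raising net_write_timeout on the MySQL server itself is the equivalent server-side knob.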
Re: Deduplication patch not working in nightly build
Shalin, you mean I should test the 05/01/2008 nightly? Maybe with this one it works? If the fix you did is not really big, can you tell me where in the source it is and what it is for? (I have been debugging and tracing the DataImportHandler source a lot and I would like to know what the improvement is about, if that is not a problem...) Thanks!

Shalin Shekhar Mangar wrote:
Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:
Looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
Hey there, I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
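For readers trying to reproduce the setup being discussed: the Deduplication patch Marc applied is essentially a SignatureUpdateProcessorFactory wired into an update processor chain in solrconfig.xml. Below is a minimal sketch, assuming placeholder field names (name, features) and a schema field called signatureField; the exact parameters may differ from the version of the patch used in this thread.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the computed hash is stored here; with overwriteDupes=true,
         documents that produce the same signature overwrite each other -->
    <str name="signatureField">signatureField</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>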
Re: Deduplication patch not working in nightly build
Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:
Looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
Hey there, I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
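For completeness, documents built by DIH only pass through that chain if the /dataimport handler points at it. A sketch of that wiring follows; the handler name and config file name are placeholders, and on some 1.3-era builds the parameter was spelled update.processor rather than update.chain.

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- route documents produced by full-import/delta-import through the dedupe chain above -->
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>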