Re: deduplication of suggester results is not enough
Hi Roland, I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at index time. I will publish it on GitHub in a few days and will write to this thread when it is done. m.

On Thursday, 26 March 2020 16:01:57 CET, Szűcs Roland wrote:
> Hi All,
>
> I have been following the suggester-related discussions for quite a while. Everybody agrees that it is not the expected behaviour for a suggester, where the terms are the entities rather than the documents, to return the same string representation several times.
>
> One suggestion was to do the deduplication on the client side of Solr. That is very easy in most client solutions, as any set-based data structure solves it.
>
> But one important problem is not solved by deduplication: suggest.count.
>
> If I have 15 matches from the suggester and suggest.count=10, and the first 9 matches are the same, I will get back only 2 suggestions after deduplication and the remaining 5 unique terms will never be shown.
>
> What is the solution for this?
>
> Cheers,
> Roland
deduplication of suggester results is not enough
Hi All,

I have been following the suggester-related discussions for quite a while. Everybody agrees that it is not the expected behaviour for a suggester, where the terms are the entities rather than the documents, to return the same string representation several times.

One suggestion was to do the deduplication on the client side of Solr. That is very easy in most client solutions, as any set-based data structure solves it.

But one important problem is not solved by deduplication: suggest.count.

If I have 15 matches from the suggester and suggest.count=10, and the first 9 matches are the same, I will get back only 2 suggestions after deduplication and the remaining 5 unique terms will never be shown.

What is the solution for this?

Cheers,
Roland
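[Editor's note] A common client-side workaround for the problem described above (it does not change how suggest.count is applied on the server) is to over-fetch: request noticeably more suggestions than you intend to show, deduplicate, then truncate. Below is a minimal, hypothetical Java sketch of only the client-side part; it assumes the raw suggestion strings were already fetched with an inflated suggest.count and is not a Solr API.

    // Hypothetical client-side helper: deduplicate over-fetched suggester results,
    // then truncate to the number of suggestions the UI actually wants to display.
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;

    public class SuggestionDeduper {
        public static List<String> dedupe(List<String> rawSuggestions, int wanted) {
            // LinkedHashSet drops duplicates while keeping the first-seen order
            LinkedHashSet<String> unique = new LinkedHashSet<>(rawSuggestions);
            List<String> result = new ArrayList<>();
            for (String term : unique) {
                if (result.size() >= wanted) {
                    break;
                }
                result.add(term);
            }
            return result;
        }
    }

Over-fetching only helps when the duplication factor is bounded; if a single term can fill the whole over-fetched window, unique terms can still be lost, which is exactly the limitation Roland points out.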
Atomic update deletes deduplication signature
Hello,

I am having trouble when doing atomic updates in combination with SignatureUpdateProcessorFactory (on Solr 7.2).

Normal commits of new documents work as expected and generate a valid signature:

curl "$URL/update?commit=true" -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "TEST_ID1", "description": "description", "country": "country"}}}' && curl "$URL/select?q=id:TEST_ID1"

"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"TEST_ID1",
    "description":["description"],
    "country":["country"],
    "_signature":"e577e465b9099ba8",   <-- valid signature
    "_version_":1608322850016460800}]
}}

However, when updating a field (that is not used for generating the signature) the signature is replaced by "":

curl "$URL/update?commit=true" -H 'Content-type:application/json' -d '{"add":{"doc":{"id": "TEST_ID1", "country": {"set": "country2"}}}}' && curl "$URL/select?q=id:TEST_ID1"

"response":{"numFound":1,"start":0,"docs":[
  {
    "id":"TEST_ID1",
    "description":["description"],
    "country":["country2"],
    "_signature":"",   <-- broken signature
    "_version_":1608322857485467648}]
}}

This looks a lot like the second problem mentioned in an old Solr JIRA issue ([1]). Unfortunately, there is no relevant response in the discussion there. Any ideas how to fix this?

Thank you,
Thomas

solrconfig.xml:
[...]
  <bool name="enabled">true</bool>
  <str name="signatureField">_signature</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">description</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>

[1] https://issues.apache.org/jira/browse/SOLR-4016
RE: Solr Cloud: query elevation + deduplication?
Hi,

I would not use the ID (uniqueKey) as the signature field; query elevation would never work properly with such a setup: change a document's content and it'll get a new ID. If I remember correctly this factory still deletes duplicates if signatureField is not the uniqueKey.

Regarding SOLR-3473, nobody seems to be working on that.

Regards,
Markus

-Original message-
> From:Ronja Koistinen <ronja.koisti...@helsinki.fi>
> Sent: Monday 5th March 2018 15:32
> To: solr-user@lucene.apache.org
> Subject: Solr Cloud: query elevation + deduplication?
>
> Hello,
>
> I am running Solr Cloud 6.6.2 and trying to get query elevation and
> deduplication (with SignatureUpdateProcessor) working at the same time.
>
> The documentation for deduplication
> (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not
> specify whether the signatureField needs to be the uniqueKey field configured
> in my schema.xml. Currently I have my uniqueKey set to the field
> containing the url of my documents.
>
> The query elevation seems to reference documents by the uniqueKey in the
> "id" attributes listed in elevate.xml, so having the uniqueKey be the
> url would be beneficial to my process of maintaining the query elevation
> list.
>
> Also, what is the status of this issue I found?
> https://issues.apache.org/jira/browse/SOLR-3473
>
> --
> Ronja Koistinen
> University of Helsinki
Solr Cloud: query elevation + deduplication?
Hello,

I am running Solr Cloud 6.6.2 and trying to get query elevation and deduplication (with SignatureUpdateProcessor) working at the same time.

The documentation for deduplication (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not specify whether the signatureField needs to be the uniqueKey field configured in my schema.xml. Currently I have my uniqueKey set to the field containing the url of my documents.

The query elevation seems to reference documents by the uniqueKey in the "id" attributes listed in elevate.xml, so having the uniqueKey be the url would be beneficial to my process of maintaining the query elevation list.

Also, what is the status of this issue I found? https://issues.apache.org/jira/browse/SOLR-3473

--
Ronja Koistinen
University of Helsinki
Re: Deduplication
Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard? That would pose some potential problems. Would a custom update processor make the solution cloud-safe? Thx, - Bram
Re: Deduplication
On 19/05/15 14:47, Alessandro Benedetti wrote:
> Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which is what the very first part of your mail seemed to describe) here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after. I essentially want a unique constraint on an arbitrary field, without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.

 - Bram
Re: Deduplication
What the Solr de-duplication offers you is to calculate a hash for each input document (based on a set of fields). You can then select two options:
- index everything (documents with the same signature are simply considered equal)
- avoid the overwriting of duplicates

How the similarity hash is calculated is something you can play with and customise if needed. Having clarified that, do you think it can fit in some way, or are you definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:
> On 19/05/15 14:47, Alessandro Benedetti wrote:
>> Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you want de-duplication (which is what the very first part of your mail seemed to describe) here you can find a lot of info:
>
> Not sure whether de-duplication is the right word for what I'm after. I essentially want a unique constraint on an arbitrary field, without overwrite semantics, because I want Solr to tell me if a duplicate is sent to Solr. I was thinking that the de-duplication feature could accomplish this somehow.
>
>  - Bram

--
--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
Re: Deduplication
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote: Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code to actually do the indexing. This sounded like the perfect option ... until I read Jack's comment: My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard? That would pose some potential problems. Would a custom update processor make the solution cloud-safe? Starting with Solr 5.1, you have the ability to specify an update processor on the fly to requests and you can even control whether it is to be executed before any distribution happens or before it is actually indexed on the replica. e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to have processor xyz run first and then MyCustomUpdateProc and then the default update processor chain (which will also distribute the doc to the leader or from the leader to a replica). This also means that such processors will not be executed on the replicas at all. You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and MyCustomUpdateProc to run on each replica (including the leader) right before the doc is indexed (i.e. just before RunUpdateProcessor) Unfortunately, due to an oversight, this feature hasn't been documented well which is something I'll fix. See https://issues.apache.org/jira/browse/SOLR-6892 for more details. Thx, - Bram -- Regards, Shalin Shekhar Mangar.
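[Editor's note] For illustration, here is a hedged SolrJ sketch of what Shalin describes: attaching update processors per request via the processor / post-processor parameters. The processor name is made up for the example, and exact behaviour depends on SOLR-6892 and the Solr version in use.

    // Sketch (untested) of per-request update processors as described above.
    // "MyCustomUpdateProc" is a hypothetical processor registered in solrconfig.xml.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class PerRequestProcessorExample {
        public static void index(SolrClient client) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            UpdateRequest req = new UpdateRequest();
            req.add(doc);
            // Run only on the node that receives the request, before any distribution:
            req.setParam("processor", "MyCustomUpdateProc");
            // Or run on every replica (including the leader) just before RunUpdateProcessor:
            // req.setParam("post-processor", "MyCustomUpdateProc");
            req.process(client);
            client.commit();
        }
    }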
Deduplication
Hi folks,

I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values.

Am I missing an obvious (or less obvious) way of accomplishing this?

Thanks,
 - Bram
Re: Deduplication
Hi Bram, what do you mean with: "I would like to provide the unique value myself, without having the deduplicator create a hash of field values"? This is not de-duplication, but simple document filtering based on a constraint. In the case you do want de-duplication (which is what the very first part of your mail seemed to describe) you can find a lot of info here:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know for more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam bram.van...@intix.eu:
> Hi folks,
>
> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values.
>
> Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Thanks,
>  - Bram

--
--
Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England
Re: Deduplication
Shawn, I was going to say the same thing, but... then I was thinking about SolrCloud and the fact that update processors are invoked before the document is sent to its target node, so there wouldn't be a reliable way to tell whether the input document's field value exists on the target node rather than the current node. Or does the update processing only occur on the leader node after being forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end of the chain, so that running of user update processors occurs before the distribution step, but is that distribution to the leader, or distribution from leader to replicas for a shard?

-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey apa...@elyograg.org wrote:
> On 5/19/2015 3:02 AM, Bram Van Dam wrote:
>> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values. Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing. A script update processor included with Solr allows you to write your processor in a language other than Java, such as JavaScript.
>
> https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>
> Here's how to discard a document in an update processor written in Java:
> http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor
>
> The javadoc that I linked above describes the ability to return false in other languages to discard the document.
>
> Thanks,
> Shawn
Re: Deduplication
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> I'm looking for a way to have Solr reject documents if a certain field value is duplicated (reject, not overwrite). There doesn't seem to be any kind of unique option in schema fields. The de-duplication feature seems to make this (somewhat) possible, but I would like to provide the unique value myself, without having the deduplicator create a hash of field values. Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain. You will then have the ability to do anything you want with the entire input document before it hits the code that actually does the indexing. A script update processor included with Solr allows you to write your processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:
http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return false in other languages to discard the document.

Thanks,
Shawn
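[Editor's note] To make Shawn's suggestion concrete, here is a rough, hedged Java sketch of a duplicate-rejecting update processor. The field name "uniq_s" is hypothetical, the existence check only sees documents visible to the current searcher (committed or soft-committed), it is subject to races between concurrent indexing threads, and it ignores SolrCloud distribution (see the follow-ups in this thread). Instead of throwing, the processor could simply return without calling super.processAdd() to silently discard the document, as the linked Stack Overflow answer does.

    // Rough single-node sketch: reject a document when another document already has
    // the same value in the hypothetical field "uniq_s". Not SolrCloud-aware.
    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class RejectDuplicateProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object value = doc.getFieldValue("uniq_s");
                    if (value != null) {
                        // Only documents visible to the current searcher are checked.
                        int docId = req.getSearcher().getFirstMatch(new Term("uniq_s", value.toString()));
                        if (docId >= 0) {
                            throw new SolrException(SolrException.ErrorCode.CONFLICT,
                                "Duplicate value for uniq_s: " + value);
                        }
                    }
                    super.processAdd(cmd); // no duplicate found, continue the chain
                }
            };
        }
    }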
any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?
For example, consider a new, larger department formed by merging three departments. A few employees worked for two or three of the departments before the merge, which means the attributes of one person might be listed in different departments' databases. An additional problem is that one person can have different first names or nicknames. The attributes of a person include first name, last name, email, home phone, cell phone, ssn, address, etc. Because some of these values may be empty, there is no unique primary key. Hence, we need an intelligent solution for the classification, and a way to put weights on different matching rules.

Any tips for handling such fast, runtime deduplication tasks for big data (about 100 million records)? Any open-source project working on this?
Re: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?
See: https://cwiki.apache.org/confluence/display/solr/De-Duplication -- Jack Krupansky -Original Message- From: Mobius ReX Sent: Monday, March 17, 2014 1:59 PM To: solr-user@lucene.apache.org Subject: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene? For example, given a new big department merged from three departments. A few employees worked for two or three departments before merging. That means, the attributes of one person might be listed under different departments' databases. One additional problem is that one person can have different first names or nick names. These attributes of a person include first name, last name, email, home phone, cell phone, ssn, address, etc ... Because some values of the above could be empty, there is no unique primary key. Hence, we need an intelligent solution for the classification, and to put weights for different matching rules. Any tips to handle such runtime fast deduplication tasks for big data (about 100 million records)? Any open-source project working on this?
Re: Newbie question on Deduplication overWriteDupes flag
: How do I achieve, add if not there, fail if duplicate is found. I though

You can use the optimistic concurrency features to do this, by including a _version_=-1 field value in the document. This will instruct Solr that the update should only be processed if the document does not already exist...

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

-Hoss
http://www.lucidworks.com/
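[Editor's note] For illustration, a minimal SolrJ sketch of the pattern Hoss describes. Field names and the error handling are assumptions; depending on the client and version the conflict may surface as a subclass of SolrException or wrapped differently.

    // Add a document only if no document with this id exists yet, using _version_=-1.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class AddIfAbsent {
        public static boolean addIfAbsent(SolrClient client, String id) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("title_s", "example title");   // example payload field
            doc.addField("_version_", -1L);              // fail if the id already exists
            try {
                client.add(doc);
                client.commit();
                return true;
            } catch (SolrException e) {
                // 409 Conflict is expected when a document with this id already exists
                if (e.code() == SolrException.ErrorCode.CONFLICT.code) {
                    return false;
                }
                throw e;
            }
        }
    }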
Re: Newbie question on Deduplication overWriteDupes flag
A follow up question on this (as it is kind of new functionality). What happens if several documents are submitted and one of them fails due to that? Do they get rolled back or only one? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Feb 6, 2014 at 11:17 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : How do I achieve, add if not there, fail if duplicate is found. I though You can use the optimistic concurrency features to do this, by including a _version_=-1 field value in the document. this will instruct solr that the update should only be processed if the document does not already exist... https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents -Hoss http://www.lucidworks.com/
Newbie question on Deduplication overWriteDupes flag
I had a configuration where I had overwriteDupes=false. Result: I got duplicate documents in the index. When I changed to overwriteDupes=true, the duplicate documents started overwriting the older documents.

How do I achieve: add if not there, fail if a duplicate is found? I thought that overwriteDupes=false would do that.
Solr Deduplication use of overWriteDupes flag
Hello,

I had a configuration where I had overwriteDupes=false. I added a few duplicate documents. Result: I got duplicate documents in the index. When I changed to overwriteDupes=true, the duplicate documents started overwriting the older documents.

Question 1: How do I achieve [add if not there, fail if duplicate is found], i.e. mimic the behaviour of a DB which fails when trying to insert a record that violates some unique constraint? I thought that overwriteDupes=false would do that, but apparently not.

Question 2: Is there some documentation around overwriteDupes? I have checked the existing wiki; there is very little explanation of the flag there.

Thanks,
-Amit
Custom update handler with deduplication
Currently I have the following update request processor chain to prevent indexing very similar text items into a core dedicated to storing the queries that our users put into the web interface of our system:

<!-- Delete similar duplicated documents at index time, using some fuzzy text similarity techniques -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">textsuggest,textng</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Right now we are trying to implement a custom update request handler to keep track of how many times any given query hits our Solr server; in plain terms, we want to keep a field that counts how many times we have tried to insert the same query. We are using Solr 3.6, so how can we use (from the code of our custom update handler) the deduplication request processor to check whether the query we are trying to insert/update already exists?

Greetings!
Re: Custom update handler with deduplication
Firstly, I see that you have overwriteDupes=false in your configuration. This means that a signature will be generated but the similar documents will still be added to the index.

Now to your main question about counting duplicate attempts: one simple way is to have another UpdateRequestProcessor after the SignatureUpdateProcessor which keeps a map of signature to count. You can even keep this counter inside the Solr document itself, first reading the old counter value by querying the signatureField and then writing the new value into the new document. Be careful about race conditions if you're reading from the index, because indexing can happen in multiple threads.

On Mon, Dec 16, 2013 at 9:01 AM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote:
> Currently I have the following update request processor chain to prevent indexing very similar text items into a core dedicated to storing the queries that our users put into the web interface of our system:
>
> <!-- Delete similar duplicated documents at index time, using some fuzzy text similarity techniques -->
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">textsuggest,textng</str>
>     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
>
> Right now we are trying to implement a custom update request handler to keep track of how many times any given query hits our Solr server; in plain terms, we want to keep a field that counts how many times we have tried to insert the same query. We are using Solr 3.6, so how can we use (from the code of our custom update handler) the deduplication request processor to check whether the query we are trying to insert/update already exists?
>
> Greetings!

--
Regards,
Shalin Shekhar Mangar.
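[Editor's note] As a hedged illustration of the second option Shalin mentions (keeping the counter in the document), here is a rough Java sketch. It assumes the SignatureUpdateProcessor has already populated the "signature" field on the incoming document, uses a hypothetical stored field "hit_count", only sees committed documents, and ignores the race conditions Shalin warns about.

    // Rough sketch: read the previous hit_count for this signature from the index and
    // write an incremented value into the new document. Field names are examples.
    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class QueryHitCountProcessorFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                                  UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    SolrInputDocument doc = cmd.getSolrInputDocument();
                    Object sig = doc.getFieldValue("signature");
                    long count = 1;
                    if (sig != null) {
                        SolrIndexSearcher searcher = req.getSearcher();
                        int docId = searcher.getFirstMatch(new Term("signature", sig.toString()));
                        if (docId >= 0) {
                            String previous = searcher.doc(docId).get("hit_count");
                            if (previous != null) {
                                count = Long.parseLong(previous) + 1;
                            }
                        }
                    }
                    doc.setField("hit_count", count);
                    super.processAdd(cmd);
                }
            };
        }
    }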
Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
I use Solr 4.2.1 as SolrCloud. I crawl huge amounts of data with Nutch and index them with SolrCloud. I wonder about Solr's deduplication mechanism: what exactly does it do, and does it result in slow indexing, or is it beneficial for my situation?
RE: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
Distributed deduplication does not work right now: https://issues.apache.org/jira/browse/SOLR-3473

We've chosen not to use update processors for deduplication anymore and rely on several custom mapreduce jobs in Nutch and some custom collectors in Solr to do some on-demand online deduplication. If SOLR-3473 is fixed you can get very decent deduplication.

-Original message-
From:Furkan KAMACI furkankam...@gmail.com
Sent: Thu 02-May-2013 22:30
To: solr-user@lucene.apache.org
Subject: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

> I use Solr 4.2.1 as SolrCloud. I crawl huge amounts of data with Nutch and index them with SolrCloud. I wonder about Solr's deduplication mechanism: what exactly does it do, and does it result in slow indexing, or is it beneficial for my situation?
Deduplication in SolrCloud
Hi,

in my old Solr setup I used the deduplication feature in the update chain with a couple of fields:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">uuid,type,url,content_hash</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

This worked fine. When I now use this in my 2-shard SolrCloud setup and insert 150,000 documents, I always get an error:

INFO: end_commit_flush
Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread
        at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)

I am inserting the documents via CSV import and the curl command and split them into 50k chunks. Without the dedupe chain, the import finishes after 40 seconds. The curl command writes to one of my shards.

Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as the dedupe field could be an issue?

I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly; see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

Thanks & regards
Daniel
RE: Deduplication in SolrCloud
This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Daniel Brügge daniel.brue...@googlemail.com Sent: Fri 27-Jul-2012 17:38 To: solr-user@lucene.apache.org Subject: Deduplication in SolrCloud Hi, in my old Solr Setup I have used the deduplication feature in the update chain with couple of fields. updateRequestProcessorChain name=dedupe processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldsuuid,type,url,content_hash/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain This worked fine. When I now use this in my 2 shards SolrCloud setup when inserting 150.000 documents, I am always getting an error: *INFO: end_commit_flush* *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log* *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread* * at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456) * * at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284) * I am inserting the documents via CSV import and curl command and split them also into 50k chunks. Without the dedupe chain, the import finishes after 40secs. The curl command writes to one of my shards. Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as dedupe fields could be an issue? I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly? see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html Thanks regards Daniel
Re: Deduplication in SolrCloud
Should the old Signature code be removed? Given that the goal is to have everyone use SolrCloud, maybe this kind of landmine should be removed? On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma markus.jel...@openindex.io wrote: This issue doesn't really describe your problem but a more general problem of distributed deduplication: https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Daniel Brügge daniel.brue...@googlemail.com Sent: Fri 27-Jul-2012 17:38 To: solr-user@lucene.apache.org Subject: Deduplication in SolrCloud Hi, in my old Solr Setup I have used the deduplication feature in the update chain with couple of fields. updateRequestProcessorChain name=dedupe processor class=solr.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldsuuid,type,url,content_hash/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain This worked fine. When I now use this in my 2 shards SolrCloud setup when inserting 150.000 documents, I am always getting an error: *INFO: end_commit_flush* *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log* *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create new native thread* * at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456) * * at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284) * I am inserting the documents via CSV import and curl command and split them also into 50k chunks. Without the dedupe chain, the import finishes after 40secs. The curl command writes to one of my shards. Do you have an idea why this happens? Should I reduce the fields to one? I have read that not using the id as dedupe fields could be an issue? I have searched for deduplication with SolrCloud and I am wondering if it is already working correctly? see e.g. http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html Thanks regards Daniel -- Lance Norskog goks...@gmail.com
Deduplication in MLT
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same? *Pranav Prakash* temet nosce
RE: SolrCloud deduplication
Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
Hi again, It seemed to work fine but in the end duplicates are not overwritten. We first run the SignatureProcessor and then the DistributedProcessor. If we do it the other way around the digest field receives multiple values and throws errors. Is there anything else we can do or another patch to try? Thanks Markus -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Mon 21-May-2012 15:58 To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com Subject: RE: SolrCloud deduplication Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
https://issues.apache.org/jira/browse/SOLR-3473 -Original message- From:Mark Miller markrmil...@gmail.com Sent: Mon 21-May-2012 18:11 To: solr-user@lucene.apache.org Subject: Re: SolrCloud deduplication Looking again at the SignatureUpdateProcessor code, I think that indeed this won't currently work with distrib updates. Could you file a JIRA issue for that? The problem is that we convert update commands into solr documents - and that can cause a loss of info if an update proc modifies the update command. I think the reason that you see a multiple values error when you try the other order is because of the lack of a document clone (the other issue I mentioned a few emails back). Addressing that won't solve your issue though - we have to come up with a way to propagate the currently lost info on the update command. - Mark On May 21, 2012, at 10:39 AM, Markus Jelsma wrote: Hi again, It seemed to work fine but in the end duplicates are not overwritten. We first run the SignatureProcessor and then the DistributedProcessor. If we do it the other way around the digest field receives multiple values and throws errors. Is there anything else we can do or another patch to try? Thanks Markus -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Mon 21-May-2012 15:58 To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com Subject: RE: SolrCloud deduplication Hi, SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes the DistributedProcessor in the update chain. Thanks, Markus -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com - Mark Miller lucidimagination.com
SolrCloud deduplication
Hi,

Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it were a multi-valued field. If the field is not multi-valued you'll get this typical error. Changing the order of URPs in the chain does not solve the problem.

Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor, and does it need to be updated to work with SolrCloud?

Thanks,
Markus
Re: SolrCloud deduplication
Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
Hi, Interesting! I'm watching the issues and will test as soon as they are committed. Thanks! -Original message- From:Mark Miller markrmil...@gmail.com Sent: Fri 18-May-2012 16:05 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io Subject: Re: SolrCloud deduplication Hey Markus - When I ran into a similar issue with another update proc, I created https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to avoid this. I have not committed this yet though, in favor of waiting for https://issues.apache.org/jira/browse/SOLR-2822 Go vote? :) On May 18, 2012, at 7:49 AM, Markus Jelsma wrote: Hi, Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not functional anymore. The problem is that documents are passed multiple times through the URP and the digest field is added as if it is an multi valued field. If the field is not multi valued you'll get this typical error. Changing the order or URP's in the chain does not solve the problem. Any hints on how to resolve the issue? Is this a problem in the SignatureUpdateRequestProcessor and does it need to be updated to work with SolrCloud? Thanks, Markus - Mark Miller lucidimagination.com
RE: SolrCloud deduplication
: Interesting! I'm watching the issues and will test as soon as they are committed.

FWIW: it's a chicken-and-egg problem -- if you could test out the patch in SOLR-2822 with your real world use case / configs, and comment on its effectiveness, that would go a long way towards my confidence in it.

-Hoss
RE: SolrCloud deduplication
you're right. I'll test the patch as soon as possible. Thanks! -Original message- From:Chris Hostetter hossman_luc...@fucit.org Sent: Fri 18-May-2012 18:20 To: solr-user@lucene.apache.org Subject: RE: SolrCloud deduplication : Interesting! I'm watching the issues and will test as soon as they are committed. FWIW: it's a chicken and egg problem -- if you could test out the patch in SOLR-2822 with your real world use case / configs, and comment on it's effectiveness, that would go a long way towards my confidence in it. -Hoss
Re: null pointer error with solr deduplication
A better error would be nicer. In the past, when I have had docs with the same id on multiple shards, I never saw an NPE problem. A lot has changed since then though. I guess, to me, checking if the id is stored sticks out a bit more. Roughly based on the stacktrace, it looks to me like it's not finding an id value and that is causing the NPE. If it's a legit problem we should probably make a JIRA issue about improving the error message you end up getting. -- - Mark http://www.lucidimagination.com On Sat, Apr 21, 2012 at 5:21 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. 
Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: null pointer error with solr deduplication
Thanks for the response. Yes, I agree with you that I have to check for the uniqueness of doc ids but our requirement is such that we need to send it to solr and I know that solr discards duplicate documents and it does not work fine when we manually create the unique id. But I just wanted to report the error since in this scenario (i guess the components for deduplication are pretty new), it would probably help the devs to make the behavior more deterministic towards duplicate documents. On Sat, Apr 21, 2012 at 2:21 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. 
Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: null pointer error with solr deduplication
Hi I might be wrong but it's your responsibility to put unique doc IDs across shards. read this page http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations particualry - Documents must have a unique key and the unique key must be stored (stored=true in schema.xml) - *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic. So solr bahaves as it should :) _unexpectidly_ But I agree in that sence that there must be no error especially such as NPE. Best Regards Alexander Aristov On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on few other fields in a document and the idea seems to work like a charm in a single solr instance. But, when I have multiple cores and try to do a distributed search ( Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id ) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both the cores(dedupe, dedupe2). Any insights would be highly appreciated. Thanks 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887) at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633) at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407) at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987) at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579) at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
null pointer error with solr deduplication
Hello,

I have been trying out deduplication in Solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the signature created from a few other fields in a document, and the idea seems to work like a charm in a single Solr instance. But when I have multiple cores and try to do a distributed search (http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id) I get the error pasted below. While a normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both of the cores (dedupe, dedupe2). Any insights would be highly appreciated.

Thanks

20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
        at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
        at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Re: Similar documents and advantages / disadvantages of MLT / Deduplication
: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.

Do you actually want all 1000 docs in your index, or do you want to prevent 4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of identifying the 5 similar docs, then use that at index time.

If you want to keep 4 of the 5 out of the index, then use the Deduplication features to prevent the duplicates from being indexed and you're done.

If you want all docs in the index, then you have to decide how you want to mark docs as similar ... do you want only one of those docs to appear in all of your results, or do you want all of them in the results but with an indication that there are other similar docs? If the former: take a look at Grouping and group on your signature field. If the latter: use the MLT component to find similar docs based on the signature field (ie: mlt.fl=signature_t).

https://wiki.apache.org/solr/FieldCollapsing

-Hoss
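[Editor's note] As a hedged illustration of the first option (one representative per near-duplicate group), here is a small SolrJ sketch that groups on the signature field. It assumes the signature field is indexed as a single token (e.g. a string-style field) so that grouping on it is meaningful; the field name matches the configuration quoted in this thread and the rest is an example.

    // Collapse search results so only one document per signature group is returned.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupBySignatureExample {
        public static QueryResponse search(SolrClient client, String userQuery) throws Exception {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("group", true);                 // enable result grouping
            q.set("group.field", "signature_t");  // the field written by the dedupe chain
            q.set("group.limit", 1);              // one representative doc per group
            return client.query(q);
        }
    }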
Similar documents and advantages / disadvantages of MLT / Deduplication
Hello folks, I have questions about MLT and Deduplication and what would be the best choice in my case.

Case: I index 1000 docs, 5 of them are 95% the same (for example: copy-pasted blog articles from different sources, with slight changes (author name, etc.)). But they have differences. *Now I would like to see 1 doc in my result set and the other 4 should be marked as similar.*

With *MLT*:

  <str name="mlt.fl">text</str>
  <int name="mlt.minwl">5</int>
  <int name="mlt.maxwl">50</int>
  <int name="mlt.maxqt">3</int>
  <int name="mlt.maxntp">5000</int>
  <bool name="mlt.boost">true</bool>
  <str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this 1 doc, unfortunately too much.

*Deduplication*: I index these docs now with a signature, using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures? I only want to see the 5 similar docs, nothing else. Which of these two cases is relevant to me? Can I tune the MLT for my requirement? Or should I use Dedupe? Thanks and Regards Vadim
Re: A good signature class for deduplication
: I want to deduplicate documents from search results. What should be the : parameters on which I should decide an efficient SignatureClass? Also, what : are the SignatureClasses available?

The signature classes available are the ones mentioned on the wiki... https://wiki.apache.org/solr/Deduplication ...which one you should choose, and which fields you feed it, depend entirely on your goal -- if you want to deduplicate anytime both the user_fname and user_lname fields are exactly the same, then use those fields with either the MD5Signature or the Lookup3Signature (lookup3 is faster, but some people want MD5 because they want to use the computed MD5 for other things). If you want to detect when some much longer body field containing a lot of full text is *nearly* identical, then you should consider the TextProfileSignature -- how exactly it works and how you tune it I don't know off the top of my head. -Hoss
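A sketch of what the exact-match case could look like in solrconfig.xml; the field names (user_fname, user_lname, signature) are just the example names from above and should be adjusted to your schema:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">user_fname,user_lname</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

For the near-duplicate case, point fields at the long body field and swap the signatureClass for solr.processor.TextProfileSignature.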
Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case
Solr 3.3 has a Grouping feature. Is it practically the same as deduplication? Here is my use case for duplicate removal - We have many documents with similar (up to 99%) content. Upon some search queries, almost all of them come up in the first page of results. Of all these documents, essentially one is the original and the others are duplicates. We are able to identify the original on the basis of a number of factors - who uploaded it, when, how many viral shares. It is also possible that the duplicates are uploaded earlier (and hence exist in the search index) while the original is uploaded later (and gets added later to the index). AFAIK, Deduplication targets index time. Is there a way to specify the original, which should be returned, and the duplicates, which should be kept from coming up? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case
Deduplication uses Lucene's IndexWriter.updateDocument with the signature term. I don't think it's possible, as a default feature, to choose which document to index; the original would always have to be the last one indexed. From the IndexWriter.updateDocument javadoc: "Updates a document by first deleting the document(s) containing term and then adding the new document. The delete and then add are atomic as seen by a reader on the same index (flush may happen only after the add)." With grouping you have all your documents indexed, so it gives you more flexibility. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-3-3-Grouping-vs-DeDuplication-and-Deduplication-Use-Case-tp3294711p3295023.html Sent from the Solr - User mailing list archive at Nabble.com.
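If choosing the "original" at query time is acceptable, a hedged sketch of that flexibility: keep every document, group on the signature, and sort within each group by whatever marks the original. The signature field name and the viral_shares sort field are hypothetical, and group.sort assumes a Solr version whose grouping supports it:

http://localhost:8983/solr/select?q=foo&group=true&group.field=signature&group.limit=1&group.sort=viral_shares+desc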
Re: How to combine Deduplication and Elevation
: Hi I have a question. How to combine the Deduplication and Elevation : implementations in Solr. Currently, I have managed to implement only one of them at a time.

Can you elaborate a bit more on what exactly you've tried and what problem you are facing? The SignatureUpdateProcessorFactory (which is used for Deduplication) and the QueryElevation component should work just fine together -- in fact: one is used at index time and the other at query time, so there shouldn't be any interaction at all... http://wiki.apache.org/solr/Deduplication http://wiki.apache.org/solr/QueryElevationComponent -Hoss
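For reference, a minimal sketch of the two halves wired together, following the standard Solr example configs; treat it as a starting point rather than a drop-in configuration:

<!-- index time: run the dedupe chain on every update -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<!-- query time: elevation -->
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler">
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>

The dedupe chain itself is the usual SignatureUpdateProcessorFactory chain shown elsewhere in these threads; the two pieces never touch the same request.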
How to combine Deduplication and Elevation
Hi I have a question. How to combine the Deduplication and Elevation implementations in Solr. Currently , I managed to implement either one only. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-combine-Deduplication-and-Elevation-tp2819621p2819621.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication questions
: Q1. Is it possible to pass *analyzed* content to the : : public abstract class Signature {

No, analysis happens as the documents are being written to the Lucene index, well after the UpdateProcessors have had a chance to interact with the values.

: Q2. Method calculate() is using concatenated fields from <str name="fields">name,features,cat</str> : Is there any mechanism I could use to build field-dependent signatures?

At the moment the Signature API is fairly minimal, but it could definitely be improved by adding more methods (with sensible defaults in the base class) that would give the impl more control over the resulting signature ... we just need people to propose good suggestions with example use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but : would work)

I don't know whether what you describe is really intended or not, but it should work -Hoss
Re: Question about http://wiki.apache.org/solr/Deduplication
Thanks Hoss, externalizing this part is exactly the path we are exploring now, not only for this reason. We already started testing Hadoop SequenceFile as a write-ahead log for updates/deletes. SequenceFile supports append now (simply great!). It was a pain to have to add Hadoop into the mix for mortal collection sizes of 200 Mio, but on the other side, having Hadoop around offers huge flexibility. The write-ahead log catches update commands (all Solr slaves and fronting clients accept updates, but only to forward them to the WAL). The Solr master tries to catch up with the update stream, indexing in async fashion, and finally the Solr slaves chase the master index with standard Solr replication. Overnight we run simple map-reduce jobs to consolidate, normalize and sort the update stream and reindex at the end. Deduplication and collection sorting is for us only an optimization if done reasonably often, like once per day/week, but if we do not do it, it doubles HW resources. Imo, native WAL support in Solr would definitely be a nice-to-have (for HA, update scalability...). The charming thing about a WAL is that updates never wait/disappear; with too much traffic we only get slightly higher update latency, but updates definitely get processed. Some basic primitives on the WAL (consolidation, replaying the update stream on Solr etc...) should be supported in this case, a sort of smallish Hadoop feature subset for Solr clusters, but nothing oversized. Cheers, eks

On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : Is it possible in solr to have multivalued id? Or I need to make my : own mv_ID for this? Any ideas how to achieve this efficiently? This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication by changing the low-level update (implemented as a delete then add) so that the key used to delete the older documents is based on the signature field instead of the id field. In order to do what you are describing, you would need to query the index for matching signatures, then add the resulting ids to your document before doing that update. You could possibly do this in a custom UpdateProcessor, but you'd have to do something tricky to ensure you didn't overlook docs that had been added but not yet committed when checking for dups. I don't have a good suggestion for how to do this internally in Solr -- it seems like the type of bulk processing logic that would be better suited for an external process before you ever start indexing (much like link analysis for back references) -Hoss
Re: Question about http://wiki.apache.org/solr/Deduplication
: Is it possible in solr to have multivalued id? Or I need to make my : own mv_ID for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to help you with -- it does the deduplication by changing the low-level update (implemented as a delete then add) so that the key used to delete the older documents is based on the signature field instead of the id field. In order to do what you are describing, you would need to query the index for matching signatures, then add the resulting ids to your document before doing that update. You could possibly do this in a custom UpdateProcessor, but you'd have to do something tricky to ensure you didn't overlook docs that had been added but not yet committed when checking for dups. I don't have a good suggestion for how to do this internally in Solr -- it seems like the type of bulk processing logic that would be better suited for an external process before you ever start indexing (much like link analysis for back references) -Hoss
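A sketch of the lookup step described above, assuming the signature value is known before the add (the field name and value are placeholders from the example in the question):

http://localhost:8983/solr/select?q=signature:A&fl=id&rows=1000

The ids returned would then be merged into the incoming document's multivalued id field before sending the add, with the caveat noted above about documents that are added but not yet committed.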
Deduplication questions
Q1. Is it possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) { }
  public abstract String calculate(String content);
}

Q2. Method calculate() is using concatenated fields from <str name="fields">name,features,cat</str>. Is there any mechanism I could use to build field-dependent signatures?

Use case for this: I have two fields: OWNER, TEXT. I need to disable *fuzzy* duplicates for one owner; one clean way would be to make a prefixed signature OWNER/FUZZY_SIGNATURE.

Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but would work)

<updateRequestProcessorChain name="signature_hard">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">exact_signature</str>
    <str name="fields">OWNER</str>
    <str name="signatureClass">ExactSignature</str>
  </processor>
</updateRequestProcessorChain>

hard_signature should be a not-stored and not-indexed field

<updateRequestProcessorChain name="fuzzy_and_mix">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">mixed_signature</str>
    <str name="fields">exact_signature, TEXT</str>
    <str name="signatureClass">MixedSignature</str>
  </processor>
</updateRequestProcessorChain>

<field name="hard_signature" type="string" stored="false" indexed="false" multiValued="false" />
<field name="mixed_signature" type="string" stored="true" indexed="true" multiValued="false" />

Assuming I know how long my exact_signature is, I could calculate the fuzzy part and mix it properly. Possible? Better ideas? Thanks, eks
Question about http://wiki.apache.org/solr/Deduplication
Hi, Use case I am trying to figure out is about preserving IDs without re-indexing on duplicate, rather adding this new ID under list of document id aliases. Example: Input collection: id:1, text:dummy text 1, signature:A id:2, text:dummy text 1, signature:A I add the first document in empty index, text is going to be indexed, ID is going to be 1, so far so good Now the question, if I add second document with id == 2, instead of deleting/indexing this new document, I would like to store id == 2 in multivalued Field id At the end, I would have one document less indexed and both ID are going to be searchable (and stored as well)... Is it possible in solr to have multivalued id? Or I need to make my own mv_ID for this? Any ideas how to achieve this efficiently? My target is not to add new documents if signature matches, but to have IDs indexed and stored? Thanks, eks
SOLR deduplication
Hi - I have the SOLR deduplication configured and working well. Is there any way I can tell which documents have not been added to the index as a result of the deduplication rejecting subsequent identical documents? Many Thanks Jason Brown.
Re: SOLR deduplication
Not right now: https://issues.apache.org/jira/browse/SOLR-1909 Hi - I have the SOLR deduplication configured and working well. Is there any way I can tell which documents have not been added to the index as a result of the deduplication rejecting subsequent identical documents? Many Thanks Jason Brown.
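Until SOLR-1909 lands, one hedged workaround is to index with overwriteDupes=false, so duplicates are kept but share a signature value, and then spot them with a facet query afterwards (the signature field name is whatever your dedupe chain writes to):

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=signature&facet.mincount=2

Every facet value with a count above 1 is a group of documents the dedupe chain would otherwise have collapsed.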
Re: Is deduplication possible during Tika extract?
In my opinion it should work for every update handler. If you're really sure your configuration if fine and it still doesn't work you might have to file an issue. Your configuration looks alright but don't forget you've configured overwriteDupes=false! Hello, here is an excerpt of my solrconfig.xml: requestHandler name=/update/extract class=org.apache.solr.handler.extraction.ExtractingRequestHandler startup=lazy lst name=defaults str name=update.processordedupe/str !-- All the main content goes into text... if you need to return the extracted text or do highlighting, use a stored field. -- str name=fmap.contenttext/str str name=lowernamestrue/str str name=uprefixignored_/str !-- capture link hrefs but ignore div attributes -- str name=captureAttrtrue/str str name=fmap.alinks/str str name=fmap.divignored_/str /lst /requestHandler and updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsignature/str bool name=overwriteDupesfalse/bool str name=fieldstext/str str name=signatureClassorg.apache.solr.update.processor.TextProfileSignature /str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain deduplication works when I use only /update but not when solr does an extract with Tika! Is deduplication possible during Tika extract? Thanks in advance, Arno
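One more thing worth checking if the chain is silently ignored: on Solr releases after 1.4 the request parameter that selects the chain was renamed from update.processor to update.chain, so on newer versions the defaults block would look roughly like this (a sketch, not a verified config for any particular release):

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>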
Is deduplication possible during Tika extract?
Hello, here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
    <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Deduplication works when I use only /update but not when Solr does an extract with Tika! Is deduplication possible during a Tika extract? Thanks in advance, Arno
Solr Deduplication and Field Collapsing
All, I have set up Nutch to submit the crawl results to a Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is a field 'digest' that Nutch generates that is the same for those documents that are duplicates. While setting up the dedupe processor in the Solr config file, I have used this 'digest' field in the following way (see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true', because for non-Nutch-generated documents the digest field will not be populated, and I found that Solr deletes every one of those documents that do not have the digest field populated. Probably because they all will have the same 'sig' field value, generated from an 'empty' digest field, forcing Solr to delete everything? In any case, given the scenario I thought I would set 'overwritedupes=false' and use field collapsing based on the digest or sig field, but I could not get field collapsing to work. Based on the wiki documentation I was adding the query string group=true&group.field=sig or group=true&group.field=digest to my overall query in the admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the Nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is set up. This is the reason for me to try deduplication. I cannot submit the SolrDedup command from Nutch because non-Nutch-generated documents do not have the digest field populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to try deduplication on the Solr side. Thanks so much in advance for your help. Here is my configuration:

SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

Thanks so much for your help
RE: Solr Deduplication and Field Collapsing
You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route but correct me if i'm wrong. Cheers, -Original message- From: Nemani, Raj raj.nem...@turner.com Sent: Tue 28-09-2010 23:28 To: solr-user@lucene.apache.org; Subject: Solr Deduplication and Field Collpasing All, I have setup Nutch to submit the crawl results to Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is filed 'digest' that Nutch generates that is same for those documents that are duplicates. While setting up the the dedupe processor in the Solr config file, I have used this 'Digest' field in the following way(see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true because for non-Nutch generated documents the digest field will not be populated and I found that Solr deletes every one of those documents that do not have the digest filed populated. Probably because they all will have the same 'sig' filed value generated based on an 'empty' digest field forcing Solr to delete everything? In any case, given the scenario I though I would set 'overwritedupes=false' and use field collapsing based on digest or sig filed but I could not get filed collapsing to work. Based on the wiki documentation I was adding the query string group=truegroup.filed=sig group=truegroup.filed=digest to my over all query in admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is setup. This is reason for me to try deduplication. I cannot submit SolrDedup command from Nutch because non-Nutch generated documents do not have digest filed populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to do try deduplication on Solr side. Thanks so much in advance for your help. Here is my configuration: SolrConfig.xml updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsig/str bool name=overwriteDupesfalse/bool str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature /str str name=fieldsdigest/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler Schema.xml field name=sig type=string stored=true indexed=true multiValued=true / Thanks so much for your help
Re: Solr Deduplication and Field Collapsing
I have the digest field already in the schema because the index is shared between nutch docs and others. I do not know if the second approach is the quickest in my case. I can set the digest value to something unique for non nutch documets easily (I have an I'd field that I can use to populate the digest field during indxing of new non_nutch documets. I have custom tool that does the indexing of these docs). But I have more than3 millon documents in the index already that I don't want start over with new indexing again if I don't have to. Is there a way I can update the digest field with the value from the corresponding I'd field using solr? Thanks Raj - Original Message - From: Markus Jelsma markus.jel...@buyways.nl To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Tue Sep 28 18:19:17 2010 Subject: RE: Solr Deduplication and Field Collpasing You could create a custom update processor that adds a digest field for newly added documents that do not have the digest field themselves. This way, the documents that are not added by Nutch get a proper non-empty digest field so the deduplication processor won't create the same empty hash and overwrite those. Or you could extend org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips documents with an empty digest field. I'd think the latter would be the quickest route but correct me if i'm wrong. Cheers, -Original message- From: Nemani, Raj raj.nem...@turner.com Sent: Tue 28-09-2010 23:28 To: solr-user@lucene.apache.org; Subject: Solr Deduplication and Field Collpasing All, I have setup Nutch to submit the crawl results to Solr index. I have some duplicates in the documents generated by the Nutch crawl. There is filed 'digest' that Nutch generates that is same for those documents that are duplicates. While setting up the the dedupe processor in the Solr config file, I have used this 'Digest' field in the following way(see below for config details). Since my index has documents other than the ones generated by Nutch I cannot use 'overwritedupes=true because for non-Nutch generated documents the digest field will not be populated and I found that Solr deletes every one of those documents that do not have the digest filed populated. Probably because they all will have the same 'sig' filed value generated based on an 'empty' digest field forcing Solr to delete everything? In any case, given the scenario I though I would set 'overwritedupes=false' and use field collapsing based on digest or sig filed but I could not get filed collapsing to work. Based on the wiki documentation I was adding the query string group=truegroup.filed=sig group=truegroup.filed=digest to my over all query in admin console and I still got the duplicate documents in the results. Is there anything special I need to do to get field collapsing working? I am running Solr 1.4. All this is because Nutch thinks that (url *is* the unique id for the nutch document) http://mysite.mydomain.com/index.html and http://mysite/index.html (the difference is only in the alias and for an internal site both are valid) are different documents depending on how the link is setup. This is reason for me to try deduplication. I cannot submit SolrDedup command from Nutch because non-Nutch generated documents do not have digest filed populated and I read on the mailing lists that this will cause the SolrDedup initiated from Nutch to fail. This forced me to do try deduplication on Solr side. Thanks so much in advance for your help. 
Here is my configuration: SolrConfig.xml updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldsig/str bool name=overwriteDupesfalse/bool str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature /str str name=fieldsdigest/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler Schema.xml
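For the follow-up question about copying the id into the digest field without a full reindex: Solr 1.4 has no stock processor for that, but later releases (4.x and up) ship solr.CloneFieldUpdateProcessorFactory, which could be dropped in front of the signature processor. A hedged sketch, assuming such a version and the field names used in this thread:

<updateRequestProcessorChain name="dedupe">
  <!-- copy id into digest so non-Nutch documents get a unique digest value -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">id</str>
    <str name="dest">digest</str>
  </processor>
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Note the clone runs for every document, so Nutch documents that already carry a digest would pick up a second value unless it is trimmed again (for example with a first-value processor); on 1.4 the custom-UpdateProcessor route described above remains the realistic option.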
Re: Deduplication
Basically for some use cases I would like to show duplicates, for others I want them ignored. If I have overwriteDupes=false and I just create the dedup hash, how can I query for only unique hash values... ie something like a SQL group by. TermsComponent maybe? or faceting? q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 if you append facet.mincount=1 to the above url you can see your duplications
Re: Deduplication
TermsComponent maybe? or faceting? q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0 if you append facet.mincount=1 to the above url you can see your duplications After re-reading your message: sometimes you want to show duplicates, sometimes you don't want them. I have never used FieldCollapsing myself but have heard about it many times. http://wiki.apache.org/solr/FieldCollapsing
Deduplication
Basically for some uses cases I would like to show duplicates for other I wanted them ignored. If I have overwriteDupes=false and I just create the dedup hash how can I query for only unique hash values... ie something like a SQL group by. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Deduplication-tp828016p828016.html Sent from the Solr - User mailing list archive at Nabble.com.
Config issue for deduplication
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus

I did: - create a duplicated set of records, only shifted their ID by a fixed number

--- solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">dedupeHash</str>
    <str name="fields">reference,issn</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

--- In schema.xml I added the field

<field name="dedupeHash" type="string" stored="true" indexed="true" multiValued="false" />

-- If I look at the created field dedupeHash it seems to be empty...!?
Re: Config issue for deduplication
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication.

Does "being imported" mean you are using DataImportHandler? If yes you can use this to enable DIH with dedupe:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
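With that defaults block in place, an ordinary import run should pass every document through the dedupe chain; for example (host and handler path as registered above):

http://localhost:8983/solr/dataimport?command=full-import&commit=true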
Re: Config issue for deduplication
Hmm, I can't find anything in solrconfig.xml about DataImportHandler for Vufind. So I suppose, no, the import function does not use this method. Import is done by a script. Maybe I am not associating

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

with the correct requestHandler? I placed it directly after

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

so I kind of have this line twice. Markus

Ahmet Arslan schrieb: I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. Does "being imported" mean you are using DataImportHandler? If yes you can use this to enable DIH with dedupe:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
RE: Config issue for deduplication
What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!?
Re: Config issue for deduplication
I use bool name=overwriteDupestrue/bool and a different field than ID to control duplication. This is about bibliographic data coming from different sources with different IDs which may have the same content... I attached solrconfig.xml if you want to take a look. Thanks a lot! Markus Markus Jelsma schrieb: What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!? ?xml version=1.0 encoding=UTF-8 ? !-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the License); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -- config !-- Set this to 'false' if you want solr to continue working after it has encountered an severe configuration error. In a production environment, you may want solr to keep working even if one handler is mis-configured. You may also set this to false using by setting the system property: -Dsolr.abortOnConfigurationError=false -- abortOnConfigurationError${solr.abortOnConfigurationError:false}/abortOnConfigurationError !-- Used to specify an alternate directory to hold all index data other than the default ./data under the Solr home. If replication is in use, this should match the replication configuration. -- dataDir${solr.solr.home:./solr}/biblio/dataDir indexDefaults !-- Values here affect all index writers and act as a default unless overridden. -- useCompoundFilefalse/useCompoundFile mergeFactor10/mergeFactor !-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first. 
-- !--maxBufferedDocs1000/maxBufferedDocs-- !-- Tell Lucene when to flush documents to disk. Giving Lucene more memory for indexing means faster indexing at the cost of more RAM If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first. -- ramBufferSizeMB32/ramBufferSizeMB maxMergeDocs2147483647/maxMergeDocs maxFieldLength1/maxFieldLength writeLockTimeout1000/writeLockTimeout commitLockTimeout1/commitLockTimeout !-- Expert: Turn on Lucene's auto commit capability. TODO: Add recommendations on why you would want to do this. NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality -- !--luceneAutoCommitfalse/luceneAutoCommit-- !-- Expert: The Merge Policy in Lucene controls how merging is handled by Lucene. The default in 2.3 is the LogByteSizeMergePolicy, previous versions used LogDocMergePolicy. LogByteSizeMergePolicy chooses segments to merge based on their size. The Lucene 2.2 default, LogDocMergePolicy chose when to merge based on number of documents Other implementations of MergePolicy must have a no-argument constructor -- !--mergePolicyorg.apache.lucene.index.LogByteSizeMergePolicy
Re: [resolved] Config issue for deduplication
Got it with the help of Demian Katz, main developper of Vufind: The import script of Vufind was bypassing the duplication parameters while writing directly to the SOLR-Index. By deactivitating direct writing to the index and using the standard way it now works! Thanks to all who gave input! Markus Markus Fischer schrieb: I use bool name=overwriteDupestrue/bool and a different field than ID to control duplication. This is about bibliographic data coming from different sources with different IDs which may have the same content... I attached solrconfig.xml if you want to take a look. Thanks a lot! Markus Markus Jelsma schrieb: What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) -Original message- From: Markus Fischer i...@flyingfischer.ch Sent: Thu 13-05-2010 17:01 To: solr-user@lucene.apache.org; Subject: Config issue for deduplication I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. I followed: http://wiki.apache.org/solr/Deduplication Actually nothing happens. All records are being imported without any deduplication. What am I missing? Thanks Markus I did: - create a duplicated set of records, only shifted their ID by a fixed number --- solrconfig.xml requestHandler name=/update class=solr.XmlUpdateRequestHandler lst name=defaults str name=update.processordedupe/str /lst /requestHandler updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupestrue/bool str name=signatureFielddedupeHash/str str name=fieldsreference,issn/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain --- In schema.xml I added the field field name=dedupeHash type=string stored=true indexed=true multiValued=false / -- If I look at the created field dedupeHash it seems to be empty...!?
Re: Solr Cell and Deduplication - Get ID of doc
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request. I am doing this with PHP cURL for extraction and pecl php solr for querying. I am then saving the unique id and dupe hash in a MySQL table which I check against after the doc is indexed in Solr. If it is a dupe I delete the Solr record and discard the file. My problem now is the dupe hash sometimes comes back NULL from Solr although when I check it through Solr Admin it is there. I am working through this now to isolate. I had to set Solr to ALLOW duplicates because I have to somehow know that the file is a dupe and then remove the duplicate files on my filesystem. Based on the extract response I have no way of knowing this if duplicates are disallowed. -Bill On Tue, Mar 2, 2010 at 2:11 AM, Chris Hostetter hossman_luc...@fucit.orgwrote: : To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after adding a doc. Using a unique literal.field value will work -- but only as the value of a secondary field that he can then query on to get the uniqueKeyField value. : : You could create your own unique ID and pass it in with the : : literal.field=value feature. : : By which Lance means you could specify an unique value in a differnet : field from yoru uniqueKey field, and then query on that field:value pair : to get the doc after it's been added -- but that query will only work : until some other version of the doc (with some other value) overwrites it. : so you'd esentially have to query for the field:value to lookup the : uniqueKey. : : it seems like it should definitely be feasible for the : Update RequestHandlers to return the uniqueKeyField values for all the : added docs (regardless of wether the key was included in the request, or : added by an UpdateProcessor -- but i'm not sure how that would fit in with : the SolrJ API. : : would you mind opening a feature request in Jira? : : : : -Hoss : : : : : : -- : Lance Norskog : goks...@gmail.com : -Hoss
Re: Solr Cell and Deduplication - Get ID of doc
: You could create your own unique ID and pass it in with the : literal.field=value feature.

By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will only work until some other version of the doc (with some other value) overwrites it. So you'd essentially have to query for the field:value to look up the uniqueKey. It seems like it should definitely be feasible for the Update RequestHandlers to return the uniqueKeyField values for all the added docs (regardless of whether the key was included in the request, or added by an UpdateProcessor) -- but I'm not sure how that would fit in with the SolrJ API. Would you mind opening a feature request in Jira? -Hoss
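Concretely, the workaround might look like this: pass a tracking value at extract time via the literal.* mechanism and then query it back to find the generated uniqueKey. The tracking_id field name, the value and the file name are made up for illustration:

curl "http://localhost:8983/solr/update/extract?literal.tracking_id=abc123&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@somefile.pdf"

http://localhost:8983/solr/select?q=tracking_id:abc123&fl=id

The second request returns the signature-based id that the dedupe chain assigned, at the cost of a second round trip.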
Re: Solr Cell and Deduplication - Get ID of doc
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler curl 'http://localhost:8983/solr/update/extract?literal.id=doc1commit=true' -F myfi...@tutorial.html This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with the id field (the uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not rely on the ExtractingRequestHandler to create a unique key for you. This command throws away that generated key. On Mon, Mar 1, 2010 at 4:22 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : You could create your own unique ID and pass it in with the : literal.field=value feature. By which Lance means you could specify an unique value in a differnet field from yoru uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will only work until some other version of the doc (with some other value) overwrites it. so you'd esentially have to query for the field:value to lookup the uniqueKey. it seems like it should definitely be feasible for the Update RequestHandlers to return the uniqueKeyField values for all the added docs (regardless of wether the key was included in the request, or added by an UpdateProcessor -- but i'm not sure how that would fit in with the SolrJ API. would you mind opening a feature request in Jira? -Hoss -- Lance Norskog goks...@gmail.com
Re: Solr Cell and Deduplication - Get ID of doc
: To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after adding a doc. Using a unique literal.field value will work -- but only as the value of a secondary field that he can then query on to get the uniqueKeyField value. : : You could create your own unique ID and pass it in with the : : literal.field=value feature. : : By which Lance means you could specify an unique value in a differnet : field from yoru uniqueKey field, and then query on that field:value pair : to get the doc after it's been added -- but that query will only work : until some other version of the doc (with some other value) overwrites it. : so you'd esentially have to query for the field:value to lookup the : uniqueKey. : : it seems like it should definitely be feasible for the : Update RequestHandlers to return the uniqueKeyField values for all the : added docs (regardless of wether the key was included in the request, or : added by an UpdateProcessor -- but i'm not sure how that would fit in with : the SolrJ API. : : would you mind opening a feature request in Jira? : : : : -Hoss : : : : : : -- : Lance Norskog : goks...@gmail.com : -Hoss
Re: Solr Cell and Deduplication - Get ID of doc
Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unreliable. My only option is to somehow return the id in the XML response. Any guidance is greatly appreciated. -Bill On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote: Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldid/str bool name=overwriteDupestrue/bool str name=fieldsattr_content/str str name=signatureClassorg.apache.solr.update.processor./str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain How do I get the id value post Solr processing. Is there someway to modify the curl response so that id is returned. I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name but I would like to do only one request. curl ' http://localhost:8080/solr/update/extract?uprefix=attr_fmap.content=attr_contentcommit=true' -F myfi...@myfile.pdf Thanks, Bill
Re: Solr Cell and Deduplication - Get ID of doc
You could create your own unique ID and pass it in with the literal.field=value feature. http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle billengle...@gmail.com wrote: Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unreliable. My only option is to somehow return the id in the XML response. Any guidance is greatly appreciated. -Bill On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote: Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr. updateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool str name=signatureFieldid/str bool name=overwriteDupestrue/bool str name=fieldsattr_content/str str name=signatureClassorg.apache.solr.update.processor./str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain How do I get the id value post Solr processing. Is there someway to modify the curl response so that id is returned. I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name but I would like to do only one request. curl ' http://localhost:8080/solr/update/extract?uprefix=attr_fmap.content=attr_contentcommit=true' -F myfi...@myfile.pdf Thanks, Bill -- Lance Norskog goks...@gmail.com
Solr Cell and Deduplication - Get ID of doc
Hi - New Solr user here. I am using Solr Cell to index files (PDF, doc, docx, txt, htm, etc.) and there is a good chance that a new file will have duplicate content but not necessarily the same file name. To avoid this I am using the deduplication feature of Solr.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">attr_content</str>
    <str name="signatureClass">org.apache.solr.update.processor.</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How do I get the id value after Solr processing? Is there some way to modify the curl response so that the id is returned? I need this id because I would like to rename the file to the id value. I could probably do a Solr search after the fact to get the id field based on the attr_stream_name, but I would like to do only one request.

curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true' -F myfi...@myfile.pdf

Thanks, Bill
Re: Deduplication in 1.4
Field collapsing has been used by many in their production environment. The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Martijn 2009/11/26 KaktuChakarabati jimmoe...@gmail.com: Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati jimmoe...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Hi Martijn, - Original Message From: Martijn v Groningen martijn.is.h...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 26, 2009 3:19:40 AM Subject: Re: Deduplication in 1.4 Field collapsing has been used by many in their production environment. Got any pointers to public sites you know use it? I know of a high traffic site that used an early version, and it caused performance problems. Is double-tripping still required? The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and Is it also full distributed-search-ready? I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Thanks, Otis Martijn 2009/11/26 KaktuChakarabati : Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Two sites that use field-collapsing: 1) www.ilocal.nl 2) www.welke.nl I'm not sure what you mean with double-tripping? The sites mentioned do not have performance problems that are caused by field collapsing. Field-collapsing currently only supports quasi distributed field-collapsing (as I have described on the Solr wiki). Currently I don't know a distributed field-collapsing algorithm that works properly and does not influence the search time in such a way that the search becomes slow. Martijn 2009/11/26 Otis Gospodnetic otis_gospodne...@yahoo.com: Hi Martijn, - Original Message From: Martijn v Groningen martijn.is.h...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 26, 2009 3:19:40 AM Subject: Re: Deduplication in 1.4 Field collapsing has been used by many in their production environment. Got any pointers to public sites you know use it? I know of a high traffic site that used an early version, and it caused performance problems. Is double-tripping still required? The last few months the stability of the patch grew as quiet some bugs were fixed. The only big feature missing currently is caching of the collapsing algorithm. I'm currently working on that and Is it also full distributed-search-ready? I will put it in a new patch in the coming next days. So yes the patch is very near being production ready. Thanks, Otis Martijn 2009/11/26 KaktuChakarabati : Hey Otis, Yep, I realized this myself after playing some with the dedupe feature yesterday. So it does look like Field collapsing is what I need pretty much. Any idea on how close it is to being production-ready? Thanks, -Chak Otis Gospodnetic wrote: Hi, As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place. What you are describing is closer to field collapsing patch in SOLR-236. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: KaktuChakarabati To: solr-user@lucene.apache.org Sent: Tue, November 24, 2009 5:29:00 PM Subject: Deduplication in 1.4 Hey, I've been trying to find some documentation on using this feature in 1.4 but Wiki page is alittle sparse.. In specific, here's what i'm trying to do: I have a field, say 'duplicate_group_id' that i'll populate based on some offline documents deduplication process I have. All I want is for solr to compute a 'duplicate_signature' field based on this one at update time, so that when i search for documents later, all documents with same original 'duplicate_group_id' value will be rolled up (e.g i'll just get the first one that came back according to relevancy). I enabled the deduplication processor and put it into updater, but i'm not seeing any difference in returned results (i.e results with same duplicate_id are returned separately..) is there anything i need to supply in query-time for this to take effect? what should be the behaviour? is there any working example of this? Anything will be helpful.. Thanks, Chak -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Deduplication in 1.4
Hey Otis,

Yep, I realized this myself after playing with the dedupe feature yesterday. So it does look like field collapsing is pretty much what I need. Any idea how close it is to being production-ready?

Thanks,
-Chak
Deduplication in 1.4
Hey,

I've been trying to find some documentation on using this feature in 1.4, but the wiki page is a little sparse.

Specifically, here's what I'm trying to do: I have a field, say 'duplicate_group_id', that I'll populate based on some offline document-deduplication process I have. All I want is for Solr to compute a 'duplicate_signature' field based on this one at update time, so that when I search for documents later, all documents with the same original 'duplicate_group_id' value will be rolled up (i.e. I'll just get the first one that came back according to relevancy).

I enabled the deduplication processor and put it into the update handler, but I'm not seeing any difference in the returned results (results with the same duplicate_group_id are returned separately). Is there anything I need to supply at query time for this to take effect? What should the behaviour be? Is there any working example of this?

Anything will be helpful.

Thanks,
Chak
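For reference, the kind of update chain Chak describes would be declared in solrconfig.xml roughly as below - a sketch modeled on the Deduplication wiki example, using the field names from his message. Note that, as the replies point out, this only computes a signature at index time (and, with overwriteDupes set to true, deletes duplicates at index time); it never rolls results up at query time.

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">duplicate_signature</str>
      <!-- true would delete older docs carrying the same signature at index time -->
      <bool name="overwriteDupes">false</bool>
      <str name="fields">duplicate_group_id</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>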
Re: Deduplication in 1.4
Hi,

As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it, in order to avoid duplicates in the index in the first place. What you are describing is closer to the field collapsing patch in SOLR-236.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message -
From: KaktuChakarabati jimmoe...@gmail.com
To: solr-user@lucene.apache.org
Sent: Tue, November 24, 2009 5:29:00 PM
Subject: Deduplication in 1.4
Conditional deduplication
If I index a bunch of email documents, is there a way to say "show me all email documents, but only one per To: email address", so that if there are a total of 10 distinct To: values in the corpus, I get back 10 email documents?

I'm aware of http://wiki.apache.org/solr/Deduplication but I want to retain the ability to search across all of my email documents most of the time, and only occasionally search for the distinct ones. Essentially I want to do a SELECT DISTINCT to_field FROM documents, where a normal search is a SELECT * FROM documents.

Thanks for any pointers.
Re: Conditional deduplication
See http://wiki.apache.org/solr/FieldCollapsing

On Wed, Sep 30, 2009 at 4:41 PM, Michael solrco...@gmail.com wrote:
> If I index a bunch of email documents, is there a way to say "show me all
> email documents, but only one per To: email address"?
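For readers arriving later: the SOLR-236 field-collapsing work evolved into the result grouping that ships with later Solr releases, and the "one email per To: address" behaviour can be sketched as a single query. The field name to and the host are assumptions.

  curl "http://localhost:8983/solr/select?q=*:*&group=true&group.field=to&group.limit=1"

Each group then returns only its single most relevant document - effectively the SELECT DISTINCT behaviour described above - while a plain query without the group parameters still searches every document.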
Re: stress tests to DIH and deduplication patch
I have already run out of memory after a cron job that indexes as many times as possible during a day. Will activate GC logging to see what it says...

Thanks!
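GC logging on a JVM of that vintage only needs a few startup flags; a minimal sketch for a Tomcat deployment, where the log path is an assumption:

  # e.g. in Tomcat's setenv.sh
  CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/solr-gc.log"
  export CATALINA_OPTS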
stress tests to DIH and deduplication patch
Hey there,

I am doing some stress tests indexing with DIH. I am indexing a MySQL DB with 140 rows approx. I am also using the DeDuplication patch. I am using Tomcat with a JVM limit of -Xms2000M -Xmx2000M.

I have indexed 3 times using the full-import command without restarting Tomcat or reloading the core between the indexations. I have used jmap and jhat to map heap memory at some moments of the indexations. Here I show the beginning of the maps (I don't show the lower part of the histogram because the object instance numbers are completely stable there). I have noticed that the number of Term, TermInfo and TermQuery instances grows from one indexation to the next... is that normal?

FIRST TIME I INDEX... WITH A MILLION INDEXED DOCS APPROX... HERE INDEXING PROCESS IS STILL RUNNING
268290 instances of class org.apache.lucene.index.Term
215943 instances of class org.apache.lucene.index.TermInfo
129649 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
51537 instances of class org.apache.lucene.search.TermQuery
25457 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1120 instances of class org.apache.lucene.index.FieldInfo
919 instances of class org.apache.catalina.loader.ResourceEntry

FIRST TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
552522 instances of class org.apache.lucene.index.Term
505835 instances of class org.apache.lucene.index.TermInfo
128937 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
48645 instances of class org.apache.lucene.search.TermQuery
24065 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1470 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List

SECOND TIME I INDEX WITH 50 INDEXED DOCS... HERE INDEX PROCESS IS STILL RUNNING
264617 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
262496 instances of class org.apache.lucene.index.Term
116078 instances of class org.apache.lucene.index.TermInfo
53383 instances of class org.apache.lucene.search.TermQuery
42274 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
30230 instances of class org.apache.lucene.search.TermQuery$TermWeight
26044 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15115 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
15115 instances of class org.apache.lucene.search.ReqExclScorer
7325 instances of class org.apache.lucene.search.ConjunctionScorer$1
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1279 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry

SECOND TIME I INDEX WITH 120 INDEXED DOCS... HERE INDEX PROCESS IS STILL RUNNING
574603 instances of class org.apache.lucene.index.Term
423558 instances of class org.apache.lucene.index.TermInfo
141394 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
106729 instances of class org.apache.lucene.search.TermQuery
54858 instances of class org.apache.lucene.index.BufferedDeletes$Num
25347 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
11587 instances of class org.apache.lucene.search.TermQuery$TermWeight
5793 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
5793 instances of class org.apache.lucene.search.ReqExclScorer
2922 instances of class org.apache.lucene.search.ConjunctionScorer$1
2170 instances of class org.apache.lucene.index.FieldInfo
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List

SECOND TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
999753 instances of class org.apache.lucene.index.Term
808190 instances of class org.apache.lucene.index.TermInfo
156511 instances of class org.apache.lucene.search.TermQuery
128975 instances of class org.apache.lucene.index.FreqProxTermsWriter$PostingList
104396 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15401 instances of class org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
14896 instances of class org.apache.lucene.search.TermQuery$TermWeight
7447 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
7447 instances of class org.apache.lucene.search.ReqExclScorer
3025 instances of class
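For completeness, histograms like the ones above come straight from the JDK tools mentioned in the message; a minimal sketch, where the pid and file names are assumptions:

  jmap -histo 12345 | head -40                   # per-class instance counts for the running Tomcat JVM (pid 12345)
  jmap -dump:format=b,file=solr-heap.bin 12345   # full binary heap dump
  jhat -port 7000 solr-heap.bin                  # browse the dump at http://localhost:7000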
Re: stress tests to DIH and deduplication patch
On Wed, Apr 29, 2009 at 7:44 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
> I have noticed that the number of Term, TermInfo and TermQuery instances
> grows from one indexation to the next... is that normal?

Perhaps you should enable GC logging as well. Also, did you actually run out of memory, or are you extrapolating and assuming that it might happen?

--
Regards,
Shalin Shekhar Mangar.
Re: Deduplication patch not working in nightly build
I've seen similar errors when large background merges happen while looping in a result set. See
http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Deduplication patch not working in nightly build
Hey there,

I have been stuck on this problem for 3 days now and have no idea how to sort it. I am using the nightly from a week ago, MySQL, and this driver and url:

driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use the deduplication patch with indexes of 200.000 docs with no problem. When I try a full-import with a db of 1.500.000 it stops indexing at doc number 15.000 approx, showing me the error posted above. Once I get the exception, I restart Tomcat and start a delta-import... this time everything works fine! I need to avoid this error in the full-import. I have tried:

url=jdbc:mysql://localhost/my_db?autoReconnect=true

to sort it, in case the connection was closed because a long time passed until the next doc was indexed, but nothing changed... I keep having this:

Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)

** END NESTED EXCEPTION **

Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)

Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289
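A side note for anyone hitting the same wall: the data source Marc describes lives in DIH's data-config.xml, and with MySQL a common variation is to let the driver stream the big result set instead of buffering it. A rough sketch - batchSize="-1" is DIH's way of requesting the driver's streaming fetch size, and the database name and credentials are placeholders:

  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/my_db?autoReconnect=true"
              batchSize="-1"
              user="solr" password="secret"/>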
Re: Deduplication patch not working in nightly build
I can't imagine why dedupe would have anything to do with this, other than what was said: it perhaps takes a bit longer to get a document to the db, and it times out (maybe a long signature calculation?). Have you tried changing your MySQL settings to allow for a longer timeout? (Sorry, I'm not up to date on what you have tried.) Also, are you using autocommit during the import? If so, you might try turning it off for the full import.

- Mark
Re: Deduplication patch not working in nightly build
Hey there,

I didn't have autoCommit set to true, but I have it sorted! The error stopped appearing after setting the property maxBufferedDocs in solrconfig.xml. I can't exactly understand why, but it just worked. Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same?

Thanks
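For context, both settings live in the indexing section of a 1.x solrconfig.xml; a sketch of the combination Marc ends up with (the values are the ones he reports later in the thread):

  <indexDefaults>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <!-- deprecated in favour of ramBufferSizeMB, but forces smaller, more frequent flushes -->
    <maxBufferedDocs>50</maxBufferedDocs>
  </indexDefaults>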
Re: Deduplication patch not working in nightly build
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
> The error stopped appearing after setting the property maxBufferedDocs in
> solrconfig.xml. I can't exactly understand why, but it just worked.

What I find strange is this line in the exception: "Last packet sent to the server was 202481 ms ago." Something took very, very long to complete, and the connection got closed by the time the next row was fetched from the open resultset.

Just curious, what was the previous value of maxBufferedDocs and what did you change it to?

--
Regards,
Shalin Shekhar Mangar.
Re: Deduplication patch not working in nightly build
Hey Shalin,

In the beginning (when the error was appearing) I had <ramBufferSizeMB>32</ramBufferSizeMB> and no maxBufferedDocs set. Now I have:

<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that by setting maxBufferedDocs to 50 I am forcing more disk writing than I would like... but at least it works fine now (just a bit slower, obviously). I keep saying that the weirdest thing is that I don't have this problem using Solr 1.3, just with the nightly... Even though it's good that it works well now, it would be great if someone could explain why this is happening.
Re: Deduplication patch not working in nightly build
You're basically writing segments more often now, and somehow avoiding a longer merge, I think. Also, deduplication is probably adding enough extra data to your index to hit a sweet spot where a merge takes too long. Or something to that effect - MySQL is especially sensitive to timeouts when doing a select * on a huge db, in my testing.

I didn't understand your answer on the autocommit - I take it you are using it? Or no?

All a guess, but it definitely points to a merge taking a bit long and causing a timeout. I think you can relax the MySQL timeout settings if that is it. I'd like to get to the bottom of this as well, so any other info you can provide would be great.

- Mark
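For readers searching for the same symptom: the timeout knobs alluded to here sit on both sides of the JDBC connection. A rough sketch, not verified against this exact setup - the 600-second values are placeholders, and netTimeoutForStreamingResults requires a reasonably recent Connector/J:

  -- on the MySQL server
  SET GLOBAL net_write_timeout = 600;
  SET GLOBAL net_read_timeout = 600;

and/or, on the driver side, appending a parameter to the DIH JDBC url:

  url="jdbc:mysql://localhost/my_db?netTimeoutForStreamingResults=600"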
Re: Deduplication patch not working in nightly build
Hey Mark,

Sorry I was not specific enough; I meant that I have, and always had, autoCommit=false. I will do some more traces and tests. Will post if I have anything new and important to mention.

Thanks.
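(autoCommit here presumably refers to Solr's auto-commit block in solrconfig.xml rather than JDBC autocommit; a sketch of that block left disabled, with placeholder values:)

  <updateHandler class="solr.DirectUpdateHandler2">
    <!--
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
  </updateHandler>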
Deduplication patch not working in nightly build
Hey there,

I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I have upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:

WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)

** END NESTED EXCEPTION **

Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)

Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
at com.mysql.jdbc.ResultSet.realClose(ResultSet.java:6488)
at com.mysql.jdbc.ResultSet.close(ResultSet.java:736)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.close(JdbcDataSource.java:312
Re: Deduplication patch not working in nightly build
Thanks, I will have a look at my JdbcDataSource. Anyway, it's weird, because using the 1.3 release I don't have that problem...

Shalin Shekhar Mangar wrote:
> Yes, initially I figured that we are accidentally re-using a closed data
> source. But Noble has pinned it right. I guess you can try looking into
> your JDBC driver's documentation for a setting which increases the
> connection alive-ness.
Re: Deduplication patch not working in nightly build
Yeah, looks like it, but... if I don't use the DeDuplication patch everything works perfectly. I can create my indexes using full-import and delta-import without problems. The JdbcDataSource of the nightly is pretty similar to the 1.3 release's... The DeDuplication patch doesn't touch the dataimporthandler classes... that's why I thought the problem was not there (but I can't say it for sure...). I was thinking that the problem has something to do with the UpdateRequestProcessorChain, but I don't know how this part of the source works... I am really interested in updating to the nightly build, as I think the new facet algorithm and SolrDeletionPolicy are really great stuff!

> Marc, I've just committed a fix which may have caused the bug. Can you use
> svn trunk (or the next nightly build) and confirm?

You mean the last nightly build?

Thanks
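For anyone wondering how the UpdateRequestProcessorChain comes into play: the chain is declared in solrconfig.xml and the /dataimport handler is pointed at it by name. A rough sketch - the chain and handler names are assumptions, and the selecting parameter was called update.processor in the 1.3/1.4 era (renamed update.chain in later releases):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>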
Re: Deduplication patch not working in nightly build
Looks like a bug w/ DIH with the recent fixes.
--Noble
Re: Deduplication patch not working in nightly build
I guess the indexing of a doc is taking too long (may be because of the de-dup patch) and the resultset gets closed automaticallly (timed out) --Noble On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Donig this fix I get the same error :( I am going to try to set up the last nigthly build... let's see if I have better luck. The thing is it stop indexing at the doc num 150.000 aprox... and give me that mysql exception error... Without DeDuplication patch I can index 2 milion docs without problems... I am pretty lost with this... :( Shalin Shekhar Mangar wrote: Yes I meant the 05/01/2008 build. The fix is a one line change Add the following as the last line of DataConfig.Entity.clearCache() dataSrc = null; On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese marc.sturl...@gmail.comwrote: Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one works? If the fix you did is not really big can u tell me where in the source is and what is it for? (I have been debuging and tracing a lot the dataimporthandler source and I I would like to know what the imporovement is about if it is not a problem...) Thanks! Shalin Shekhar Mangar wrote: Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm? On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote: looks like a bug w/ DIH with the recent fixes. --Noble On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I was using the Deduplication patch with Solr 1.3 release and everything was working perfectly. Now I upgraded to a nigthly build (20th december) to be able to use new facet algorithm and other stuff and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. 
I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
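For context on Noble's diagnosis: "Last packet sent to the server was 202481 ms ago" is roughly 200 seconds, which fits the picture of the MySQL server giving up on a streaming result set that Solr stopped reading while it was busy indexing (the server-side net_write_timeout defaults to 60 seconds). Below is a minimal data-config.xml dataSource sketch of the kind of mitigation sometimes used for this; the URL, credentials and timeout values are placeholders, and netTimeoutForStreamingResults/autoReconnect are Connector/J options whose availability depends on the driver version.

<!-- data-config.xml (sketch): batchSize="-1" asks DIH to stream rows from MySQL
     instead of buffering the whole result set; the URL parameters raise the
     network timeouts so a slow indexing pass does not get the connection cut. -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?netTimeoutForStreamingResults=3600&amp;autoReconnect=true"
            user="solr"
            password="secret"
            batchSize="-1"/>

Raising net_write_timeout on the MySQL server itself is the equivalent server-side knob.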
Re: Deduplication patch not working in nightly build
Shalin, you mean I should test the 05/01/2008 nightly? Maybe with this one it works? If the fix you did is not really big, can you tell me where in the source it is and what it is for? (I have been debugging and tracing the DataImportHandler source a lot and I would like to know what the improvement is about, if that is not a problem...) Thanks!

Shalin Shekhar Mangar wrote:
Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:
Looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
Hey there, I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
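For readers trying to reproduce the setup being discussed: the Deduplication patch Marc applied is essentially a SignatureUpdateProcessorFactory wired into an update processor chain in solrconfig.xml. Below is a minimal sketch, assuming placeholder field names (name, features) and a schema field called signatureField; the exact parameters may differ from the version of the patch used in this thread.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the computed hash is stored here; with overwriteDupes=true,
         documents that produce the same signature overwrite each other -->
    <str name="signatureField">signatureField</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>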
Re: Deduplication patch not working in nightly build
Marc, I've just committed a fix which may have caused the bug. Can you use svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:
Looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:
Hey there, I was using the Deduplication patch with the Solr 1.3 release and everything was working perfectly. Now I upgraded to a nightly build (20th December) to be able to use the new facet algorithm and other stuff, and DeDuplication is not working any more. I have followed exactly the same steps to apply the patch to the source code. I am getting this error:
WARNING: Error reading data
com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception: java.io.EOFException
[...]
** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
[...]
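For completeness, documents built by DIH only pass through that chain if the /dataimport handler points at it. A sketch of that wiring follows; the handler name and config file name are placeholders, and on some 1.3-era builds the parameter was spelled update.processor rather than update.chain.

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <!-- route documents produced by full-import/delta-import through the dedupe chain above -->
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>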