Re: deduplication of suggester results are not enough

2020-03-26 Thread Michal Hlavac
Hi Roland,

I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at 
index time.
I will publish it on GitHub in a few days and will write to this thread when done.

m.

On štvrtok 26. marca 2020 16:01:57 CET Szűcs Roland wrote:
> Hi All,
> 
> I have been following the suggester-related discussions for quite a while.
> Everybody agrees that it is not the expected behaviour for a Suggester, where
> the terms are the entities and not the documents, to return the same string
> representation several times.
> 
> One suggestion was to deduplicate on the client side of Solr. This is very
> easy in most client solutions, as any set-based data structure solves it.
> 
> *But one important problem is not solved by client-side deduplication:
> suggest.count*.
> 
> If I have 15 matches from the suggester and suggest.count=10, and the first
> 9 matches are the same, I will get back only 2 suggestions after deduplication
> and the remaining 5 unique terms will never be shown.
> 
> What is the solution for this?
> 
> Cheers,
> Roland
> 


deduplication of suggester results are not enough

2020-03-26 Thread Szűcs Roland
Hi All,

I have been following the suggester-related discussions for quite a while.
Everybody agrees that it is not the expected behaviour for a Suggester, where
the terms are the entities and not the documents, to return the same string
representation several times.

One suggestion was to deduplicate on the client side of Solr. This is very
easy in most client solutions, as any set-based data structure solves it.

*But one important problem is not solved by client-side deduplication:
suggest.count*.

If I have 15 matches from the suggester and suggest.count=10, and the first
9 matches are the same, I will get back only 2 suggestions after deduplication
and the remaining 5 unique terms will never be shown.

What is the solution for this?

Cheers,
Roland
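
A common client-side workaround is to over-fetch and deduplicate in the client:
request a suggest.count well above what you intend to display, then keep the first
N unique terms. A minimal SolrJ sketch of that idea; the handler name "/suggest",
the dictionary name "mySuggester", and the URL are assumptions, not taken from
this thread:

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DedupedSuggest {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/suggest");
    q.set("suggest", true);
    q.set("suggest.dictionary", "mySuggester");
    q.set("suggest.q", "solr");
    // Over-fetch: ask for far more than the 10 we want to show,
    // because duplicates can only be removed after the fact.
    q.set("suggest.count", 50);

    QueryResponse rsp = client.query(q);
    List<String> terms = rsp.getSuggesterResponse().getSuggestedTerms().get("mySuggester");

    // LinkedHashSet keeps the suggester's ranking while dropping repeated strings.
    Set<String> unique = new LinkedHashSet<>();
    for (String term : terms) {
      unique.add(term);
      if (unique.size() == 10) {
        break;
      }
    }
    unique.forEach(System.out::println);
    client.close();
  }
}

This only mitigates the problem: if more than 40 of the 50 fetched suggestions are
duplicates, unique terms can still be lost, which is why index-time deduplication
(as Michal describes above) is the cleaner fix.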


Atomic update deletes deduplication signature

2018-08-09 Thread Thomas Eckart

Hello,

I am having trouble when doing atomic updates in combination with 
SignatureUpdateProcessorFactory (on Solr 7.2). Normal commits of new 
documents work as expected and generate a valid signature:


curl "$URL/update?commit=true" -H 'Content-type:application/json' -d 
'{"add":{"doc":{"id": "TEST_ID1", "description": "description", 
"country": "country"}}}' && curl "$URL/select?q=id:TEST_ID1"


"response":{"numFound":1,"start":0,"docs":[
{
   "id":"TEST_ID1",
   "description":["description"],
   "country":["country"],
   "_signature":"e577e465b9099ba8",  <-- valid signature
   "_version_":1608322850016460800}]
}}

However, when updating a field (that is not used for generating the 
signature) the signature is replaced by "":


curl "$URL/update?commit=true" -H 'Content-type:application/json' -d 
'{"add":{"doc":{"id": "TEST_ID1", "country": {"set": "country2"' && 
curl "$URL/select?q=id:TEST_ID1"


"response":{"numFound":1,"start":0,"docs":[
{
   "id":"TEST_ID1",
   "description":["description"],
   "country":["country2"],
   "_signature":"",  <-- broken signature
   "_version_":1608322857485467648}]
}}

This looks a lot like the second problem mentioned in an old Solr JIRA 
issue ([1]). Unfortunately, there is no relevant response in the 
discussion there.

Any ideas how to fix this?

Thank you,
Thomas


solrconfig.xml:

[...]
<processor class="solr.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="signatureField">_signature</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">description</str>
  <str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
[...]



[1] https://issues.apache.org/jira/browse/SOLR-4016
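
For reference, the same add plus atomic "set" update can be expressed with SolrJ;
a minimal sketch of the sequence that reproduces the empty signature described
above (URL and collection name are assumptions):

import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateRepro {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

    // Full document; SignatureUpdateProcessorFactory derives _signature from "description".
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "TEST_ID1");
    doc.addField("description", "description");
    doc.addField("country", "country");
    client.add(doc);
    client.commit();

    // Atomic update of a field that is NOT part of the signature;
    // after this, _signature comes back empty, as described above.
    SolrInputDocument atomic = new SolrInputDocument();
    atomic.addField("id", "TEST_ID1");
    atomic.addField("country", Collections.singletonMap("set", "country2"));
    client.add(atomic);
    client.commit();

    client.close();
  }
}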


RE: Solr Cloud: query elevation + deduplication?

2018-03-06 Thread Markus Jelsma
Hi,

I would not use the ID (uniqueKey) as the signature field; query elevation would 
never work properly with such a setup: change a document's content and it will 
get a new ID.

If I remember correctly, this factory still deletes duplicates if signatureField 
is not the uniqueKey.

Regarding SOLR-3473, nobody seems to be working on that.

Regards,
Markus
 
-Original message-
> From:Ronja Koistinen <ronja.koisti...@helsinki.fi>
> Sent: Monday 5th March 2018 15:32
> To: solr-user@lucene.apache.org
> Subject: Solr Cloud: query elevation + deduplication?
> 
> Hello,
> 
> I am running Solr Cloud 6.6.2 and trying to get query elevation and
> deduplication (with SignatureUpdateProcessor) working at the same time.
> 
> The documentation for deduplication
> (https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not
> specify if the signatureField needs to be the uniqueKey field configured
> in my schema.xml. Currently I have my uniqueKey set to the field
> containing the url of my documents.
> 
> The query elevation seems to reference documents by the uniqueKey in the
> "id" attributes listed in elevate.xml, so having the uniqueKey be the
> url would be beneficial to my process of maintaining the query elevation
> list.
> 
> Also, what is the status of this issue I found?
> https://issues.apache.org/jira/browse/SOLR-3473
> 
> -- 
> Ronja Koistinen
> University of Helsinki
> 
> 


Solr Cloud: query elevation + deduplication?

2018-03-05 Thread Ronja Koistinen
Hello,

I am running Solr Cloud 6.6.2 and trying to get query elevation and
deduplication (with SignatureUpdateProcessor) working at the same time.

The documentation for deduplication
(https://lucene.apache.org/solr/guide/6_6/de-duplication.html) does not
specify if the signatureField needs to be the uniqueKey field configured
in my schema.xml. Currently I have my uniqueKey set to the field
containing the url of my documents.

The query elevation seems to reference documents by the uniqueKey in the
"id" attributes listed in elevate.xml, so having the uniqueKey be the
url would be beneficial to my process of maintaining the query elevation
list.

Also, what is the status of this issue I found?
https://issues.apache.org/jira/browse/SOLR-3473

-- 
Ronja Koistinen
University of Helsinki





Re: Deduplication

2015-05-20 Thread Bram Van Dam
 Write a custom update processor and include it in your update chain.
 You will then have the ability to do anything you want with the entire
 input document before it hits the code to actually do the indexing.

This sounded like the perfect option ... until I read Jack's comment:


 My understanding was that the distributed update processor is near the end
 of the chain, so that running of user update processors occurs before the
 distribution step, but is that distribution to the leader, or distribution
 from leader to replicas for a shard?

That would pose some potential problems.

Would a custom update processor make the solution cloud-safe?

Thx,

 - Bram



Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote:
 Hi Bram,
 what do you mean with :
   I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values  .
 
 This is not reduplication, but simple document filtering based on a
 constraint.
 In the case you want de-duplication ( which seemed from your very first
 part of the mail) here you can find a lot of info :

Not sure whether de-duplication is the right word for what I'm after; I
essentially want a unique constraint on an arbitrary field, without
overwrite semantics, because I want Solr to tell me if a duplicate is
sent to Solr.

I was thinking that the de-duplication feature could accomplish this
somehow.


 - Bram


Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What Solr de-duplication offers you is to calculate a hash for each input
document (based on a set of fields).
You can then select between two options:
 - index everything; documents with the same signature are considered equal, or
 - avoid duplicates by overwriting them.

How the similarity hash is calculated is something you can play with and
customise if needed.

That clarified, do you think it can fit in some way, or are you definitely not
talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

 On 19/05/15 14:47, Alessandro Benedetti wrote:
  Hi Bram,
  what do you mean with :
I
  would like it to provide the unique value myself, without having the
  deduplicator create a hash of field values  .
 
  This is not reduplication, but simple document filtering based on a
  constraint.
  In the case you want de-duplication ( which seemed from your very first
  part of the mail) here you can find a lot of info :

 Not sure whether de-duplication is the right word for what I'm after, I
 essentially want a unique constraint on an arbitrary field. Without
 overwrite semantics, because I want Solr to tell me if a duplicate is
 sent to Solr.

 I was thinking that the de-duplication feature could accomplish this
 somehow.


  - Bram




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam bram.van...@intix.eu wrote:

  Write a custom update processor and include it in your update chain.
  You will then have the ability to do anything you want with the entire
  input document before it hits the code to actually do the indexing.

 This sounded like the perfect option ... until I read Jack's comment:

 
  My understanding was that the distributed update processor is near the
 end
  of the chain, so that running of user update processors occurs before the
  distribution step, but is that distribution to the leader, or
 distribution
  from leader to replicas for a shard?

 That would pose some potential problems.

 Would a custom update processor make the solution cloud-safe?


Starting with Solr 5.1, you have the ability to specify an update processor
on the fly per request, and you can even control whether it is to be
executed before any distribution happens or right before the document is
actually indexed on the replica.

e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to
have processor xyz run first and then MyCustomUpdateProc and then the
default update processor chain (which will also distribute the doc to the
leader or from the leader to a replica). This also means that such
processors will not be executed on the replicas at all.

You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and
MyCustomUpdateProc to run on each replica (including the leader) right
before the doc is indexed (i.e. just before RunUpdateProcessor)

Unfortunately, due to an oversight, this feature hasn't been documented
well which is something I'll fix. See
https://issues.apache.org/jira/browse/SOLR-6892 for more details.
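
A minimal SolrJ sketch of such a request, reusing the processor names from the
example above ("xyz" and "MyCustomUpdateProc" stand for whatever factories are
registered in solrconfig.xml; collection and URL are assumptions):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class PerRequestProcessor {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title", "hello");

    UpdateRequest req = new UpdateRequest();
    // Run these processors before the document is distributed to leader/replicas ...
    req.setParam("processor", "xyz,MyCustomUpdateProc");
    // ... or instead run them on each replica right before RunUpdateProcessor:
    // req.setParam("post-processor", "xyz,MyCustomUpdateProc");
    req.add(doc);
    req.process(client);
    client.commit();
    client.close();
  }
}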



 Thx,

  - Bram




-- 
Regards,
Shalin Shekhar Mangar.


Deduplication

2015-05-19 Thread Bram Van Dam
Hi folks,

I'm looking for a way to have Solr reject documents if a certain field
value is duplicated (reject, not overwrite). There doesn't seem to be
any kind of unique option in schema fields.

The de-duplication feature seems to make this (somewhat) possible, but I
would like to provide the unique value myself, without having the
deduplicator create a hash of field values.

Am I missing an obvious (or less obvious) way of accomplishing this?

Thanks,

 - Bram


Re: Deduplication

2015-05-19 Thread Alessandro Benedetti
Hi Bram,
what do you mean with: "I would like to provide the unique value myself,
without having the deduplicator create a hash of field values"?

This is not deduplication, but simple document filtering based on a
constraint.
In case you want de-duplication (which seemed to be the case from the very
first part of your mail), here you can find a lot of info:

https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know for more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam bram.van...@intix.eu:

 Hi folks,

 I'm looking for a way to have Solr reject documents if a certain field
 value is duplicated (reject, not overwrite). There doesn't seem to be
 any kind of unique option in schema fields.

 The de-duplication feature seems to make this (somewhat) possible, but I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values.

 Am I missing an obvious (or less obvious) way of accomplishing this?

 Thanks,

  - Bram




-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?

William Blake - Songs of Experience -1794 England


Re: Deduplication

2015-05-19 Thread Jack Krupansky
Shawn, I was going to say the same thing, but... then I was thinking about
SolrCloud and the fact that update processors are invoked before the
document is sent to its target node, so there wouldn't be a reliable way to
tell if the input document field value exists on the target rather than
current node.

Or does the update processing only occur on the leader node after being
forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end
of the chain, so that running of user update processors occurs before the
distribution step, but is that distribution to the leader, or distribution
from leader to replicas for a shard?


-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 5/19/2015 3:02 AM, Bram Van Dam wrote:
  I'm looking for a way to have Solr reject documents if a certain field
  value is duplicated (reject, not overwrite). There doesn't seem to be
  any kind of unique option in schema fields.
 
  The de-duplication feature seems to make this (somewhat) possible, but I
  would like it to provide the unique value myself, without having the
  deduplicator create a hash of field values.
 
  Am I missing an obvious (or less obvious) way of accomplishing this?

 Write a custom update processor and include it in your update chain.
 You will then have the ability to do anything you want with the entire
 input document before it hits the code to actually do the indexing.

 A script update processor is included with Solr allows you to write your
 processor in a language other than Java, such as javascript.


 https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

 Here's how to discard a document in an update processor written in Java:


 http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

 The javadoc that I linked above describes the ability to return false
 in other languages to discard the document.

 Thanks,
 Shawn




Re: Deduplication

2015-05-19 Thread Shawn Heisey
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
 I'm looking for a way to have Solr reject documents if a certain field
 value is duplicated (reject, not overwrite). There doesn't seem to be
 any kind of unique option in schema fields.
 
 The de-duplication feature seems to make this (somewhat) possible, but I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values.
 
 Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain.
You will then have the ability to do anything you want with the entire
input document before it hits the code to actually do the indexing.

A script update processor included with Solr allows you to write your
processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:

http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return false
in other languages to discard the document.

Thanks,
Shawn
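
For the Java route, a minimal sketch of such a processor. The field name and the
duplicate check are placeholders; the point is only that throwing a SolrException
fails the request (DB-style unique constraint), while returning without calling
super.processAdd() silently discards the document:

import java.io.IOException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Sketch of a factory to register in an updateRequestProcessorChain. */
public class RejectDuplicateProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new RejectDuplicateProcessor(next);
  }

  static class RejectDuplicateProcessor extends UpdateRequestProcessor {

    RejectDuplicateProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      Object key = doc.getFieldValue("my_unique_field"); // placeholder field name

      if (isDuplicate(key)) {
        // Option 1: fail the request so the client sees an error.
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "Duplicate value for my_unique_field: " + key);
        // Option 2: simply return here without calling super.processAdd(cmd)
        // to silently discard the document instead.
      }
      super.processAdd(cmd); // continue down the chain and index normally
    }

    private boolean isDuplicate(Object key) {
      // Placeholder: a real implementation would search the index here (and worry
      // about race conditions and distributed updates, as discussed in this thread).
      return false;
    }
  }
}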



any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?

2014-03-17 Thread Mobius ReX
For example, consider a big new department merged from three departments. A
few employees worked for two or three departments before the merge, so the
attributes of one person might be listed in several departments' databases.
An additional problem is that one person can have different first names or
nicknames.

The attributes of a person include
first name, last name, email, home phone, cell phone, ssn, address, etc.

Because some of the above values could be empty, there is no unique primary
key.
Hence, we need an intelligent solution for the classification, and a way to
put weights on different matching rules.

Any tips for handling such fast runtime deduplication tasks on big data
(about 100 million records)?
Any open-source project working on this?


Re: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?

2014-03-17 Thread Jack Krupansky

See:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

-- Jack Krupansky

-Original Message- 
From: Mobius ReX

Sent: Monday, March 17, 2014 1:59 PM
To: solr-user@lucene.apache.org
Subject: any project for record linkage, fuzzy grouping, and deduplication 
based on Solr/Lucene?


For example, given a new big department merged from three departments. A
few employees worked for two or three departments before merging. That
means, the attributes of one person might be listed under different
departments' databases. One additional problem is that one person can have
different first names or nick names.

These attributes of a person include
first name, last name, email, home phone, cell phone, ssn, address, etc ...

Because some values of the above could be empty, there is no unique primary
key.
Hence, we need an intelligent solution for the classification, and to put
weights for different matching rules.

Any tips to handle such runtime fast deduplication tasks for big data
(about 100 million records)?
Any open-source project working on this? 



Re: Newbie question on Deduplication overWriteDupes flag

2014-02-06 Thread Chris Hostetter

: How do I achieve, add if not there, fail if duplicate is found. I though

You can use the optimistic concurrency features to do this, by including a 
_version_=-1 field value in the document.

this will instruct solr that the update should only be processed if the 
document does not already exist...

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents




-Hoss
http://www.lucidworks.com/
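
A minimal SolrJ sketch of that approach (URL, collection, and field names are
assumptions); with _version_ set to -1 the add succeeds only if no document with
that id exists yet, otherwise Solr rejects it with a version conflict (HTTP 409):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddIfAbsent {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42");
    doc.addField("title", "first version");
    // _version_ = -1 tells Solr: only process this add if the document does NOT exist yet.
    doc.addField("_version_", -1L);

    try {
      client.add(doc);
      client.commit();
      System.out.println("added");
    } catch (Exception e) {
      // If doc-42 already exists, Solr answers with a version conflict (HTTP 409).
      System.out.println("rejected as duplicate: " + e.getMessage());
    }
    client.close();
  }
}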


Re: Newbie question on Deduplication overWriteDupes flag

2014-02-06 Thread Alexandre Rafalovitch
A follow up question on this (as it is kind of new functionality).

What happens if several documents are submitted and one of them fails
due to that? Do they get rolled back or only one?

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Feb 6, 2014 at 11:17 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : How do I achieve, add if not there, fail if duplicate is found. I though

 You can use the optimistic concurrency features to do this, by including a
 _version_=-1 field value in the document.

 this will instruct solr that the update should only be processed if the
 document does not already exist...

 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents




 -Hoss
 http://www.lucidworks.com/


Newbie question on Deduplication overWriteDupes flag

2014-02-04 Thread aagrawal75
I had a configuration where I had overwriteDupes=false. Result: I got
duplicate documents in the index.

When I changed to overwriteDupes=true, the duplicate documents started
overwriting the older documents.

How do I achieve "add if not there, fail if duplicate is found"? I thought
that overwriteDupes=false would do that.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Newbie-question-on-Deduplication-overWriteDupes-flag-tp4115212.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Deduplication use of overWriteDupes flag

2014-02-04 Thread Amit Agrawal
Hello,

I had a configuration where I had overwriteDupes=false. I added few
duplicate documents. Result: I got duplicate documents in the index.

When I changed to overwriteDupes=true, the duplicate documents started
overwriting the older documents.

Question 1: How do I achieve [add if not there, fail if duplicate is
found], i.e. mimic the behaviour of a DB which fails when trying to insert a
record that violates some unique constraint? I thought that
overwriteDupes=false would do that, but apparently not.

Question2: Is there some documentation around overwriteDupes? I have
checked the existing Wiki; there is very little explanation of the flag
there.

Thanks,

-Amit


Custom update handler with deduplication

2013-12-15 Thread Jorge Luis Betancourt González
Currently I have the following Update Request Processor chain to prevent indexing 
very similar text items into a core dedicated to storing the queries that our 
users type into the web interface of our system.

<!-- Delete similar duplicated documents at index time, using some fuzzy text 
similarity techniques -->
<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">textsuggest,textng</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Right now we are trying to implement a custom update request handler to keep 
track of how many times any given query hits our Solr server; put simply, we want 
to keep a field that counts how many times we have tried to insert the same query. 
We are using Solr 3.6, so how can we use (from the code of our custom update 
handler) the deduplication request processor to check whether the query we are 
trying to insert/update already exists?

Greetings! 

III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero del 
2014. Ver www.uci.cu


Re: Custom update handler with deduplication

2013-12-15 Thread Shalin Shekhar Mangar
Firstly, I see that you have overwriteDupes=false in your
configuration. This means that a signature will be generated but the
similar documents will still be added to the index. Now to your main
question about counting duplicate attempts, one simple way is to have
another UpdateRequestProcessor after the SignatureUpdateProcessor
which keeps a map of Signature to Count. You can even keep this
counter inside the Solr document itself: first read the old counter
value by querying the signatureField, then write the new value in the
new document. Be careful about race conditions if you're
reading from the index because indexing can happen in multiple
threads.
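
A minimal sketch of the in-memory variant described above: an extra
UpdateRequestProcessor placed after the SignatureUpdateProcessorFactory that
counts insert attempts per signature. The count field name, the factory wiring
(omitted here), and the use of a recent JDK are assumptions, and the map is lost
on restart:

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

/** Placed after SignatureUpdateProcessorFactory, so the "signature" field is already set. */
public class SignatureCountingProcessor extends UpdateRequestProcessor {

  // Shared map of signature -> number of insert attempts seen since startup.
  private static final ConcurrentHashMap<String, AtomicInteger> COUNTS = new ConcurrentHashMap<>();

  public SignatureCountingProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object sig = doc.getFieldValue("signature");
    if (sig != null) {
      int count = COUNTS.computeIfAbsent(sig.toString(), k -> new AtomicInteger())
                        .incrementAndGet();
      // Store the running attempt count on the document (assumes an int field "insert_count").
      doc.setField("insert_count", count);
    }
    super.processAdd(cmd); // continue down the chain (LogUpdateProcessor, RunUpdateProcessor, ...)
  }
}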

On Mon, Dec 16, 2013 at 9:01 AM, Jorge Luis Betancourt González
jlbetanco...@uci.cu wrote:
 Currently I've the following Update Request Processor chain to prevent 
 indexing very similar text items into a core dedicated to store queries that 
 our users put into the web interface of our system.

 !-- Delete similar duplicated documents on index time, using some fuzzy text 
 similary techniques --
 updateRequestProcessorChain name=dedupe
 processor 
 class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory
   bool name=enabledtrue/bool
   bool name=overwriteDupesfalse/bool
   str name=signatureFieldsignature/str
   str name=fieldstextsuggest,textng/str
   str 
 name=signatureClassorg.apache.solr.update.processor.TextProfileSignature/str
 /processor
 processor class=solr.LogUpdateProcessorFactory /
 processor class=solr.RunUpdateProcessorFactory /
 /updateRequestProcessorChain

 Right now we are trying to implement a custom update request handler to keep 
 track of how many any given query hits our solr server, in plain simple we 
 want to keep a field that counts how many we have tried to insert the same 
 query. We are using Solr 3.6, so how can we use (from the code of our custom 
 update handler) the deduplicatin request processor to check if the query we 
 are trying to insert/update already exists?

 Greetings!
 
 III Escuela Internacional de Invierno en la UCI del 17 al 28 de febrero del 
 2014. Ver www.uci.cu



-- 
Regards,
Shalin Shekhar Mangar.


Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

2013-05-02 Thread Furkan KAMACI
I use Solr 4.2.1 as SolrCloud. I crawl huge amounts of data with Nutch and
index it with SolrCloud. I wonder about Solr's deduplication mechanism: what
exactly does it do, and does it result in slow indexing, or is it beneficial
for my situation?


RE: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing

2013-05-02 Thread Markus Jelsma
Distributed deduplication does not work right now:
https://issues.apache.org/jira/browse/SOLR-3473

We've chosen not to use update processors for deduplication anymore and rely on 
several custom MapReduce jobs in Nutch and some custom collectors in Solr to do 
some on-demand online deduplication.

If SOLR-3473 is fixed you can get very decent deduplication.

-Original message-
 From:Furkan KAMACI furkankam...@gmail.com
 Sent: Thu 02-May-2013 22:30
 To: solr-user@lucene.apache.org
 Subject: Pros and Cons of Using Deduplication of Solr at Huge Data Indexing
 
 I use Solr 4.2.1 as SolrCloud. I crawl huge data with Nutch and index them
 with SolrCloud. I wonder about Solr's deduplication mechanism. What exactly
 it does and does it results with a slow indexing or is it beneficial for my
 situation?
 


Deduplication in SolrCloud

2012-07-27 Thread Daniel Brügge
Hi,

in my old Solr setup I used the deduplication feature in the update chain
with a couple of fields.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">uuid,type,url,content_hash</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

This worked fine. But when I now use this in my 2-shard SolrCloud setup and
insert 150,000 documents,
I always get an error:

*INFO: end_commit_flush*
*Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
*SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
unable to create new native thread*
* at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
*
* at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
*

I am inserting the documents via CSV import with curl, split into 50k chunks.

Without the dedupe chain, the import finishes after 40secs.

The curl command writes to one of my shards.


Do you have an idea why this happens? Should I reduce the fields to one? I
have read that not using the id as a dedupe field could be an issue.


I have searched for deduplication with SolrCloud and I am wondering whether
it already works correctly; see e.g.
http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

Thanks & regards

Daniel


RE: Deduplication in SolrCloud

2012-07-27 Thread Markus Jelsma
This issue doesn't really describe your problem but a more general problem of 
distributed deduplication:
https://issues.apache.org/jira/browse/SOLR-3473
 
 
-Original message-
 From:Daniel Brügge daniel.brue...@googlemail.com
 Sent: Fri 27-Jul-2012 17:38
 To: solr-user@lucene.apache.org
 Subject: Deduplication in SolrCloud
 
 Hi,
 
 in my old Solr Setup I have used the deduplication feature in the update
 chain
 with couple of fields.
 
 updateRequestProcessorChain name=dedupe
  processor class=solr.processor.SignatureUpdateProcessorFactory
 bool name=enabledtrue/bool
  str name=signatureFieldsignature/str
 bool name=overwriteDupesfalse/bool
  str name=fieldsuuid,type,url,content_hash/str
 str
 name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str
  /processor
 processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
 /updateRequestProcessorChain
 
 This worked fine. When I now use this in my 2 shards SolrCloud setup when
 inserting 150.000 documents,
 I am always getting an error:
 
 *INFO: end_commit_flush*
 *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
 *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
 unable to create new native thread*
 * at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
 *
 * at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
 *
 
 I am inserting the documents via CSV import and curl command and split them
 also into 50k chunks.
 
 Without the dedupe chain, the import finishes after 40secs.
 
 The curl command writes to one of my shards.
 
 
 Do you have an idea why this happens? Should I reduce the fields to one? I
 have read that not using the id as
 dedupe fields could be an issue?
 
 
 I have searched for deduplication with SolrCloud and I am wondering if it
 is already working correctly? see e.g.
 http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
 
 Thanks  regards
 
 Daniel
 


Re: Deduplication in SolrCloud

2012-07-27 Thread Lance Norskog
Should the old Signature code be removed? Given that the goal is to
have everyone use SolrCloud, maybe this kind of landmine should be
removed?

On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
 This issue doesn't really describe your problem but a more general problem of 
 distributed deduplication:
 https://issues.apache.org/jira/browse/SOLR-3473


 -Original message-
 From:Daniel Brügge daniel.brue...@googlemail.com
 Sent: Fri 27-Jul-2012 17:38
 To: solr-user@lucene.apache.org
 Subject: Deduplication in SolrCloud

 Hi,

 in my old Solr Setup I have used the deduplication feature in the update
 chain
 with couple of fields.

 updateRequestProcessorChain name=dedupe
  processor class=solr.processor.SignatureUpdateProcessorFactory
 bool name=enabledtrue/bool
  str name=signatureFieldsignature/str
 bool name=overwriteDupesfalse/bool
  str name=fieldsuuid,type,url,content_hash/str
 str
 name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str
  /processor
 processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
 /updateRequestProcessorChain

 This worked fine. When I now use this in my 2 shards SolrCloud setup when
 inserting 150.000 documents,
 I am always getting an error:

 *INFO: end_commit_flush*
 *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
 *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
 unable to create new native thread*
 * at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
 *
 * at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
 *

 I am inserting the documents via CSV import and curl command and split them
 also into 50k chunks.

 Without the dedupe chain, the import finishes after 40secs.

 The curl command writes to one of my shards.


 Do you have an idea why this happens? Should I reduce the fields to one? I
 have read that not using the id as
 dedupe fields could be an issue?


 I have searched for deduplication with SolrCloud and I am wondering if it
 is already working correctly? see e.g.
 http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html

 Thanks  regards

 Daniel




-- 
Lance Norskog
goks...@gmail.com


Deduplication in MLT

2012-06-12 Thread Pranav Prakash
I have an implementation of Deduplication as mentioned at
http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search
results. I would like to achieve the same functionality in my MLT queries,
where the result set should include grouped documents. What is a good way
to do the same?


*Pranav Prakash*

temet nosce


RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
Hi,

SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes 
the DistributedProcessor in the update chain. 

Thanks,
Markus

 
 
-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Fri 18-May-2012 16:05
 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io
 Subject: Re: SolrCloud deduplication
 
 Hey Markus -
 
 When I ran into a similar issue with another update proc, I created 
 https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things 
 to avoid this. I have not committed this yet though, in favor of waiting for 
 https://issues.apache.org/jira/browse/SOLR-2822
 
 Go vote? :)
 
 On May 18, 2012, at 7:49 AM, Markus Jelsma wrote:
 
  Hi,
  
  Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is 
  not 
  functional anymore. The problem is that documents are passed multiple times 
  through the URP and the digest field is added as if it is an multi valued 
  field. 
  If the field is not multi valued you'll get this typical error. Changing 
  the 
  order or URP's in the chain does not solve the problem.
  
  Any hints on how to resolve the issue? Is this a problem in the 
  SignatureUpdateRequestProcessor and does it need to be updated to work with 
  SolrCloud? 
  
  Thanks,
  Markus
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 


RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
Hi again,

It seemed to work fine but in the end duplicates are not overwritten. We first 
run the SignatureProcessor and then the DistributedProcessor. If we do it the 
other way around the digest field receives multiple values and throws errors. 
Is there anything else we can do or another patch to try?

Thanks
Markus
 
 
-Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Mon 21-May-2012 15:58
 To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com
 Subject: RE: SolrCloud deduplication
 
 Hi,
 
 SOLR-2822 seems to work just fine as long as the SignatureProcessor precedes 
 the DistributedProcessor in the update chain. 
 
 Thanks,
 Markus
 
  
  
 -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Fri 18-May-2012 16:05
  To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io
  Subject: Re: SolrCloud deduplication
  
  Hey Markus -
  
  When I ran into a similar issue with another update proc, I created 
  https://issues.apache.org/jira/browse/SOLR-3215 so that I could order 
  things to avoid this. I have not committed this yet though, in favor of 
  waiting for https://issues.apache.org/jira/browse/SOLR-2822
  
  Go vote? :)
  
  On May 18, 2012, at 7:49 AM, Markus Jelsma wrote:
  
   Hi,
   
   Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is 
   not 
   functional anymore. The problem is that documents are passed multiple 
   times 
   through the URP and the digest field is added as if it is an multi valued 
   field. 
   If the field is not multi valued you'll get this typical error. Changing 
   the 
   order or URP's in the chain does not solve the problem.
   
   Any hints on how to resolve the issue? Is this a problem in the 
   SignatureUpdateRequestProcessor and does it need to be updated to work 
   with 
   SolrCloud? 
   
   Thanks,
   Markus
  
  - Mark Miller
  lucidimagination.com
  
  
  
  
  
  
  
  
  
  
  
  
 


RE: SolrCloud deduplication

2012-05-21 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-3473

-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Mon 21-May-2012 18:11
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud deduplication
 
 Looking again at the SignatureUpdateProcessor code, I think that indeed this 
 won't currently work with distrib updates. Could you file a JIRA issue for 
 that? The problem is that we convert update commands into solr documents - 
 and that can cause a loss of info if an update proc modifies the update 
 command.
 
 I think the reason that you see a multiple values error when you try the 
 other order is because of the lack of a document clone (the other issue I 
 mentioned a few emails back). Addressing that won't solve your issue though - 
 we have to come up with a way to propagate the currently lost info on the 
 update command.
 
 - Mark
 
 On May 21, 2012, at 10:39 AM, Markus Jelsma wrote:
 
  Hi again,
  
  It seemed to work fine but in the end duplicates are not overwritten. We 
  first run the SignatureProcessor and then the DistributedProcessor. If we 
  do it the other way around the digest field receives multiple values and 
  throws errors. Is there anything else we can do or another patch to try?
  
  Thanks
  Markus
  
  
  -Original message-
  From:Markus Jelsma markus.jel...@openindex.io
  Sent: Mon 21-May-2012 15:58
  To: solr-user@lucene.apache.org; Mark Miller markrmil...@gmail.com
  Subject: RE: SolrCloud deduplication
  
  Hi,
  
  SOLR-2822 seems to work just fine as long as the SignatureProcessor 
  precedes the DistributedProcessor in the update chain. 
  
  Thanks,
  Markus
  
  
  
  -Original message-
  From:Mark Miller markrmil...@gmail.com
  Sent: Fri 18-May-2012 16:05
  To: solr-user@lucene.apache.org; Markus Jelsma 
  markus.jel...@openindex.io
  Subject: Re: SolrCloud deduplication
  
  Hey Markus -
  
  When I ran into a similar issue with another update proc, I created 
  https://issues.apache.org/jira/browse/SOLR-3215 so that I could order 
  things to avoid this. I have not committed this yet though, in favor of 
  waiting for https://issues.apache.org/jira/browse/SOLR-2822
  
  Go vote? :)
  
  On May 18, 2012, at 7:49 AM, Markus Jelsma wrote:
  
  Hi,
  
  Deduplication on SolrCloud through the SignatureUpdateRequestProcessor 
  is not 
  functional anymore. The problem is that documents are passed multiple 
  times 
  through the URP and the digest field is added as if it is an multi 
  valued field. 
  If the field is not multi valued you'll get this typical error. Changing 
  the 
  order or URP's in the chain does not solve the problem.
  
  Any hints on how to resolve the issue? Is this a problem in the 
  SignatureUpdateRequestProcessor and does it need to be updated to work 
  with 
  SolrCloud? 
  
  Thanks,
  Markus
  
  - Mark Miller
  lucidimagination.com
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 


SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
Hi,

Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not 
functional anymore. The problem is that documents are passed multiple times 
through the URP and the digest field is added as if it were a multi-valued 
field. 
If the field is not multi-valued you'll get the typical error. Changing the 
order of URPs in the chain does not solve the problem.

Any hints on how to resolve the issue? Is this a problem in the 
SignatureUpdateRequestProcessor and does it need to be updated to work with 
SolrCloud? 

Thanks,
Markus


Re: SolrCloud deduplication

2012-05-18 Thread Mark Miller
Hey Markus -

When I ran into a similar issue with another update proc, I created 
https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things to 
avoid this. I have not committed this yet though, in favor of waiting for 
https://issues.apache.org/jira/browse/SOLR-2822

Go vote? :)

On May 18, 2012, at 7:49 AM, Markus Jelsma wrote:

 Hi,
 
 Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is not 
 functional anymore. The problem is that documents are passed multiple times 
 through the URP and the digest field is added as if it is an multi valued 
 field. 
 If the field is not multi valued you'll get this typical error. Changing the 
 order or URP's in the chain does not solve the problem.
 
 Any hints on how to resolve the issue? Is this a problem in the 
 SignatureUpdateRequestProcessor and does it need to be updated to work with 
 SolrCloud? 
 
 Thanks,
 Markus

- Mark Miller
lucidimagination.com













RE: SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
Hi,

Interesting! I'm watching the issues and will test as soon as they are 
committed.

Thanks!

 
 
-Original message-
 From:Mark Miller markrmil...@gmail.com
 Sent: Fri 18-May-2012 16:05
 To: solr-user@lucene.apache.org; Markus Jelsma markus.jel...@openindex.io
 Subject: Re: SolrCloud deduplication
 
 Hey Markus -
 
 When I ran into a similar issue with another update proc, I created 
 https://issues.apache.org/jira/browse/SOLR-3215 so that I could order things 
 to avoid this. I have not committed this yet though, in favor of waiting for 
 https://issues.apache.org/jira/browse/SOLR-2822
 
 Go vote? :)
 
 On May 18, 2012, at 7:49 AM, Markus Jelsma wrote:
 
  Hi,
  
  Deduplication on SolrCloud through the SignatureUpdateRequestProcessor is 
  not 
  functional anymore. The problem is that documents are passed multiple times 
  through the URP and the digest field is added as if it is an multi valued 
  field. 
  If the field is not multi valued you'll get this typical error. Changing 
  the 
  order or URP's in the chain does not solve the problem.
  
  Any hints on how to resolve the issue? Is this a problem in the 
  SignatureUpdateRequestProcessor and does it need to be updated to work with 
  SolrCloud? 
  
  Thanks,
  Markus
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 


RE: SolrCloud deduplication

2012-05-18 Thread Chris Hostetter

: Interesting! I'm watching the issues and will test as soon as they are 
committed.

FWIW: it's a chicken-and-egg problem -- if you could test out the patch in 
SOLR-2822 with your real-world use case / configs, and comment on its 
effectiveness, that would go a long way towards my confidence in it.


-Hoss


RE: SolrCloud deduplication

2012-05-18 Thread Markus Jelsma
you're right. I'll test the patch as soon as possible.
Thanks!

 
 
-Original message-
 From:Chris Hostetter hossman_luc...@fucit.org
 Sent: Fri 18-May-2012 18:20
 To: solr-user@lucene.apache.org
 Subject: RE: SolrCloud deduplication
 
 
 : Interesting! I'm watching the issues and will test as soon as they are 
 committed.
 
 FWIW: it's a chicken and egg problem -- if you could test out the patch in 
 SOLR-2822 with your real world use case / configs, and comment on it's 
 effectiveness, that would go a long way towards my confidence in it.
 
 
 -Hoss
 


Re: null pointer error with solr deduplication

2012-04-23 Thread Mark Miller
A better error would be nicer.

In the past, when I have had docs with the same id on multiple shards, I
never saw an NPE problem. A lot has changed since then though. I guess, to
me, checking if the id is stored sticks out a bit more. Roughly based on
the stacktrace, it looks to me like it's not finding an id value and that
is causing the NPE.

If it's a legit problem we should probably make a JIRA issue about
improving the error message you end up getting.

-- 
- Mark

http://www.lucidimagination.com

On Sat, Apr 21, 2012 at 5:21 AM, Alexander Aristov 
alexander.aris...@gmail.com wrote:

 Hi

 I might be wrong but it's your responsibility to put unique doc IDs across
 shards.

 read this page

 http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

 particualry

   - Documents must have a unique key and the unique key must be stored
   (stored=true in schema.xml)
   -

   *The unique key field must be unique across all shards.* If docs with
   duplicate unique keys are encountered, Solr will make an attempt to
 return
   valid results, but the behavior may be non-deterministic.

 So solr bahaves as it should :) _unexpectidly_

 But I agree in that sence that there must be no error especially such as
 NPE.

 Best Regards
 Alexander Aristov


 On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

  Hello,
 
  I have been trying out deduplication in solr by following:
  http://wiki.apache.org/solr/Deduplication. I have defined a signature
  field
  to hold the values of the signature created based on few other fields in
 a
  document and the idea seems to work like a charm in a single solr
 instance.
  But, when I have multiple cores and try to do a distributed search (
 
 
 Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id
  )
  I get the error pasted below. While normal search (with just q) works
 fine,
  the facet/stats queries seem to be the culprit. The doc_id contains
  duplicate ids since I'm testing the same set of documents indexed in both
  the cores(dedupe, dedupe2). Any insights would be highly appreciated.
 
  Thanks
 
 
 
  20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
  SEVERE: java.lang.NullPointerException
  at
 
 
 org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
  at
 
 
 org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
  at
 
 
 org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
  at
 
 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
  at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at
 
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  at
 
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
  at
 
 
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
  at
 
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
  at
 
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
  at
  org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
  at
 
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
  at
 
 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
  at
 
 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
  at
 
 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 



Re: null pointer error with solr deduplication

2012-04-23 Thread Peter Markey
Thanks for the response. Yes, I agree that I have to ensure the uniqueness of
doc ids, but our requirement is such that we need to send the documents to Solr
as they are; I know that Solr discards duplicate documents, and it does not
work well when we manually create the unique id. I just wanted to report the
error, since in this scenario (I guess the components for deduplication are
pretty new) it would probably help the devs make the behavior around duplicate
documents more deterministic.

On Sat, Apr 21, 2012 at 2:21 AM, Alexander Aristov 
alexander.aris...@gmail.com wrote:

 Hi

 I might be wrong but it's your responsibility to put unique doc IDs across
 shards.

 read this page

 http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

 particualry

   - Documents must have a unique key and the unique key must be stored
   (stored=true in schema.xml)
   -

   *The unique key field must be unique across all shards.* If docs with
   duplicate unique keys are encountered, Solr will make an attempt to
 return
   valid results, but the behavior may be non-deterministic.

 So solr bahaves as it should :) _unexpectidly_

 But I agree in that sence that there must be no error especially such as
 NPE.

 Best Regards
 Alexander Aristov


 On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

  Hello,
 
  I have been trying out deduplication in solr by following:
  http://wiki.apache.org/solr/Deduplication. I have defined a signature
  field
  to hold the values of the signature created based on few other fields in
 a
  document and the idea seems to work like a charm in a single solr
 instance.
  But, when I have multiple cores and try to do a distributed search (
 
 
 Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id
  )
  I get the error pasted below. While normal search (with just q) works
 fine,
  the facet/stats queries seem to be the culprit. The doc_id contains
  duplicate ids since I'm testing the same set of documents indexed in both
  the cores(dedupe, dedupe2). Any insights would be highly appreciated.
 
  Thanks
 
 
 
  20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
  SEVERE: java.lang.NullPointerException
  at
 
 
 org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
  at
 
 
 org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
  at
 
 
 org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
  at
 
 
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
  at
 
 
 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at
 
 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at
 
 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
  at
 
 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
  at
 
 
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
  at
 
 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
  at
 
 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
  at
  org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
  at
 
 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at
 
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
  at
 
 
 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
  at
 
 
 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
  at
 
 
 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
 



Re: null pointer error with solr deduplication

2012-04-21 Thread Alexander Aristov
Hi

I might be wrong but it's your responsibility to put unique doc IDs across
shards.

read this page
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

particularly:

   - Documents must have a unique key and the unique key must be stored
     (stored="true" in schema.xml).
   - *The unique key field must be unique across all shards.* If docs with
     duplicate unique keys are encountered, Solr will make an attempt to return
     valid results, but the behavior may be non-deterministic.

So Solr behaves as it should :) _unexpectedly_

But I agree in the sense that there should be no error, especially not an
NPE.

Best Regards
Alexander Aristov


On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote:

 Hello,

 I have been trying out deduplication in solr by following:
 http://wiki.apache.org/solr/Deduplication. I have defined a signature
 field
 to hold the values of the signature created based on few other fields in a
 document and the idea seems to work like a charm in a single solr instance.
 But, when I have multiple cores and try to do a distributed search (

 Http://localhost:8080/solr/core0/select?q=*shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2facet=truefacet.field=doc_id
 )
 I get the error pasted below. While normal search (with just q) works fine,
 the facet/stats queries seem to be the culprit. The doc_id contains
 duplicate ids since I'm testing the same set of documents indexed in both
 the cores(dedupe, dedupe2). Any insights would be highly appreciated.

 Thanks



 20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
 SEVERE: java.lang.NullPointerException
 at

 org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
 at

 org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
 at

 org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
 at

 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
 at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
 at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
 at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
 at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
 at

 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
 at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
 at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
 at
 org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
 at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
 at

 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
 at

 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
 at

 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)



null pointer error with solr deduplication

2012-04-20 Thread Peter Markey
Hello,

I have been trying out deduplication in Solr by following
http://wiki.apache.org/solr/Deduplication. I have defined a signature field
to hold the signature created from a few other fields in a document, and the
idea seems to work like a charm in a single Solr instance.
But when I have multiple cores and try to do a distributed search (
http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id)
I get the error pasted below. While normal search (with just q) works fine,
the facet/stats queries seem to be the culprit. The doc_id contains
duplicate ids since I'm testing the same set of documents indexed in both
cores (dedupe, dedupe2). Any insights would be highly appreciated.

Thanks



20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
    at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)


Re: Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-16 Thread Chris Hostetter

: I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
: blog articles from different sources, with slight changes (author name,
: etc..)).
: But they have differences.
: *Now i like to see 1 doc in my result set and the other 4 should be marked
: as similar.*

Do you actually want all 1000 docs in your index, or do you want to prevent 
4 of the 5 copies of the doc from being indexed?

Either way, if the TextProfileSignature is doing a good job of 
identifying the 5 similar docs, then use that at index time.

If you want to keep 4/5 out of the index, then use the Deduplication 
features to prevent the duplicates from being indexed and you're done.  

If you want all docs in the index, then you have to decide how you want to 
mark docs as similar ... do you want to only have one of those docs 
appear in all of your results, or do you want all of them in the results 
but with an indication that there are other similar docs?  If the former: 
then take a look at Grouping and group on your signature field.  If the 
latter, use the MLT component to find similar docs based on the signature 
field (ie: mlt.fl=signature_t)

https://wiki.apache.org/solr/FieldCollapsing
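
As a rough sketch of the two query-time options (hostname, request handler
and the signature_t field name are just the examples used in this thread --
adjust to your own schema):

http://localhost:8983/solr/select?q=blog+post&group=true&group.field=signature_t&group.limit=1

http://localhost:8983/solr/select?q=id:12345&mlt=true&mlt.fl=signature_t&mlt.count=5

The first collapses each signature group down to one document; the second asks
the MoreLikeThis component for other documents that look similar on the
signature field.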

-Hoss


Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-07 Thread Vadim Kisselmann
Hello folks,

i have questions about MLT and Deduplication and what would be the best
choice in my case.

Case:

I index 1000 docs, 5 of them are 95% the same (for example: copy pasted
blog articles from different sources, with slight changes (author name,
etc..)).
But they have differences.
*Now i like to see 1 doc in my result set and the other 4 should be marked
as similar.*

With *MLT*:
  <str name="mlt.fl">text</str>
  <int name="mlt.minwl">5</int>
  <int name="mlt.maxwl">50</int>
  <int name="mlt.maxqt">3</int>
  <int name="mlt.maxntp">5000</int>
  <bool name="mlt.boost">true</bool>
  <str name="mlt.qf">text</str>
</lst>

With this config I get about 500 similar docs for this one doc, which is
unfortunately far too many.


*Deduplication*:
I index this docs now with an signature and i'm using TextProfileSignature.

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?

I only want to see the 5 similar docs, nothing else.
Which of these two cases is relevant to me? Can I tune the MLT for my
requirement? Or should I use Dedupe?

Thanks and Regards
Vadim


Re: A good signature class for deduplication

2011-09-01 Thread Chris Hostetter

: I want to deduplicate documents from search results. What should be the
: parameters on which I should decide an efficient SignatureClass? Also, what
: are the SignaureClasses available?

the signature classes available are the ones mentioned on the wiki...

https://wiki.apache.org/solr/Deduplication

...which one you should choose, and which fields you feed it depend 
entirely on your goal -- if you want to deduplicate anytime both the 
user_fname and user_lname fields are exactly the same, then use those 
fields with either the MD5Signature  or the Lookup3Signature -- (lookup3 
is faster, but some people want MD5 because they want to use the computed 
MD5 for other things)

if you want to detect when some much longer body field containing a lot 
of full text is *nearly* identical, then you should consider the 
TextProfileSignature -- how exactly it works and how you tune it I 
don't know off the top of my head.
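
For the exact-match case, a minimal chain configuration looks roughly like
this (the field names are just the user_fname/user_lname example from above,
and the signature field has to exist in schema.xml as an indexed string field):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">user_fname,user_lname</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>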



-Hoss


Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Pranav Prakash
Solr 3.3 has a feature, Grouping. Is it practically the same as deduplication?

Here is my use case for duplicates removal -

We have many documents with similar (up to 99%) content. Upon some search
queries, almost all of them come up in the first page of results. Of all these
documents, essentially one is the original and the others are duplicates. We are
able to identify the original on the basis of a number of factors - who
uploaded it, when, how many viral shares. It is also possible that the
duplicates are uploaded earlier (and hence exist in the search index) while the
original is uploaded later (and gets added to the index later).

AFAIK, Deduplication targets index time. Is there a means by which I can specify
the original that should be returned and the duplicates that should be kept
from coming up?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Marc Sturlese
Deduplication uses the Lucene indexWriter.updateDocument call with the signature
term. I don't think it's possible, as a default feature, to choose which
document to index; the original should always be the last one to be indexed.
/IndexWriter.updateDocument
Updates a document by first deleting the document(s) containing term and
then adding the new document. The delete and then add are atomic as seen by
a reader on the same index (flush may happen only after the add)./

With grouping you have all your documents indexed so it gives you more
flexibility
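
A rough sketch of that flexibility, assuming a stored signature field plus
hypothetical viral_shares and upload_date fields that encode which copy counts
as the original (parameter availability depends on your Solr version):

http://localhost:8983/solr/select?q=your+query&group=true&group.field=signature&group.limit=1&group.sort=viral_shares+desc,upload_date+asc

group.sort controls which document represents each group, so the original can
win even if a duplicate was indexed first.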

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-3-Grouping-vs-DeDuplication-and-Deduplication-Use-Case-tp3294711p3295023.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to combine Deduplication and Elevation

2011-05-02 Thread Chris Hostetter

: Hi I have a question. How to combine the Deduplication and Elevation
: implementations in Solr. Currently , I managed to implement either one only.

can you elaborate a bit more on what exactly you've tried and what problem 
you are facing?

the SignatureUpdateProcessorFactory (which is used for Deduplication) and 
the QueryElevation component should work just fine together -- in fact: 
one is used at index time and the other at query time, so there shouldn't 
be any interaction at all...

http://wiki.apache.org/solr/Deduplication
http://wiki.apache.org/solr/QueryElevationComponent
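
For reference, the query-time half -- registering the elevator component next
to whatever dedupe chain is already wired into the update handler -- looks
roughly like this (following the QueryElevationComponent wiki page; the handler
name is arbitrary):

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler">
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>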

-Hoss


How to combine Deduplication and Elevation

2011-04-15 Thread shamex
Hi I have a question. How to combine the Deduplication and Elevation
implementations in Solr. Currently , I managed to implement either one only.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-combine-Deduplication-and-Elevation-tp2819621p2819621.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Deduplication questions

2011-04-11 Thread Chris Hostetter

: Q1. Is is possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.

: Q2. Method calculate() is using concatenated fields from str
: name=fieldsname,features,cat/str
: Is there any mechanism I could build  field dependant signatures?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (that have sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is  idea to make two UpdadeProcessors and chain them OK? (Is ugly, but
: would work)

I don't know whether what you describe is really intended usage or not, but it 
should work
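
A sketch of the field-dependent idea as a single Signature impl rather than two
chained processors, based on the Signature API quoted in the question further
down this thread (init/calculate -- newer Solr versions use a different API);
the class name, the fixed-width OWNER prefix and the ownerLength parameter are
all made up for illustration:

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.update.processor.Signature;
import org.apache.solr.update.processor.TextProfileSignature;

public class OwnerScopedSignature extends Signature {

  private final TextProfileSignature fuzzy = new TextProfileSignature();
  private int ownerLength = 16;  // hypothetical fixed width of the OWNER value

  @Override
  public void init(SolrParams params) {
    fuzzy.init(params);
    ownerLength = params.getInt("ownerLength", ownerLength);
  }

  @Override
  public String calculate(String content) {
    // "fields" is configured as OWNER,TEXT, so the owner value is the prefix
    int cut = Math.min(ownerLength, content.length());
    String owner = content.substring(0, cut);
    String text = content.substring(cut);
    // exact owner part + fuzzy hash of the text, so fuzzy duplicates only
    // collapse within a single owner
    return owner + "/" + fuzzy.calculate(text);
  }
}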


-Hoss


Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-04 Thread eks dev
Thanks Hoss,

Externalizing this part is exactly the path we are exploring now, not
only for this reason.

We already started testing Hadoop SequenceFile for write ahead log for
updates/deletes.
SequenceFile supports append now (simply great!). It was a pain to
have to add Hadoop into the mix for mortal collection
sizes (200 Mio), but on the other side, having Hadoop around offers
huge flexibility.
Write ahead log catches update commands (all solr slaves, fronting
clients accept updates but only to forward them to WAL). Solr master
is trying to catch up with update stream indexing in async fashion,
and finally solr slaves are chasing master index with standard solr
replication.
Overnight we run simple map reduce jobs to consolidate, normalize and
sort update stream and reindex at the end.
Deduplication and collection sorting is for us only an optimization,
if done reasonably often, like once per day/week, but if we do not
do it, it doubles HW resources.

Imo, native WAL support in Solr would definitely be a nice thing to
have (for HA, update scalability...). What is charming with a WAL is that
updates never wait/disappear; if there is too much traffic, we only have
slightly higher update latency, but updates definitely get processed.
Some basic primitives on the WAL (consolidation, replaying the update stream
on Solr etc...) should be supported in this case, a sort of smallish
subset of Hadoop features for Solr clusters, but nothing oversized.

Cheers,
eks









On Sun, Apr 3, 2011 at 1:05 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : Is it possible in solr to have multivalued id? Or I need to make my
 : own mv_ID for this? Any ideas how to achieve this efficiently?

 This isn't something the SignatureUpdateProcessor is going to be able to
 hel pyou with -- it does the deduplication be changing hte low level
 update (implemented as a delete then add) so that the key used to delete
 the older documents is based on the signature field instead of the id
 field.

 in order to do what you are describing, you would need to query the index
 for matching signatures, then add the resulting ids to your document
 before doing that update

 You could posibly do this in a custom UpdateProcessor, but you'd have to
 do something tricky to ensure you didn't overlook docs that had been addd
 but not yet committed when checking for dups.

 I don't have a good suggestion for how to do this internally in Slr -- it
 seems like the type of bulk processing logic that would be better suited
 for an external process before you ever start indexing (much like link
 analysis for back refrences)

 -Hoss



Re: Question about http://wiki.apache.org/solr/Deduplication

2011-04-02 Thread Chris Hostetter

: Is it possible in solr to have multivalued id? Or I need to make my
: own mv_ID for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to 
help you with -- it does the deduplication by changing the low level 
update (implemented as a delete then add) so that the key used to delete 
the older documents is based on the signature field instead of the id 
field.

in order to do what you are describing, you would need to query the index 
for matching signatures, then add the resulting ids to your document 
before doing that update

You could possibly do this in a custom UpdateProcessor, but you'd have to 
do something tricky to ensure you didn't overlook docs that had been added 
but not yet committed when checking for dups.

I don't have a good suggestion for how to do this internally in Solr -- it 
seems like the type of bulk processing logic that would be better suited 
for an external process before you ever start indexing (much like link 
analysis for back references)

-Hoss


Deduplication questions

2011-03-25 Thread eks dev
Q1. Is is possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) {  }
  public abstract String calculate(String content);
}


Q2. Method calculate() is using concatenated fields from
<str name="fields">name,features,cat</str>
Is there any mechanism I could build  field dependant signatures?

Use case for this: I have two fields:
OWNER , TEXT
I need to disable *fuzzy* duplicates for one owner, one clean way
would be to make prefixed signature OWNER/FUZZY_SIGNATURE

Is  idea to make two UpdadeProcessors and chain them OK? (Is ugly, but
would work)

  <updateRequestProcessorChain name="signature_hard">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">exact_signature</str>
      <str name="fields">OWNER</str>
      <str name="signatureClass">ExactSignature</str>
    </processor>
  </updateRequestProcessorChain>

hard_signature should be a field that is neither stored nor indexed

  <updateRequestProcessorChain name="fuzzy_and_mix">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">mixed_signature</str>
      <str name="fields">exact_signature,TEXT</str>
      <str name="signatureClass">MixedSignature</str>
    </processor>
  </updateRequestProcessorChain>

  <field name="hard_signature" type="string" stored="false" indexed="false" multiValued="false" />
  <field name="mixed_signature" type="string" stored="true" indexed="true" multiValued="false" />

Assuming I know how long my exact_signature is, I could calculate the
fuzzy part and mix it in properly.

Possible, better ideas?

Thanks,
eks


Question about http://wiki.apache.org/solr/Deduplication

2011-03-24 Thread eks dev
Hi,
The use case I am trying to figure out is about preserving IDs without
re-indexing on duplicates, instead adding the new ID to a list of
document id aliases.

Example:
Input collection:
id:1, text:dummy text 1, signature:A
id:2, text:dummy text 1, signature:A

I add the first document in empty index, text is going to be indexed,
ID is going to be 1, so far so good

Now the question, if I add second document with id == 2, instead of
deleting/indexing this new document, I would like to store id == 2 in
multivalued Field id

At the end, I would have one document less indexed and both ID are
going to be searchable (and stored as well)...

Is it possible in solr to have multivalued id? Or I need to make my
own mv_ID for this? Any ideas how to achieve this efficiently?

My target is not to add new documents if signature matches, but to
have IDs indexed and stored?

Thanks,
eks


SOLR deduplication

2011-01-26 Thread Jason Brown
Hi - I have the SOLR deduplication configured and working well.

Is there any way I can tell which documents have not been added to the index as 
a result of the deduplication rejecting subsequent identical documents?

Many Thanks

Jason Brown.



Re: SOLR deduplication

2011-01-26 Thread Markus Jelsma
Not right now:
https://issues.apache.org/jira/browse/SOLR-1909

 Hi - I have the SOLR deduplication configured and working well.
 
 Is there any way I can tell which documents have been not added to the
 index as a result of the deduplication rejecting subsequent identical
 documents?
 
 Many Thanks
 
 Jason Brown.
 


Re: Is deduplication possible during Tika extract?

2011-01-17 Thread Markus Jelsma
In my opinion it should work for every update handler. If you're really sure 
your configuration is fine and it still doesn't work, you might have to file an 
issue.

Your configuration looks alright but don't forget you've configured 
overwriteDupes=false!
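
A quick way to sanity-check the chain is to post one file straight at the
handler and then look at the signature field of the resulting document (port,
literal.id and the file name are just examples; the dedupe chain is selected
through the update.processor default shown in the config below):

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F 'myfile=@test.pdf'
curl 'http://localhost:8983/solr/select?q=id:doc1&fl=id,signature'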

 Hello,
 
 here is an excerpt of my solrconfig.xml:
 
 <requestHandler name="/update/extract"
                 class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                 startup="lazy">
   <lst name="defaults">
     <str name="update.processor">dedupe</str>

     <!-- All the main content goes into text... if you need to return
          the extracted text or do highlighting, use a stored field. -->
     <str name="fmap.content">text</str>
     <str name="lowernames">true</str>
     <str name="uprefix">ignored_</str>

     <!-- capture link hrefs but ignore div attributes -->
     <str name="captureAttr">true</str>
     <str name="fmap.a">links</str>
     <str name="fmap.div">ignored_</str>
   </lst>
 </requestHandler>

 and

 <updateRequestProcessorChain name="dedupe">
   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">signature</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">text</str>
     <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
 
 deduplication works when I use only /update but not when solr does an
 extract with Tika!
 Is deduplication possible during Tika extract?
 
 Thanks in advance,
 Arno


Is deduplication possible during Tika extract?

2011-01-14 Thread arnaud gaudinat

Hello,

here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>

    <!-- All the main content goes into text... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

deduplication works when I use only /update but not when solr does an 
extract with Tika!

Is deduplication possible during Tika extract?

Thanks in advance,
Arno



Solr Deduplication and Field Collpasing

2010-09-28 Thread Nemani, Raj
All,

 

I have set up Nutch to submit the crawl results to a Solr index.  I have
some duplicates in the documents generated by the Nutch crawl.  There is
a field 'digest' that Nutch generates that is the same for those documents
that are duplicates.  While setting up the dedupe processor in the
Solr config file, I have used this 'digest' field in the following
way (see below for config details).  Since my index has documents other
than the ones generated by Nutch I cannot use 'overwriteDupes=true'
because for non-Nutch generated documents the digest field will not be
populated, and I found that Solr deletes every one of those documents
that do not have the digest field populated. Probably because they all
will have the same 'sig' field value generated based on an 'empty'
digest field, forcing Solr to delete everything?

 

In any case, given the scenario I thought I would set
'overwriteDupes=false' and use field collapsing based on the digest or sig
field, but I could not get field collapsing to work.  Based on the wiki
documentation I was adding the query string
group=true&group.field=sig or group=true&group.field=digest to my
overall query in the admin console and I still got the duplicate documents
in the results.  Is there anything special I need to do to get field
collapsing working?  I am running Solr 1.4.

 

All this is because Nutch thinks that (url *is* the unique id for the
nutch document)
http://mysite.mydomain.com/index.html and http://mysite/index.html (the
difference is only in the alias, and for an internal site both are valid)
are different documents depending on how the link is set up.  This is the
reason for me to try deduplication.  I cannot submit the SolrDedup command
from Nutch because non-Nutch generated documents do not have the digest
field populated, and I read on the mailing lists that this will cause the
SolrDedup initiated from Nutch to fail.  This forced me to try
deduplication on the Solr side.

 

Thanks so much in advance for your help.





Here is my configuration:

 

SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />

 

Thanks so much for your help

 



RE: Solr Deduplication and Field Collpasing

2010-09-28 Thread Markus Jelsma
You could create a custom update processor that adds a digest field for newly 
added documents that do not have the digest field themselves. This way, the 
documents that are not added by Nutch get a proper non-empty digest field so 
the deduplication processor won't create the same empty hash and overwrite 
those. Or you could extend 
org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips 
documents with an empty digest field. I'd think the latter would be the 
quickest route but correct me if i'm wrong.
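
A sketch of the first option (package and method names are the current ones and
may differ on Solr 1.4; the field names 'digest' and 'id' match the
configuration in this thread, everything else is illustrative). It would be
registered in the dedupe chain before the SignatureUpdateProcessorFactory:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class FillMissingDigestFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object digest = doc.getFieldValue("digest");
        if (digest == null || digest.toString().trim().isEmpty()) {
          // non-Nutch document: fall back to the unique id so the signature
          // is never computed from an empty digest
          doc.setField("digest", doc.getFieldValue("id"));
        }
        super.processAdd(cmd);
      }
    };
  }
}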

 

Cheers,
 
-Original message-
From: Nemani, Raj raj.nem...@turner.com
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org; 
Subject: Solr Deduplication and Field Collpasing

All,



I have setup Nutch to submit the crawl results to Solr index.  I have
some duplicates in the documents generated by the Nutch crawl.  There is
filed 'digest' that Nutch generates that is same for those documents
that are duplicates.  While setting up the the dedupe processor in the
Solr config file, I have used this 'Digest' field in the following
way(see below for config details).  Since my index has documents other
than the ones generated by Nutch I cannot use 'overwritedupes=true
because for non-Nutch generated documents the digest field will not be
populated and I found that Solr deletes every one of those documents
that do not have the digest filed populated. Probably because they all
will have the same 'sig' filed value generated based on an 'empty'
digest field forcing Solr to delete everything?



In any case, given the scenario I though I would set
'overwritedupes=false' and use field collapsing based on digest or sig
filed but I could not get filed collapsing to work.  Based on the wiki
documentation I was adding the query string
group=truegroup.filed=sig group=truegroup.filed=digest to my
over all query in admin console and I still got the duplicate documents
in the results.  Is there anything special I need to do to get field
collapsing working?  I am running Solr 1.4.



All this is because Nutch thinks that (url *is* the unique id for the
nutch document)

http://mysite.mydomain.com/index.html and http://mysite/index.html (the
difference is only in the alias and for an internal site both are valid)
are different documents depending on how the link is setup.  This is
reason for me to try deduplication.  I cannot submit SolrDedup command
from Nutch because non-Nutch generated documents do not have digest
filed populated and I read on the mailing lists that this will cause the
SolrDedup initiated from Nutch to fail.  This forced me to do try
deduplication on Solr side.



Thanks so much in advance for your help.





Here is my configuration:



SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml

<field name="sig" type="string" stored="true" indexed="true" multiValued="true" />



Thanks so much for your help





Re: Solr Deduplication and Field Collpasing

2010-09-28 Thread Nemani, Raj
I have the digest field already in the schema because the index is shared 
between nutch docs and others.  I do not know if the second approach is the 
quickest in my case.

I can set the digest value to something unique for non-Nutch documents easily (I 
have an id field that I can use to populate the digest field during indexing of 
new non-Nutch documents; I have a custom tool that does the indexing of these 
docs).  But I have more than 3 million documents in the index already that I 
don't want to start over with new indexing again if I don't have to. Is there a 
way I can update the digest field with the value from the corresponding id 
field using Solr? 

Thanks
Raj

- Original Message -
From: Markus Jelsma markus.jel...@buyways.nl
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Tue Sep 28 18:19:17 2010
Subject: RE: Solr Deduplication and Field Collpasing

You could create a custom update processor that adds a digest field for newly 
added documents that do not have the digest field themselves. This way, the 
documents that are not added by Nutch get a proper non-empty digest field so 
the deduplication processor won't create the same empty hash and overwrite 
those. Or you could extend 
org.apache.solr.update.processor.SignatureUpdateProcessorFactory so it skips 
documents with an empty digest field. I'd think the latter would be the 
quickest route but correct me if i'm wrong.

 

Cheers,
 
-Original message-
From: Nemani, Raj raj.nem...@turner.com
Sent: Tue 28-09-2010 23:28
To: solr-user@lucene.apache.org; 
Subject: Solr Deduplication and Field Collpasing

All,



I have setup Nutch to submit the crawl results to Solr index.  I have
some duplicates in the documents generated by the Nutch crawl.  There is
filed 'digest' that Nutch generates that is same for those documents
that are duplicates.  While setting up the the dedupe processor in the
Solr config file, I have used this 'Digest' field in the following
way(see below for config details).  Since my index has documents other
than the ones generated by Nutch I cannot use 'overwritedupes=true
because for non-Nutch generated documents the digest field will not be
populated and I found that Solr deletes every one of those documents
that do not have the digest filed populated. Probably because they all
will have the same 'sig' filed value generated based on an 'empty'
digest field forcing Solr to delete everything?



In any case, given the scenario I though I would set
'overwritedupes=false' and use field collapsing based on digest or sig
filed but I could not get filed collapsing to work.  Based on the wiki
documentation I was adding the query string
group=truegroup.filed=sig group=truegroup.filed=digest to my
over all query in admin console and I still got the duplicate documents
in the results.  Is there anything special I need to do to get field
collapsing working?  I am running Solr 1.4.



All this is because Nutch thinks that (url *is* the unique id for the
nutch document)

http://mysite.mydomain.com/index.html and http://mysite/index.html (the
difference is only in the alias and for an internal site both are valid)
are different documents depending on how the link is setup.  This is
reason for me to try deduplication.  I cannot submit SolrDedup command
from Nutch because non-Nutch generated documents do not have digest
filed populated and I read on the mailing lists that this will cause the
SolrDedup initiated from Nutch to fail.  This forced me to do try
deduplication on Solr side.



Thanks so much in advance for your help.





Here is my configuration:



SolrConfig.xml

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Schema.xml

Re: Deduplication

2010-05-19 Thread Ahmet Arslan

 Basically for some uses cases I would like to show
 duplicates for other I
 wanted them ignored.
 
 If I have overwriteDupes=false and I just create the dedup
 hash how can I
 query for only unique hash values... ie something like a
 SQL group by. 

TermsComponent maybe? 

or faceting? 
q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0

if you append facet.mincount=1 to above url you can see your duplications


  


Re: Deduplication

2010-05-19 Thread Ahmet Arslan
 TermsComponent maybe? 
 
 or faceting?
 q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0
 
 if you append facet.mincount=1 to above url you can
 see your duplications
 

After re-reading your message: sometimes you want to show duplicates, sometimes 
you don't want them. I have never used FieldCollapsing by myself but heard 
about it many times.

http://wiki.apache.org/solr/FieldCollapsing


  


Deduplication

2010-05-18 Thread Blargy

Basically, for some use cases I would like to show duplicates; for others I
want them ignored.

If I have overwriteDupes=false and I just create the dedup hash how can I
query for only unique hash values... ie something like a SQL group by. 

Thanks

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Deduplication-tp828016p828016.html
Sent from the Solr - User mailing list archive at Nabble.com.


Config issue for deduplication

2010-05-13 Thread Markus Fischer
I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. 
I followed:


http://wiki.apache.org/solr/Deduplication

Actually nothing happens. All records are being imported without any 
deduplication.


What am I missing?

Thanks
Markus

I did:

- create a duplicated set of records, only shifted their ID by a fixed 
number


---
solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">dedupeHash</str>
    <str name="fields">reference,issn</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

---
In schema.xml I added the field

<field name="dedupeHash" type="string" stored="true" indexed="true" multiValued="false" />


--

If I look at the created field dedupeHash it seems to be empty...!?


Re: Config issue for deduplication

2010-05-13 Thread Ahmet Arslan
 I am trying to configure automatic
 deduplication for SOLR 1.4 in Vufind. I followed:
 
 http://wiki.apache.org/solr/Deduplication
 
 Actually nothing happens. All records are being imported
 without any deduplication.

Does "being imported" mean you are using the DataImportHandler? If yes, you can
use this to enable DIH with dedupe.

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>
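
With that in place a full import should pick up the dedupe chain automatically;
host and port are just the usual example values:

http://localhost:8983/solr/dataimport?command=full-import&commit=true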


  


Re: Config issue for deduplication

2010-05-13 Thread Markus Fischer
Hmm, I can't find anything about the DataImportHandler in solrconfig.xml for
Vufind.

So I suppose not; the import function does not use this method. Import is
done by a script.


Maybe I do not associate

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

with the correct requestHandler?

I placed it directly after

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

So I kind of have this line twice.

Markus

Ahmet Arslan schrieb:

I am trying to configure automatic
deduplication for SOLR 1.4 in Vufind. I followed:

http://wiki.apache.org/solr/Deduplication

Actually nothing happens. All records are being imported
without any deduplication.


Does being imported means you are using dataimporthandler? If yes you can use 
this to enable DIH with dedupe.

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>


  


RE: Config issue for deduplication

2010-05-13 Thread Markus Jelsma
What's your solrconfig? You get no deduplication if overwriteDupes = false and 
the signature field is a field other than the doc ID (uniqueKey) field.
 
-Original message-
From: Markus Fischer i...@flyingfischer.ch
Sent: Thu 13-05-2010 17:01
To: solr-user@lucene.apache.org; 
Subject: Config issue for deduplication

I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. 
I followed:

http://wiki.apache.org/solr/Deduplication

Actually nothing happens. All records are being imported without any 
deduplication.

What am I missing?

Thanks
Markus

I did:

- create a duplicated set of records, only shifted their ID by a fixed 
number

---
solrconfig.xml

requestHandler name=/update class=solr.XmlUpdateRequestHandler 
 lst name=defaults
     str name=update.processordedupe/str
 /lst
/requestHandler

updateRequestProcessorChain name=dedupe
  processor 
class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory
  bool name=enabledtrue/bool
  bool name=overwriteDupestrue/bool
  str name=signatureFielddedupeHash/str
  str name=fieldsreference,issn/str
  str 
name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str
  /processor
  processor class=solr.LogUpdateProcessorFactory /
  processor class=solr.RunUpdateProcessorFactory /
/updateRequestProcessorChain

---
In schema.xml I added the field

field name=dedupeHash type=string stored=true indexed=true 
multiValued=false /

--

If I look at the created field dedupeHash it seems to be empty...!?


Re: Config issue for deduplication

2010-05-13 Thread Markus Fischer

I use

<bool name="overwriteDupes">true</bool>

and a different field than ID to control duplication. This is about 
bibliographic data coming from different sources with different IDs 
which may have the same content...


I attached solrconfig.xml if you want to take a look.

Thanks a lot!

Markus

Markus Jelsma schrieb:
What's your solrconfig? No deduplication is overwritesDedupes = false and signature field is other than doc ID field (unique) 
 
-Original message-

From: Markus Fischer i...@flyingfischer.ch
Sent: Thu 13-05-2010 17:01
To: solr-user@lucene.apache.org; 
Subject: Config issue for deduplication


I am trying to configure automatic deduplication for SOLR 1.4 in Vufind. 
I followed:


http://wiki.apache.org/solr/Deduplication

Actually nothing happens. All records are being imported without any 
deduplication.


What am I missing?

Thanks
Markus

I did:

- create a duplicated set of records, only shifted their ID by a fixed 
number


---
solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">dedupeHash</str>
    <str name="fields">reference,issn</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

---
In schema.xml I added the field

<field name="dedupeHash" type="string" stored="true" indexed="true" multiValued="false" />


--

If I look at the created field dedupeHash it seems to be empty...!?

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the License); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an AS IS BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<config>
  <!-- Set this to 'false' if you want solr to continue working after it has
       encountered an severe configuration error.  In a production environment,
       you may want solr to keep working even if one handler is mis-configured.

       You may also set this to false using by setting the system property:
         -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:false}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data
       other than the default ./data under the Solr home.
       If replication is in use, this should match the replication configuration. -->
  <dataDir>${solr.solr.home:./solr}/biblio/data</dataDir>

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>

    <mergeFactor>10</mergeFactor>
    <!--
      If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first.
    -->
    <!--maxBufferedDocs>1000</maxBufferedDocs-->
    <!-- Tell Lucene when to flush documents to disk.
         Giving Lucene more memory for indexing means faster indexing at the cost of more RAM

         If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first.
    -->
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>

    <!--
      Expert: Turn on Lucene's auto commit capability.

      TODO: Add recommendations on why you would want to do this.

      NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality
    -->
    <!--luceneAutoCommit>false</luceneAutoCommit-->
    <!--
      Expert:
      The Merge Policy in Lucene controls how merging is handled by Lucene.  The default in 2.3 is the LogByteSizeMergePolicy, previous
      versions used LogDocMergePolicy.

      LogByteSizeMergePolicy chooses segments to merge based on their size.  The Lucene 2.2 default, LogDocMergePolicy chose when
      to merge based on number of documents

      Other implementations of MergePolicy must have a no-argument constructor
    -->
    <!--mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy

Re: [resolved] Config issue for deduplication

2010-05-13 Thread Markus Fischer

Got it with the help of Demian Katz, main developer of Vufind:

The import script of Vufind was bypassing the deduplication parameters
while writing directly to the SOLR index.

By deactivating direct writing to the index and using the standard way
it now works!


Thanks to all who gave input!

Markus

Markus Fischer schrieb:

I use

<bool name="overwriteDupes">true</bool>

and a different field than ID to control duplication. This is about 
bibliographic data coming from different sources with different IDs 
which may have the same content...


I attached solrconfig.xml if you want to take a look.

Thanks a lot!

Markus

Markus Jelsma schrieb:
What's your solrconfig? No deduplication is overwritesDedupes = false 
and signature field is other than doc ID field (unique)  
-Original message-

From: Markus Fischer i...@flyingfischer.ch
Sent: Thu 13-05-2010 17:01
To: solr-user@lucene.apache.org; Subject: Config issue for deduplication

I am trying to configure automatic deduplication for SOLR 1.4 in 
Vufind. I followed:


http://wiki.apache.org/solr/Deduplication

Actually nothing happens. All records are being imported without any 
deduplication.


What am I missing?

Thanks
Markus

I did:

- create a duplicated set of records, only shifted their ID by a fixed 
number


---
solrconfig.xml

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureField">dedupeHash</str>
    <str name="fields">reference,issn</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

---
In schema.xml I added the field

<field name="dedupeHash" type="string" stored="true" indexed="true" multiValued="false" />


--

If I look at the created field dedupeHash it seems to be empty...!?



Re: Solr Cell and Deduplication - Get ID of doc

2010-03-02 Thread Bill Engle
Thanks for the responses.  This is exactly what I had to resort to.  I will
definitely put in a feature request to get the generated ID back from the
extract request.

I am doing this with PHP cURL for extraction and pecl php solr for
querying.  I am then saving the unique id and dupe hash in a MySQL table
which I check against after the doc is indexed in Solr.  If it is a dupe I
delete the Solr record and discard the file.  My problem now is the dupe
hash sometimes comes back NULL from Solr although when I check it through
Solr Admin it is there.  I am working through this now to isolate.

I had to set Solr to ALLOW duplicates because I have to somehow know that
the file is a dupe and then remove the duplicate files on my filesystem.
Based on the extract response I have no way of knowing this if duplicates
are disallowed.

-Bill


On Tue, Mar 2, 2010 at 2:11 AM, Chris Hostetter hossman_luc...@fucit.orgwrote:



 : To quote from the wiki,
...
 That's all true ... but Bill explicitly said he wanted to use
 SignatureUpdateProcessorFactory to generate a uniqueKey from the content
 field post-extraction so he could dedup documents with the same content
 ... his question was how to get that key after adding a doc.

 Using a unique literal.field value will work -- but only as the value of
 a secondary field that he can then query on to get the uniqueKeyField
 value.


 :  : You could create your own unique ID and pass it in with the
 :  : literal.field=value feature.
 : 
 :  By which Lance means you could specify an unique value in a differnet
 :  field from yoru uniqueKey field, and then query on that field:value
 pair
 :  to get the doc after it's been added -- but that query will only work
 :  until some other version of the doc (with some other value) overwrites
 it.
 :  so you'd esentially have to query for the field:value to lookup the
 :  uniqueKey.
 : 
 :  it seems like it should definitely be feasible for the
 :  Update RequestHandlers to return the uniqueKeyField values for all the
 :  added docs (regardless of wether the key was included in the request,
 or
 :  added by an UpdateProcessor -- but i'm not sure how that would fit in
 with
 :  the SolrJ API.
 : 
 :  would you mind opening a feature request in Jira?
 : 
 : 
 : 
 :  -Hoss
 : 
 : 
 :
 :
 :
 : --
 : Lance Norskog
 : goks...@gmail.com
 :



 -Hoss




Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter

: You could create your own unique ID and pass it in with the
: literal.field=value feature.

By which Lance means you could specify a unique value in a different 
field from your uniqueKey field, and then query on that field:value pair 
to get the doc after it's been added -- but that query will only work 
until some other version of the doc (with some other value) overwrites it.  
so you'd essentially have to query for the field:value to lookup the 
uniqueKey.
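
A sketch of that lookup, where tracking_id stands in for whatever secondary
field gets populated via literal.tracking_id at extract time (field name,
value and port are made up):

curl 'http://localhost:8983/solr/select?q=tracking_id:abc123&fl=id&wt=json'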

it seems like it should definitely be feasible for the 
Update RequestHandlers to return the uniqueKeyField values for all the 
added docs (regardless of whether the key was included in the request, or 
added by an UpdateProcessor) -- but I'm not sure how that would fit in with 
the SolrJ API.

would you mind opening a feature request in Jira?



-Hoss



Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Lance Norskog
To quote from the wiki,
http://wiki.apache.org/solr/ExtractingRequestHandler

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true'
-F myfi...@tutorial.html

This runs the extractor on your input file (in this case an HTML
file). It then stores the generated document with the id field (the
uniqueKey declared in schema.xml) set to 'doc1'. This way, you do not
rely on the ExtractingRequestHandler to create a unique key for you.
This command throws away that generated key.

On Mon, Mar 1, 2010 at 4:22 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : You could create your own unique ID and pass it in with the
 : literal.field=value feature.

 By which Lance means you could specify an unique value in a differnet
 field from yoru uniqueKey field, and then query on that field:value pair
 to get the doc after it's been added -- but that query will only work
 until some other version of the doc (with some other value) overwrites it.
 so you'd esentially have to query for the field:value to lookup the
 uniqueKey.

 it seems like it should definitely be feasible for the
 Update RequestHandlers to return the uniqueKeyField values for all the
 added docs (regardless of wether the key was included in the request, or
 added by an UpdateProcessor -- but i'm not sure how that would fit in with
 the SolrJ API.

 would you mind opening a feature request in Jira?



 -Hoss





-- 
Lance Norskog
goks...@gmail.com


Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter


: To quote from the wiki,
...
That's all true ... but Bill explicitly said he wanted to use 
SignatureUpdateProcessorFactory to generate a uniqueKey from the content 
field post-extraction so he could dedup documents with the same content 
... his question was how to get that key after adding a doc.

Using a unique literal.field value will work -- but only as the value of 
a secondary field that he can then query on to get the uniqueKeyField 
value.


:  : You could create your own unique ID and pass it in with the
:  : literal.field=value feature.
: 
:  By which Lance means you could specify an unique value in a differnet
:  field from yoru uniqueKey field, and then query on that field:value pair
:  to get the doc after it's been added -- but that query will only work
:  until some other version of the doc (with some other value) overwrites it.
:  so you'd esentially have to query for the field:value to lookup the
:  uniqueKey.
: 
:  it seems like it should definitely be feasible for the
:  Update RequestHandlers to return the uniqueKeyField values for all the
:  added docs (regardless of wether the key was included in the request, or
:  added by an UpdateProcessor -- but i'm not sure how that would fit in with
:  the SolrJ API.
: 
:  would you mind opening a feature request in Jira?
: 
: 
: 
:  -Hoss
: 
: 
: 
: 
: 
: -- 
: Lance Norskog
: goks...@gmail.com
: 



-Hoss



Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Bill Engle
Any thoughts on this? I would like to get the id back in the request after
indexing.  My initial thoughts were to do a search to get the docid  based
on the attr_stream_name after indexing but now that I reread my message I
mentioned the attr_stream_name (file_name) may be different so that is
unreliable.  My only option is to somehow return the id in the XML
response.  Any guidance is greatly appreciated.

-Bill

On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote:

 Hi -

 New Solr user here.  I am using Solr Cell to index files (PDF, doc, docx,
 txt, htm, etc.) and there is a good chance that a new file will have
 duplicate content but not necessarily the same file name.  To avoid this I
 am using the deduplication feature of Solr.

 <updateRequestProcessorChain name="dedupe">
   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">true</bool>
     <str name="fields">attr_content</str>
     <str name="signatureClass">org.apache.solr.update.processor.</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 How do I get the id value post Solr processing.  Is there someway to
 modify the curl response so that id is returned.  I need this id because I
 would like to rename the file to the id value.  I could probably do a Solr
 search after the fact to get the id field based on the attr_stream_name but
 I would like to do only one request.

 curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true'
 -F myfi...@myfile.pdf

 Thanks,
 Bill



Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Lance Norskog
You could create your own unique ID and pass it in with the
literal.field=value feature.

http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters

On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle billengle...@gmail.com wrote:
 Any thoughts on this? I would like to get the id back in the request after
 indexing.  My initial thoughts were to do a search to get the docid  based
 on the attr_stream_name after indexing but now that I reread my message I
 mentioned the attr_stream_name (file_name) may be different so that is
 unreliable.  My only option is to somehow return the id in the XML
 response.  Any guidance is greatly appreciated.

 -Bill

 On Wed, Feb 24, 2010 at 12:06 PM, Bill Engle billengle...@gmail.com wrote:

 Hi -

 New Solr user here.  I am using Solr Cell to index files (PDF, doc, docx,
 txt, htm, etc.) and there is a good chance that a new file will have
 duplicate content but not necessarily the same file name.  To avoid this I
 am using the deduplication feature of Solr.

 <updateRequestProcessorChain name="dedupe">
   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">true</bool>
     <str name="fields">attr_content</str>
     <str name="signatureClass">org.apache.solr.update.processor.</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

 How do I get the id value post Solr processing.  Is there someway to
 modify the curl response so that id is returned.  I need this id because I
 would like to rename the file to the id value.  I could probably do a Solr
 search after the fact to get the id field based on the attr_stream_name but
 I would like to do only one request.

 curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true'
 -F myfi...@myfile.pdf

 Thanks,
 Bill





-- 
Lance Norskog
goks...@gmail.com


Solr Cell and Deduplication - Get ID of doc

2010-02-24 Thread Bill Engle
Hi -

New Solr user here.  I am using Solr Cell to index files (PDF, doc, docx,
txt, htm, etc.) and there is a good chance that a new file will have
duplicate content but not necessarily the same file name.  To avoid this I
am using the deduplication feature of Solr.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">attr_content</str>
    <str name="signatureClass">org.apache.solr.update.processor.</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How do I get the id value post Solr processing.  Is there someway to
modify the curl response so that id is returned.  I need this id because I
would like to rename the file to the id value.  I could probably do a Solr
search after the fact to get the id field based on the attr_stream_name but
I would like to do only one request.

curl 'http://localhost:8080/solr/update/extract?uprefix=attr_&fmap.content=attr_content&commit=true'
-F myfi...@myfile.pdf

Thanks,
Bill


Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Field collapsing has been used by many in their production
environments. Over the last few months the stability of the patch grew as
quite a few bugs were fixed. The only big feature missing currently is
caching of the collapsing algorithm. I'm currently working on that and
I will put it in a new patch in the coming days. So yes, the
patch is very near to being production ready.

Martijn

2009/11/26 KaktuChakarabati jimmoe...@gmail.com:

 Hey Otis,
 Yep, I realized this myself after playing some with the dedupe feature
 yesterday.
 So it does look like Field collapsing is what I need pretty much.
 Any idea on how close it is to being production-ready?

 Thanks,
 -Chak

 Otis Gospodnetic wrote:

 Hi,

 As far as I know, the point of deduplication in Solr (
 http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
 document before indexing it in order to avoid duplicates in the index in
 the first place.

 What you are describing is closer to field collapsing patch in SOLR-236.

  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



 - Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4


 Hey,
 I've been trying to find some documentation on using this feature in 1.4
 but
 Wiki page is alittle sparse..
 In specific, here's what i'm trying to do:

 I have a field, say 'duplicate_group_id' that i'll populate based on some
 offline documents deduplication process I have.

 All I want is for solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when i search for documents later, all
 documents with same original 'duplicate_group_id' value will be rolled up
 (e.g i'll just get the first one that came back  according to relevancy).

 I enabled the deduplication processor and put it into updater, but i'm
 not
 seeing any difference in returned results (i.e results with same
 duplicate_id are returned separately..)

 is there anything i need to supply in query-time for this to take effect?
 what should be the behaviour? is there any working example of this?

 Anything will be helpful..

 Thanks,
 Chak
 --
 View this message in context:
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Deduplication in 1.4

2009-11-26 Thread Otis Gospodnetic
Hi Martijn,

 
- Original Message 

 From: Martijn v Groningen martijn.is.h...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, November 26, 2009 3:19:40 AM
 Subject: Re: Deduplication in 1.4
 
 Field collapsing has been used by many in their production
 environment. 

Got any pointers to public sites that you know use it?  I know of a high-traffic 
site that used an early version, and it caused performance problems.  Is 
double-tripping still required?

 The last few months the stability of the patch grew as
 quiet some bugs were fixed. The only big feature missing currently is
 caching of the collapsing algorithm. I'm currently working on that and

Is it also fully distributed-search-ready?

 I will put it in a new patch in the coming next days.  So yes the
 patch is very near being production ready.

Thanks,
Otis

 Martijn
 
 2009/11/26 KaktuChakarabati :
 
  Hey Otis,
  Yep, I realized this myself after playing some with the dedupe feature
  yesterday.
  So it does look like Field collapsing is what I need pretty much.
  Any idea on how close it is to being production-ready?
 
  Thanks,
  -Chak
 
  Otis Gospodnetic wrote:
 
  Hi,
 
  As far as I know, the point of deduplication in Solr (
  http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
  document before indexing it in order to avoid duplicates in the index in
  the first place.
 
  What you are describing is closer to field collapsing patch in SOLR-236.
 
   Otis
  --
  Sematext is hiring -- http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
  - Original Message 
  From: KaktuChakarabati 
  To: solr-user@lucene.apache.org
  Sent: Tue, November 24, 2009 5:29:00 PM
  Subject: Deduplication in 1.4
 
 
  Hey,
  I've been trying to find some documentation on using this feature in 1.4
  but
  Wiki page is alittle sparse..
  In specific, here's what i'm trying to do:
 
  I have a field, say 'duplicate_group_id' that i'll populate based on some
  offline documents deduplication process I have.
 
  All I want is for solr to compute a 'duplicate_signature' field based on
  this one at update time, so that when i search for documents later, all
  documents with same original 'duplicate_group_id' value will be rolled up
  (e.g i'll just get the first one that came back  according to relevancy).
 
  I enabled the deduplication processor and put it into updater, but i'm
  not
  seeing any difference in returned results (i.e results with same
  duplicate_id are returned separately..)
 
  is there anything i need to supply in query-time for this to take effect?
  what should be the behaviour? is there any working example of this?
 
  Anything will be helpful..
 
  Thanks,
  Chak
  --
  View this message in context:
  http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 



Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Two sites that use field-collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean by double-tripping? The sites mentioned
do not have performance problems caused by field collapsing.

Field collapsing currently only supports quasi-distributed
field collapsing (as I have described on the Solr wiki). Currently I
don't know of a distributed field-collapsing algorithm that works
properly and does not influence the search time to the point that the
search becomes slow.

Martijn

2009/11/26 Otis Gospodnetic otis_gospodne...@yahoo.com:
 Hi Martijn,


 - Original Message 

 From: Martijn v Groningen martijn.is.h...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Thu, November 26, 2009 3:19:40 AM
 Subject: Re: Deduplication in 1.4

 Field collapsing has been used by many in their production
 environment.

 Got any pointers to public sites you know use it?  I know of a high traffic 
 site that used an early version, and it caused performance problems.  Is 
 double-tripping still required?

 The last few months the stability of the patch grew as
 quiet some bugs were fixed. The only big feature missing currently is
 caching of the collapsing algorithm. I'm currently working on that and

 Is it also full distributed-search-ready?

 I will put it in a new patch in the coming next days.  So yes the
 patch is very near being production ready.

 Thanks,
 Otis

 Martijn

 2009/11/26 KaktuChakarabati :
 
  Hey Otis,
  Yep, I realized this myself after playing some with the dedupe feature
  yesterday.
  So it does look like Field collapsing is what I need pretty much.
  Any idea on how close it is to being production-ready?
 
  Thanks,
  -Chak
 
  Otis Gospodnetic wrote:
 
  Hi,
 
  As far as I know, the point of deduplication in Solr (
  http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
  document before indexing it in order to avoid duplicates in the index in
  the first place.
 
  What you are describing is closer to field collapsing patch in SOLR-236.
 
   Otis
  --
  Sematext is hiring -- http://sematext.com/about/jobs.html?mls
  Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
  - Original Message 
  From: KaktuChakarabati
  To: solr-user@lucene.apache.org
  Sent: Tue, November 24, 2009 5:29:00 PM
  Subject: Deduplication in 1.4
 
 
  Hey,
  I've been trying to find some documentation on using this feature in 1.4
  but
  Wiki page is alittle sparse..
  In specific, here's what i'm trying to do:
 
  I have a field, say 'duplicate_group_id' that i'll populate based on some
  offline documents deduplication process I have.
 
  All I want is for solr to compute a 'duplicate_signature' field based on
  this one at update time, so that when i search for documents later, all
  documents with same original 'duplicate_group_id' value will be rolled up
  (e.g i'll just get the first one that came back  according to relevancy).
 
  I enabled the deduplication processor and put it into updater, but i'm
  not
  seeing any difference in returned results (i.e results with same
  duplicate_id are returned separately..)
 
  is there anything i need to supply in query-time for this to take effect?
  what should be the behaviour? is there any working example of this?
 
  Anything will be helpful..
 
  Thanks,
  Chak
  --
  View this message in context:
  http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 
  --
  View this message in context:
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 




Re: Deduplication in 1.4

2009-11-25 Thread KaktuChakarabati

Hey Otis,
Yep, I realized this myself after playing around with the dedupe feature
yesterday.
So it does look like field collapsing is pretty much what I need.
Any idea how close it is to being production-ready?

Thanks,
-Chak

Otis Gospodnetic wrote:
 
 Hi,
 
 As far as I know, the point of deduplication in Solr (
 http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
 document before indexing it in order to avoid duplicates in the index in
 the first place.
 
 What you are describing is closer to field collapsing patch in SOLR-236.
 
  Otis
 --
 Sematext is hiring -- http://sematext.com/about/jobs.html?mls
 Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
 
 
 
 - Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4
 
 
 Hey,
 I've been trying to find some documentation on using this feature in 1.4
 but
 Wiki page is alittle sparse..
 In specific, here's what i'm trying to do:
 
 I have a field, say 'duplicate_group_id' that i'll populate based on some
 offline documents deduplication process I have.
 
 All I want is for solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when i search for documents later, all
 documents with same original 'duplicate_group_id' value will be rolled up
 (e.g i'll just get the first one that came back  according to relevancy).
 
 I enabled the deduplication processor and put it into updater, but i'm
 not
 seeing any difference in returned results (i.e results with same
 duplicate_id are returned separately..)
 
 is there anything i need to supply in query-time for this to take effect?
 what should be the behaviour? is there any working example of this?
 
 Anything will be helpful..
 
 Thanks,
 Chak
 -- 
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 

-- 
View this message in context: 
http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
Sent from the Solr - User mailing list archive at Nabble.com.



Deduplication in 1.4

2009-11-24 Thread KaktuChakarabati

Hey,
I've been trying to find some documentation on using this feature in 1.4, but
the wiki page is a little sparse.
Specifically, here's what I'm trying to do:

I have a field, say 'duplicate_group_id', that I'll populate based on some
offline document deduplication process I have.

All I want is for Solr to compute a 'duplicate_signature' field based on
this one at update time, so that when I search for documents later, all
documents with the same original 'duplicate_group_id' value will be rolled up
(e.g. I'll just get the first one that comes back according to relevancy).

I enabled the deduplication processor and put it into the update chain, but I'm not
seeing any difference in the returned results (i.e. results with the same
duplicate_id are returned separately).

Is there anything I need to supply at query time for this to take effect?
What should the behaviour be? Is there any working example of this?

Anything will be helpful.

Thanks,
Chak
-- 
View this message in context: 
http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication in 1.4

2009-11-24 Thread Otis Gospodnetic
Hi,

As far as I know, the point of deduplication in Solr ( 
http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document 
before indexing it in order to avoid duplicates in the index in the first place.

What you are describing is closer to the field collapsing patch in SOLR-236.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: KaktuChakarabati jimmoe...@gmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, November 24, 2009 5:29:00 PM
 Subject: Deduplication in 1.4
 
 
 Hey,
 I've been trying to find some documentation on using this feature in 1.4 but
 Wiki page is alittle sparse..
 In specific, here's what i'm trying to do:
 
 I have a field, say 'duplicate_group_id' that i'll populate based on some
 offline documents deduplication process I have.
 
 All I want is for solr to compute a 'duplicate_signature' field based on
 this one at update time, so that when i search for documents later, all
 documents with same original 'duplicate_group_id' value will be rolled up
 (e.g i'll just get the first one that came back  according to relevancy).
 
 I enabled the deduplication processor and put it into updater, but i'm not
 seeing any difference in returned results (i.e results with same
 duplicate_id are returned separately..)
 
 is there anything i need to supply in query-time for this to take effect?
 what should be the behaviour? is there any working example of this?
 
 Anything will be helpful..
 
 Thanks,
 Chak
 -- 
 View this message in context: 
 http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Conditional deduplication

2009-09-30 Thread Michael
If I index a bunch of email documents, is there a way to say "show me all
email documents, but only one per To: email address"
so that if there are a total of 10 distinct To: fields in the corpus, I get
back 10 email documents?

I'm aware of http://wiki.apache.org/solr/Deduplication but I want to retain
the ability to search across all of my email documents most of the time, and
only occasionally search for the distinct ones.

Essentially I want to do a
SELECT DISTINCT to_field FROM documents
where a normal search is a
SELECT * FROM documents

Thanks for any pointers.


Re: Conditional deduplication

2009-09-30 Thread Mauricio Scheffer
See http://wiki.apache.org/solr/FieldCollapsing
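
For reference, a rough sketch of the kind of query this enables; the group parameters below come from the result grouping feature that landed in later Solr releases, so treat this as illustrative rather than something the 1.4-era patch necessarily accepts, and adjust the host, port and field name to your setup:

curl 'http://localhost:8983/solr/select?q=*:*&group=true&group.field=to_field&group.limit=1'

This returns at most one document per distinct to_field value, which is the SELECT DISTINCT behaviour described above.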

On Wed, Sep 30, 2009 at 4:41 PM, Michael solrco...@gmail.com wrote:

 If I index a bunch of email documents, is there a way to say "show me all
 email documents, but only one per To: email address"
 so that if there are a total of 10 distinct To: fields in the corpus, I get
 back 10 email documents?

 I'm aware of http://wiki.apache.org/solr/Deduplication but I want to
 retain
 the ability to search across all of my email documents most of the time,
 and
 only occasionally search for the distinct ones.

 Essentially I want to do a
 SELECT DISTINCT to_field FROM documents
 where a normal search is a
 SELECT * FROM documents

 Thanks for any pointers.



Re: stress tests to DIH and deduplication patch

2009-04-30 Thread Marc Sturlese

I have already run out of memory after a cron job that indexes as many times as
possible during a day.
I will activate GC logging to see what it says...
Thanks!


Shalin Shekhar Mangar wrote:
 
 On Wed, Apr 29, 2009 at 7:44 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 Hey there, I am doing some stress tests indexing with DIH.
 I am indexing a mysql DB with 140 rows aprox. I am using also the
 DeDuplication patch.
 I am using tomcat with JVM limit of -Xms2000M -Xmx2000M
 I have indexed 3 times using full-import command without restarting
 tomcat
 or reloading the core between the indexations.
 I have used jmap and jhat to map heap memory in some moments of the
 indexations.
 Here I show the beginig of the maps (I don't show the lower part of the
 stack because object instance numbers are completely stable in there).
 I have noticed that the number of Term, TermInfo and TermQuery grows
 between
 an indexation and another... is that normal?


 Perhaps you should enable GC logging as well. Also, did you actually run
 out
 of memory or you are interpolating and assuming that it might happen?
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/stress-tests-to-DIH-and-deduplication-patch-tp23295926p23314604.html
Sent from the Solr - User mailing list archive at Nabble.com.



stress tests to DIH and deduplication patch

2009-04-29 Thread Marc Sturlese

Hey there, I am doing some stress tests indexing with DIH.
I am indexing a MySQL DB with 140 rows approx. I am also using the
DeDuplication patch.
I am using Tomcat with a JVM limit of -Xms2000M -Xmx2000M.
I have indexed 3 times using the full-import command without restarting Tomcat
or reloading the core between the indexations.
I have used jmap and jhat to map heap memory at several points during the
indexations.
Here I show the beginning of the maps (I don't show the lower part of the
stack because the object instance counts there are completely stable).
I have noticed that the number of Term, TermInfo and TermQuery instances grows
from one indexation to the next... is that normal?



FIRST TIME I INDEX... WITH A MILLION INDEXED DOCS APPROX... HERE INDEXING
PROCESS IS STILL RUNNING
268290 instances of class org.apache.lucene.index.Term
215943 instances of class org.apache.lucene.index.TermInfo
129649 instances of class
org.apache.lucene.index.FreqProxTermsWriter$PostingList
51537 instances of class org.apache.lucene.search.TermQuery
25457 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1120 instances of class org.apache.lucene.index.FieldInfo
919 instances of class org.apache.catalina.loader.ResourceEntry 


FIRST TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
552522 instances of class org.apache.lucene.index.Term
505835 instances of class org.apache.lucene.index.TermInfo
128937 instances of class
org.apache.lucene.index.FreqProxTermsWriter$PostingList
48645 instances of class org.apache.lucene.search.TermQuery
24065 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1470 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List 


SECOND TIME I INDEX WITH 50 INDEXED DOCS... HERE INDEX PROCESS IS STILL
RUNNING 
264617 instances of class
org.apache.lucene.index.FreqProxTermsWriter$PostingList
262496 instances of class org.apache.lucene.index.Term
116078 instances of class org.apache.lucene.index.TermInfo
53383 instances of class org.apache.lucene.search.TermQuery
42274 instances of class
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
30230 instances of class org.apache.lucene.search.TermQuery$TermWeight
26044 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15115 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
15115 instances of class org.apache.lucene.search.ReqExclScorer
7325 instances of class org.apache.lucene.search.ConjunctionScorer$1
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
1279 instances of class org.apache.lucene.index.FieldInfo
923 instances of class org.apache.catalina.loader.ResourceEntry 


SECOND TIME I INDEX WITH 120 INDEXED DOCS... HERE INDEX PROCESS IS STILL
RUNNING 
574603 instances of class org.apache.lucene.index.Term
423558 instances of class org.apache.lucene.index.TermInfo
141394 instances of class
org.apache.lucene.index.FreqProxTermsWriter$PostingList
106729 instances of class org.apache.lucene.search.TermQuery
54858 instances of class org.apache.lucene.index.BufferedDeletes$Num
25347 instances of class
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
11587 instances of class org.apache.lucene.search.TermQuery$TermWeight
5793 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
5793 instances of class org.apache.lucene.search.ReqExclScorer
2922 instances of class org.apache.lucene.search.ConjunctionScorer$1
2170 instances of class org.apache.lucene.index.FieldInfo
1569 instances of class com.sun.tools.javac.zip.ZipFileIndex$DirectoryEntry
923 instances of class org.apache.catalina.loader.ResourceEntry
858 instances of class com.sun.tools.javac.util.List 

SECOND TIME I INDEX, COMPLETED (1.4 MILLION DOCS INDEXED)
999753 instances of class org.apache.lucene.index.Term
808190 instances of class org.apache.lucene.index.TermInfo
156511 instances of class org.apache.lucene.search.TermQuery
128975 instances of class
org.apache.lucene.index.FreqProxTermsWriter$PostingList
104396 instances of class org.apache.lucene.index.BufferedDeletes$Num
23233 instances of class com.sun.tools.javac.zip.ZipFileIndexEntry
15401 instances of class
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput
14896 instances of class org.apache.lucene.search.TermQuery$TermWeight
7447 instances of class org.apache.lucene.search.BooleanScorer2$Coordinator
7447 instances of class org.apache.lucene.search.ReqExclScorer
3025 instances of class

Re: stress tests to DIH and deduplication patch

2009-04-29 Thread Shalin Shekhar Mangar
On Wed, Apr 29, 2009 at 7:44 PM, Marc Sturlese marc.sturl...@gmail.comwrote:


 Hey there, I am doing some stress tests indexing with DIH.
 I am indexing a mysql DB with 140 rows aprox. I am using also the
 DeDuplication patch.
 I am using tomcat with JVM limit of -Xms2000M -Xmx2000M
 I have indexed 3 times using full-import command without restarting tomcat
 or reloading the core between the indexations.
 I have used jmap and jhat to map heap memory in some moments of the
 indexations.
 Here I show the beginig of the maps (I don't show the lower part of the
 stack because object instance numbers are completely stable in there).
 I have noticed that the number of Term, TermInfo and TermQuery grows
 between
 an indexation and another... is that normal?


Perhaps you should enable GC logging as well. Also, did you actually run out
of memory, or are you extrapolating and assuming that it might happen?

-- 
Regards,
Shalin Shekhar Mangar.
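
For reference, a minimal sketch of enabling GC logging for a Tomcat-hosted Solr of that era; the flags are standard HotSpot options, while the CATALINA_OPTS approach and the log path are illustrative assumptions:

export CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/gc.log"

The resulting gc.log shows whether heap usage keeps climbing across indexations or is simply being reclaimed late by large collections.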


Re: Deduplication patch not working in nightly build

2009-01-10 Thread Grant Ingersoll
I've seen similar errors when large background merges happen while  
looping in a result set.  See http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/




On Jan 9, 2009, at 12:50 PM, Mark Miller wrote:

Your basically writing segments more often now, and somehow avoiding  
a longer merge I think. Also, likely, deduplication is probably  
adding enough extra data to your index to hit a sweet spot where a  
merge is too long. Or something to that effect - MySql is especially  
sensitive to timeouts when doing a select * on a huge db in my  
testing. I didnt understand your answer on the autocommit - I take  
it you are using it? Or no?


All a guess, but it def points to a merge taking a bit long and  
causing a timeout. I think you can relax the MySql timeout settings  
if that is it.


I'd like to get to the bottom of this as well, so any other info you  
can provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the begining (when the error was appearing) i had  
ramBufferSizeMB32/ramBufferSizeMB

and no maxBufferedDocs set

Now I have:
ramBufferSizeMB32/ramBufferSizeMB
maxBufferedDocs50/maxBufferedDocs

I think taht setting maxBufferedDocs to 50 I am forcing more disk  
writting
than I would like... but at least it works fine (but a bit  
slower,opiously).


I keep saying that the most weird thing is that I don't have that  
problem

using solr1.3, just with the nightly...

Even that it's good that it works well now, would be great if  
someone can

give me an explanation why this is happening


Shalin Shekhar Mangar wrote:


On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
marc.sturl...@gmail.comwrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in  
solrconfig.xml. I

can't exactly undersand why but it just worked.
Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do  
the same?





What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got  
closed by

the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and  
what did

you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.




--
Regards,
Shalin Shekhar Mangar.










--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey there,
I have been stuck on this problem for three days now and have no idea how to sort it out.

I am using the nightly from a week ago, MySQL, and this driver and URL:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use the deduplication patch with indexes of 200.000 docs with no problem.
When I try a full-import on a DB of 1.500.000 it stops indexing at doc
number 15.000 approx., showing me the error posted above.
Once I get the exception, I restart Tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import. I have tried:

url=jdbc:mysql://localhost/my_db?autoReconnect=true to sort it out, in case the
connection was closed due to the long time until the next doc was indexed, but
nothing changed... I keep getting this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
I can't imagine why dedupe would have anything to do with this, other
than what was said: it is perhaps taking a bit longer to get each document
indexed, and the db connection times out (maybe a long signature calculation?). Have
you tried changing your MySQL settings to allow for a longer timeout?
(Sorry, I'm not up to date on what you have tried.)


Also, are you using autocommit during the import? If so, you might try 
turning it off for the full import.


- Mark
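
For reference, a rough sketch of relaxing the server-side MySQL timeouts that usually matter for long-running streaming result sets; the variable names are standard MySQL system variables, but the values are arbitrary examples and the same settings can be made permanent in my.cnf:

mysql -u root -p -e "SET GLOBAL net_write_timeout = 3600; SET GLOBAL wait_timeout = 28800;"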

Marc Sturlese wrote:

Hey there,
I am stack in this problem sine 3 days ago and no idea how to sort it.

I am using the nighlty from a week ago, mysql and this driver and url:
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db

I can use deduplication patch with indexs of 200.000 docs and no problem.
When I try a full-import with a db of 1.500.000 it stops indexing at doc
number 15.000 aprox showing me the error posted above.
Once I get the exception, i restart tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import, i have tryed:

url=jdbc:mysql://localhost/my_db?autoReconnect=true to sort it in case the
connection was closed due to long time until next doc was indexed, but
nothing changed... I keep having this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 


java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey there,
I didn't have autoCommit set to true, but I have it sorted! The error stopped
appearing after setting the maxBufferedDocs property in solrconfig.xml. I
can't exactly understand why, but it just worked.
Anyway, maxBufferedDocs is deprecated; would ramBufferSizeMB do the same?

Thanks


Marc Sturlese wrote:
 
 Hey there,
 I was using the Deduplication patch with Solr 1.3 release and everything
 was working perfectly. Now I upgraded to a nigthly build (20th december)
 to be able to use new facet algorithm and other stuff and DeDuplication is
 not working any more. I have followed exactly the same steps to apply the
 patch to the source code. I am geting this error:
 
 WARNING: Error reading data 
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception: 
 
 ** BEGIN NESTED EXCEPTION ** 
 
 java.io.EOFException
 
 STACKTRACE:
 
 java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: Exception while closing result set
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception: 
 
 ** BEGIN NESTED EXCEPTION ** 
 
 java.io.EOFException
 
 STACKTRACE:
 
 java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Shalin Shekhar Mangar
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese marc.sturl...@gmail.comwrote:


 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml. I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?



 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Shalin,

In the beginning (when the error was appearing) I had
<ramBufferSizeMB>32</ramBufferSizeMB>
and no maxBufferedDocs set.

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that by setting maxBufferedDocs to 50 I am forcing more disk writing
than I would like... but at least it works fine now (just a bit slower, obviously).

I keep saying that the weirdest thing is that I don't have this problem
with Solr 1.3, only with the nightly...

Even though it's good that it works well now, it would be great if someone could
give me an explanation of why this is happening.
 


Shalin Shekhar Mangar wrote:
 
 On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error
 stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml. I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


 What I find strange is this line in the exception:
 Last packet sent to the server was 202481 ms ago.
 
 Something took very very long to complete and the connection got closed by
 the time the next row was fetched from the opened resultset.
 
 Just curious, what was the previous value of maxBufferedDocs and what did
 you change it to?
 
 

 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21376235.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
You're basically writing segments more often now, and somehow avoiding a 
longer merge, I think. Also, deduplication is likely adding 
enough extra data to your index to hit a sweet spot where a merge takes too 
long. Or something to that effect: MySQL is especially sensitive to 
timeouts when doing a select * on a huge db in my testing. I didn't 
understand your answer on the autocommit; I take it you are using it? 
Or no?


It's all a guess, but it definitely points to a merge taking a bit too long and 
causing a timeout. I think you can relax the MySQL timeout settings if that is it.


I'd like to get to the bottom of this as well, so any other info you can 
provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the begining (when the error was appearing) i had 
ramBufferSizeMB32/ramBufferSizeMB

and no maxBufferedDocs set

Now I have:
ramBufferSizeMB32/ramBufferSizeMB
maxBufferedDocs50/maxBufferedDocs

I think taht setting maxBufferedDocs to 50 I am forcing more disk writting
than I would like... but at least it works fine (but a bit slower,opiously).

I keep saying that the most weird thing is that I don't have that problem
using solr1.3, just with the nightly...

Even that it's good that it works well now, would be great if someone can
give me an explanation why this is happening
 



Shalin Shekhar Mangar wrote:
  

On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
marc.sturl...@gmail.comwrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly undersand why but it just worked.
Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the same?


  

What I find strange is this line in the exception:
Last packet sent to the server was 202481 ms ago.

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

--
Regards,
Shalin Shekhar Mangar.





  




Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Mark,
Sorry, I was not specific enough; I meant that I have, and have always
had, autoCommit=false.
I will do some more traces and tests. I will post if I have anything new and
important to mention.

Thanks.


Marc Sturlese wrote:
 
 Hey Shalin,
 
 In the begining (when the error was appearing) i had 
 ramBufferSizeMB32/ramBufferSizeMB
 and no maxBufferedDocs set
 
 Now I have:
 ramBufferSizeMB32/ramBufferSizeMB
 maxBufferedDocs50/maxBufferedDocs
 
 I think taht setting maxBufferedDocs to 50 I am forcing more disk writting
 than I would like... but at least it works fine (but a bit
 slower,opiously).
 
 I keep saying that the most weird thing is that I don't have that problem
 using solr1.3, just with the nightly...
 
 Even that it's good that it works well now, would be great if someone can
 give me an explanation why this is happening
  
 
 
 Shalin Shekhar Mangar wrote:
 
 On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
 marc.sturl...@gmail.comwrote:
 

 hey there,
 I hadn't autoCommit set to true but I have it sorted! The error
 stopped
 appearing after setting the property maxBufferedDocs in solrconfig.xml.
 I
 can't exactly undersand why but it just worked.
 Anyway, maxBufferedDocs is deprecaded, would ramBufferSizeMB do the
 same?


 What I find strange is this line in the exception:
 Last packet sent to the server was 202481 ms ago.
 
 Something took very very long to complete and the connection got closed
 by
 the time the next row was fetched from the opened resultset.
 
 Just curious, what was the previous value of maxBufferedDocs and what did
 you change it to?
 
 

 --
 View this message in context:
 http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
 Sent from the Solr - User mailing list archive at Nabble.com.


 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21378069.html
Sent from the Solr - User mailing list archive at Nabble.com.



Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Hey there,
I was using the Deduplication patch with the Solr 1.3 release and everything was
working perfectly. Now I have upgraded to a nightly build (20th December) to be
able to use the new facet algorithm and other stuff, and DeDuplication is not
working any more. I have followed exactly the same steps to apply the patch
to the source code. I am getting this error:

WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


** END NESTED EXCEPTION **
Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
at com.mysql.jdbc.ResultSet.realClose(ResultSet.java:6488)
at com.mysql.jdbc.ResultSet.close(ResultSet.java:736)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.close(JdbcDataSource.java:312

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Thanks, I will have a look at my JdbcDataSource. Anyway, it's weird because
with the 1.3 release I don't have that problem...
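
For reference, a sketch of the kind of driver-level setting Shalin mentions in the quoted reply below, added to the DIH connection URL; the property name is taken from the MySQL Connector/J documentation and should be double-checked against your driver version, and the value is an arbitrary example:

driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost/my_db?autoReconnect=true&netTimeoutForStreamingResults=3600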

Shalin Shekhar Mangar wrote:
 
 Yes, initially I figured that we are accidentally re-using a closed data
 source. But Noble has pinned it right. I guess you can try looking into
 your
 JDBC driver's documentation for a setting which increases the connection
 alive-ness.
 
 On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् 
 noble.p...@gmail.com wrote:
 
 I guess the indexing of a doc is taking too long (may be because of
 the de-dup patch) and the resultset gets closed automaticallly (timed
 out)
 --Noble

 On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Donig this fix I get the same error :(
 
  I am going to try to set up the last nigthly build... let's see if I
 have
  better luck.
 
  The thing is it stop indexing at the doc num 150.000 aprox... and give
 me
  that mysql exception error... Without DeDuplication patch I can index 2
  milion docs without problems...
 
  I am pretty lost with this... :(
 
 
  Shalin Shekhar Mangar wrote:
 
  Yes I meant the 05/01/2008 build. The fix is a one line change
 
  Add the following as the last line of DataConfig.Entity.clearCache()
  dataSrc = null;
 
 
 
  On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
  marc.sturl...@gmail.comwrote:
 
 
  Shalin you mean I should test the 05/01/2008 nighlty? maybe with this
 one
  works? If the fix you did is not really big can u tell me where in
 the
  source is and what is it for? (I have been debuging and tracing a lot
 the
  dataimporthandler source and I I would like to know what the
 imporovement
  is
  about if it is not a problem...)
 
  Thanks!
 
 
  Shalin Shekhar Mangar wrote:
  
   Marc, I've just committed a fix which may have caused the bug. Can
 you
  use
   svn trunk (or the next nightly build) and confirm?
  
   On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
   noble.p...@gmail.com wrote:
  
   looks like a bug w/ DIH with the recent fixes.
   --Noble
  
   On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
  marc.sturl...@gmail.com
   wrote:
   
Hey there,
I was using the Deduplication patch with Solr 1.3 release and
   everything
   was
working perfectly. Now I upgraded to a nigthly build (20th
 december)
  to
   be
able to use new facet algorithm and other stuff and
 DeDuplication
 is
   not
working any more. I have followed exactly the same steps to
 apply
  the
   patch
to the source code. I am geting this error:
   
 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
    at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
    at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
    at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
    at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
    at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
    at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese


Yeah, it looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandler classes... that's
because I thought the problem was not there (but I can't say it for sure...)

I was thinking that the problem has something to do with the
UpdateRequestProcessorChain, but I don't know how this part of the source
works...
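
For reference, a minimal, illustrative sketch of how that chain is wired: every
update processor wraps the next one and sees each document in processAdd before
it reaches the index, which is where a de-dup step would sit. The class and
comments below are made up for illustration and are not the patch's actual code;
only the UpdateRequestProcessor/AddUpdateCommand API is real Solr.

    import java.io.IOException;

    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    // Illustrative pass-through processor: shows the hook point a de-dup
    // processor would use inside an UpdateRequestProcessorChain.
    public class PassThroughUpdateProcessor extends UpdateRequestProcessor {

        public PassThroughUpdateProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            // A de-dup processor would compute a signature from the incoming
            // document here before handing the command down the chain.
            super.processAdd(cmd);
        }
    }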

I am really interested in updating to the nightly build as I think the new facet
algorithm and SolrDeletionPolicy are really great stuff!

Marc, I've just committed a fix for what may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm?
You mean the last nightly build?

Thanks


Noble Paul നോബിള്‍ नोब्ळ् wrote:
 
 looks like a bug w/ DIH with the recent fixes.
 --Noble
 
 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:

 Hey there,
 I was using the Deduplication patch with the Solr 1.3 release and everything was
 working perfectly. Now I upgraded to a nightly build (20th December) to be
 able to use the new facet algorithm and other stuff, and DeDuplication is not
 working any more. I have followed exactly the same steps to apply the patch
 to the source code. I am getting this error:

 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Hey there,
 I was using the Deduplication patch with the Solr 1.3 release and everything was
 working perfectly. Now I upgraded to a nightly build (20th December) to be
 able to use the new facet algorithm and other stuff, and DeDuplication is not
 working any more. I have followed exactly the same steps to apply the patch
 to the source code. I am getting this error:

 WARNING: Error reading data
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


 ** END NESTED EXCEPTION **
 Last packet sent to the server was 202481 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
at
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
at
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
at
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
at
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
at
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
at
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
at
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
at
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
 logError
 WARNING: Exception while closing result set
 com.mysql.jdbc.CommunicationsException: Communications link failure due to
 underlying exception:

 ** BEGIN NESTED EXCEPTION **

 java.io.EOFException

 STACKTRACE:

 java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
at com.mysql.jdbc.ResultSet.realClose(ResultSet.java:6488

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess the indexing of a doc is taking too long (maybe because of
the de-dup patch) and the resultset gets closed automatically (timed out).
--Noble
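
If that is what is happening, the "Last packet sent to the server was 202481 ms
ago" line fits: RowDataDynamic in the trace means Connector/J is streaming the
result set, and MySQL drops a streamed result set whose reader stalls for longer
than net_write_timeout. One common mitigation, sketched below with placeholder
host and credentials (this is not something suggested in the thread), is to raise
that timeout through the driver; in DIH the same URL parameter would go on the
dataSource url attribute.

    import java.sql.Connection;
    import java.sql.DriverManager;

    // Sketch only: netTimeoutForStreamingResults is a standard Connector/J
    // property (value in seconds) that raises net_write_timeout while a
    // streaming result set is open, so a slow consumer is not cut off.
    public class StreamingTimeoutDemo {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/mydb"
                    + "?netTimeoutForStreamingResults=3600";
            try (Connection conn = DriverManager.getConnection(url, "user", "secret")) {
                System.out.println("connected, autocommit=" + conn.getAutoCommit());
            }
        }
    }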

On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 Doing this fix I get the same error :(

 I am going to try to set up the last nightly build... let's see if I have
 better luck.

 The thing is it stops indexing at doc number 150,000 approx... and gives me
 that mysql exception error... Without the DeDuplication patch I can index 2
 million docs without problems...

 I am pretty lost with this... :(


 Shalin Shekhar Mangar wrote:

 Yes, I meant the 05/01/2009 build. The fix is a one-line change.

 Add the following as the last line of DataConfig.Entity.clearCache():
 dataSrc = null;
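
 In context the change would look roughly like this; everything except the final
 assignment is illustrative filler, and the real method lives in the
 DataImportHandler's DataConfig class:

     // Sketch of DataConfig.Entity.clearCache() with the one-line fix applied.
     // Only the final "dataSrc = null;" comes from this thread; the idea is to
     // drop the cached DataSource so a closed/stale one is not reused later.
     void clearCache() {
         // ... existing per-entity clean-up (omitted) ...
         dataSrc = null;
     }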



 On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
  marc.sturl...@gmail.com wrote:


  Shalin, you mean I should test the 05/01/2009 nightly? Maybe it works with
  this one? If the fix you did is not really big, can you tell me where in the
  source it is and what it is for? (I have been debugging and tracing a lot of
  the dataimporthandler source and I would like to know what the improvement
  is about, if that is not a problem...)

 Thanks!


 Shalin Shekhar Mangar wrote:
 
   Marc, I've just committed a fix for what may have caused the bug. Can you
   use svn trunk (or the next nightly build) and confirm?
 
  On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
  noble.p...@gmail.com wrote:
 
  looks like a bug w/ DIH with the recent fixes.
  --Noble
 
  On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
 marc.sturl...@gmail.com
  wrote:
  
   Hey there,
   I was using the Deduplication patch with the Solr 1.3 release and everything
   was working perfectly. Now I upgraded to a nightly build (20th December) to
   be able to use the new facet algorithm and other stuff, and DeDuplication is
   not working any more. I have followed exactly the same steps to apply the
   patch to the source code. I am getting this error:
  
   WARNING: Error reading data
   com.mysql.jdbc.CommunicationsException: Communications link failure due to
   underlying exception:

   ** BEGIN NESTED EXCEPTION **

   java.io.EOFException

   STACKTRACE:

   java.io.EOFException
      at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
      at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
      at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
      at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
      at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
      at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
      at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
      at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
      at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
      at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
      at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
      at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
      at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
      at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)


   ** END NESTED EXCEPTION **
   Last packet sent to the server was 202481 ms ago.
      at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
      at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
      at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
      at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
      at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
      at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
      at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Shalin, you mean I should test the 05/01/2009 nightly? Maybe it works with this
one? If the fix you did is not really big, can you tell me where in the source
it is and what it is for? (I have been debugging and tracing a lot of the
dataimporthandler source and I would like to know what the improvement is
about, if that is not a problem...)

Thanks!


Shalin Shekhar Mangar wrote:
 
 Marc, I've just committed a fix for what may have caused the bug. Can you use
 svn trunk (or the next nightly build) and confirm?
 
 On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
 noble.p...@gmail.com wrote:
 
 looks like a bug w/ DIH with the recent fixes.
 --Noble

 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Hey there,
  I was using the Deduplication patch with the Solr 1.3 release and everything
  was working perfectly. Now I upgraded to a nightly build (20th December) to
  be able to use the new facet algorithm and other stuff, and DeDuplication is
  not working any more. I have followed exactly the same steps to apply the
  patch to the source code. I am getting this error:
 
  WARNING: Error reading data
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
  ** END NESTED EXCEPTION **
  Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  Jan 5, 2009 10:06:16 AM
 org.apache.solr.handler.dataimport.JdbcDataSource
  logError
  WARNING: Exception while closing result set
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Marc, I've just committed a fix for what may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 looks like a bug w/ DIH with the recent fixes.
 --Noble

 On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese marc.sturl...@gmail.com
 wrote:
 
  Hey there,
  I was using the Deduplication patch with the Solr 1.3 release and everything
  was working perfectly. Now I upgraded to a nightly build (20th December) to
  be able to use the new facet algorithm and other stuff, and DeDuplication is
  not working any more. I have followed exactly the same steps to apply the
  patch to the source code. I am getting this error:
 
  WARNING: Error reading data
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
 
 
  ** END NESTED EXCEPTION **
  Last packet sent to the server was 202481 ms ago.
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
 at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
 at
 com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
 at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
 at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
 at
 
 org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
 at
 
 org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
 at
 
 org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
 at
 
 org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
 at
 
 org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
 at
 
 org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
  Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
  logError
  WARNING: Exception while closing result set
  com.mysql.jdbc.CommunicationsException: Communications link failure due
 to
  underlying exception:
 
  ** BEGIN NESTED EXCEPTION **
 
  java.io.EOFException
 
  STACKTRACE:
 
  java.io.EOFException
 at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
 at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
 at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771
