Re: How to run the solr dedup for the document which match 80% or match almost.

2012-01-02 Thread vibhoreng04
Hi,

I implemented TextProfileSignature dedupe as suggested, but I came across
something weird while implementing it.
I am testing it with two documents and trying to index them.

Please see the content below:

Content starts Here
I bought a Toyota Camry in 2007. After driven 6km, Test02 my engine oil
light starts flash after change engine oil and just drive 5000Km during I
use brake. I went to Toyota to ask a , it is said the normal engine Test03
oil consumption is 0.4 to 0.5L/1000Km. Test04 If so, Toyota recommends
6000Km for each engine oil change. If so, after driving 6000Km,Test05 the
engine oil consumption is 3Litre. But each time, the dealer just put 4 Litre
oil in. That means there is just 1 Litre in engine after driving
6000Km. Test06 Does anybody have standard engine oil consumption? As I
searched, even in some undeveloped countries, it is just 0.3Litre/1000Km.
Content ends Here


If I keep adding single test words such as test01, test02, test03 and so on
to the second document, Solr still recognizes the second document as a
duplicate. But if I add any test word more than once (for example test11 or
test07), the document count becomes 2 and the dedupe no longer works.

1) Is this the default behavior, or is there something to fix?

2) Can you also tell me what the threshold limit for dedupe is?

3) "QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and
maxFreq is the maximum token frequency. If maxFreq is higher than 1, then
QUANT is always higher than 2"

Can you please clarify the explanation above? I mean, if QUANT_RATE = 0.01f
and f is less than 100, how can QUANT be an integer?
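
On question 3, a minimal sketch of the arithmetic may help. It assumes (this
is a reading of the quoted description, not something confirmed in this
thread) that the product QUANT_RATE * maxFreq is rounded to the nearest
integer and then raised to 2 whenever maxFreq is greater than 1, so "higher
than 2" is read here as "at least 2". The names quant_for and quantize are
hypothetical.

```python
def quant_for(max_freq, quant_rate=0.01):
    """Integer quantization step for a given maximum token frequency.

    Assumption: the product is rounded to the nearest integer, then
    floored at 2 when any token repeats (max_freq > 1), else at 1.
    """
    quant = int(max_freq * quant_rate + 0.5)  # round half up, inputs are non-negative
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    return quant


def quantize(freq, quant):
    """Round a frequency down to a multiple of quant; 0 means 'dropped'."""
    return (freq // quant) * quant


if __name__ == "__main__":
    for max_freq in (1, 2, 100, 400):
        print(max_freq, quant_for(max_freq))
```

Under these assumptions QUANT is 1 only when no token repeats, 2 for short
documents like the test text above, and grows only once maxFreq reaches a few
hundred. With QUANT = 2, a token seen once quantizes to 0 and a token seen
twice quantizes to 2.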


Regards,

Vibhor


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3626526.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to run the solr dedup for the document which match 80% or match almost.

2012-01-02 Thread vibhoreng04
Hi Lance,

This is a little off-topic, but I would still like to ask you the question.

I implemented TextProfileSignature dedupe as suggested, but I came across
something weird while implementing it.
I am testing it with two documents and trying to index them.

Please see the content below:

Content starts Here
I bought a Toyota Camry in 2007. After driven 6km, Test02 my engine oil
light starts flash after change engine oil and just drive 5000Km during I
use brake. I went to Toyota to ask a , it is said the normal engine Test03
oil consumption is 0.4 to 0.5L/1000Km. Test04 If so, Toyota recommends
6000Km for each engine oil change. If so, after driving 6000Km,Test05 the
engine oil consumption is 3Litre. But each time, the dealer just put 4 Litre
oil in. That means there is just 1 Litre in engine after driving 6000Km.
Test06 Does anybody have standard engine oil consumption? As I searched,
even in some undeveloped countries, it is just 0.3Litre/1000Km.
Content ends Here


If I keep adding single test words such as test01, test02, test03 and so on
to the second document, Solr still recognizes the second document as a
duplicate. But if I add any test word more than once (for example test11 or
test07), the document count becomes 2 and the dedupe no longer works.
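
The behaviour described above can be reproduced with a small sketch of a
text-profile signature. This is not the Solr source; it assumes the profile
is built from lowercased alphanumeric tokens longer than two characters,
that frequencies are rounded down to a multiple of an integer QUANT (2 for a
short text with any repeated token), that tokens falling below QUANT are
dropped, and that the remaining "token count" lines are hashed with MD5. The
function names are hypothetical.

```python
import hashlib
import re
from collections import Counter


def text_profile(text, quant_rate=0.01, min_token_len=2):
    """Return the quantized (token, count) profile, most frequent first."""
    # Simplification: only ASCII letters and digits form tokens here.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) > min_token_len]
    counts = Counter(tokens)
    if not counts:
        return []
    max_freq = max(counts.values())
    quant = int(max_freq * quant_rate + 0.5)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:  # tokens below QUANT are discarded
            profile.append((token, quantized))
    profile.sort(key=lambda item: (-item[1], item[0]))
    return profile


def signature(text):
    """MD5 over the profile; equal signatures mean 'duplicate'."""
    body = "\n".join(f"{token} {count}" for token, count in text_profile(text))
    return hashlib.md5(body.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    base = "engine oil consumption after engine oil change and engine oil light"
    singles = base + " test01 test02 test03"
    repeated = base + " test07 test07"
    print(signature(base) == signature(singles))   # words added once
    print(signature(base) == signature(repeated))  # a word added twice
```

In this sketch a word added once never reaches the profile, so the signature
is unchanged, while a word added twice does reach it and changes the hash.
There is also no percentage threshold here: two documents count as
duplicates only when their quantized profiles hash to exactly the same value.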

1) Is this the default behavior, or is there something to fix?

2) Can you also tell me what the threshold limit for dedupe is?

3) "QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and
maxFreq is the maximum token frequency. If maxFreq is higher than 1, then
QUANT is always higher than 2"

Can you please clarify the explanation above? I mean, if QUANT_RATE = 0.01f
and f is less than 100, how can QUANT be an integer?


Regards,

Vibhor 



Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-28 Thread Lance Norskog
You would have to implement this yourself in your indexing code. Solr
has an analysis plugin that analyzes your text and returns the result,
but does not query or index. You can use this to calculate the fuzzy
hash, then search against the index.

You might be able to code this in an UpdateRequestProcessor.
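
A rough sketch of that flow, with everything Solr-specific replaced by
stand-ins: fuzzy_hash is a placeholder (MD5 over the tokens that occur at
least twice), and InMemoryIndex is a hypothetical substitute for the real
index. In an actual setup the lookup would be a query on a stored signature
field, and the hash would come from whatever analysis you settle on.

```python
import hashlib
import re
from collections import Counter


def fuzzy_hash(text):
    """Placeholder fuzzy hash: MD5 over the tokens that occur at least twice."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    repeated = sorted(token for token, n in counts.items() if n >= 2)
    return hashlib.md5(" ".join(repeated).encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for the search index, keyed by signature."""

    def __init__(self):
        self._by_signature = {}

    def find_by_signature(self, sig):
        return self._by_signature.get(sig)

    def add(self, doc_id, sig):
        self._by_signature[sig] = doc_id


def index_if_new(index, doc_id, text):
    """Index the document unless one with the same fuzzy hash already exists.

    Returns the id of the document that represents this text in the index.
    """
    sig = fuzzy_hash(text)
    existing = index.find_by_signature(sig)
    if existing is not None:
        return existing  # near-duplicate: keep the earlier document
    index.add(doc_id, sig)
    return doc_id


if __name__ == "__main__":
    index = InMemoryIndex()
    print(index_if_new(index, "doc1", "engine oil light, engine oil change"))
    print(index_if_new(index, "doc2", "engine oil light! engine oil change today"))
    print(index_if_new(index, "doc3", "brake pads and brake fluid"))
```

The point of the design is that the comparison happens before the add, so
incrementally indexed files are checked against what is already stored.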

On Tue, Dec 27, 2011 at 9:45 PM, vibhoreng04 vibhoren...@gmail.com wrote:
 Hi Shashi,

 That's correct! But I need something for index-time comparison. Can cosine
 similarity compare incrementally indexed files against the already indexed
 documents?



 Regards,


 Vibhor




-- 
Lance Norskog
goks...@gmail.com


Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread vibhoreng04
Hi iorixxx,

Thanks for the quick update. I hope I can take it from here!


Regards,

Vibhor



Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread Ahmet Arslan
 I am doing dedup for my Solr instance, which works on the content and the
 url fields. My question is: if I want to eliminate the records which are
 80% or 90% matching in the content field, how should I proceed?
 I have already changed the part of my solrconfig.xml that is required for
 dedup (the update request processor chain), and that part is working fine.

You can use TextProfileSignature, which is a fuzzy hashing implementation,
instead of Lookup3Signature.


Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread Shashi Kant
You can also look at cosine similarity (or related metrics) to measure
document similarity.
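
For illustration, a minimal sketch of cosine similarity over raw term
counts. The tokenization, the function names, and the 0.8 cut-off (taken
from the "80%" in the question) are assumptions for this example; a real
setup would more likely use weighted term vectors.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercased alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """True when the two texts are at least `threshold` similar."""
    return cosine_similarity(term_vector(text_a), term_vector(text_b)) >= threshold


if __name__ == "__main__":
    a = "engine oil consumption is high"
    b = "engine oil consumption is very high"
    print(cosine_similarity(term_vector(a), term_vector(b)))
    print(is_near_duplicate(a, b))
```

Unlike a signature, this gives a graded score, but it needs a pairwise
comparison against candidate documents rather than a single hash lookup.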

On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04 vibhoren...@gmail.com wrote:
 Hi iorixxx,

 Thanks for the quick update. I hope I can take it from here!


 Regards,

 Vibhor



Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread vibhoreng04
Hi Shashi,

That's correct! But I need something for index-time comparison. Can cosine
similarity compare incrementally indexed files against the already indexed
documents?



Regards,


Vibhor 
