>>
>> I have my own duplication system to detect that but I use String 
>> comparison
>> so it works really slow...
>>  
What are you doing for the String comparison? Not exact right?

Hey,
My comparison method looks for similar texts, not just exact matches. I
compare two texts word by word and then apply a similarity threshold,
expressed as a percentage. For example:
aaa sss ddd fff ggg hhh jjj kkk lll ooo
bbb rrr ddd fff ggg hhh jjj kkk lll ooo

With a threshold of 80% and a word-by-word comparison, these two Strings
would count as similar, since 8 of their 10 words match. (I split the texts
into tokens and count how many of them match.)
(I use some stopwords and rules as well.)
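
To make the comparison above concrete, here is a minimal sketch in Python.
It is not my actual code: it assumes the words are compared position by
position, and the names word_similarity, is_duplicate and the stopwords
argument are made up for illustration. My extra rules are left out.

```python
def word_similarity(text_a, text_b, stopwords=frozenset()):
    """Return the fraction of word positions at which both texts agree.

    Assumption: words are compared position by position, and the score is
    divided by the length of the longer token list.
    """
    # Split on whitespace, lowercase, and drop the (hypothetical) stopwords.
    tokens_a = [w for w in text_a.lower().split() if w not in stopwords]
    tokens_b = [w for w in text_b.lower().split() if w not in stopwords]

    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical

    # zip stops at the shorter list, so extra trailing words count as misses.
    matches = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return matches / longest


def is_duplicate(text_a, text_b, threshold=0.8, stopwords=frozenset()):
    """Treat two texts as duplicates when their similarity reaches the threshold."""
    return word_similarity(text_a, text_b, stopwords) >= threshold


first = "aaa sss ddd fff ggg hhh jjj kkk lll ooo"
second = "bbb rrr ddd fff ggg hhh jjj kkk lll ooo"
print(word_similarity(first, second))  # 8 matching words out of 10
print(is_duplicate(first, second))
```

Checking every document against every other one this way needs a number of
comparisons that grows with the square of the number of documents, which is
probably why it gets so slow. A signature computed once per document avoids
that.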

I am going to try tuning the parameters of TextProfileSignature further, as
you suggest.
I don't know if you remember, but I asked you about this at ApacheCon and
you told me about JIRA issue 799. If I can make it work, it will definitely
be much faster than my system...

About deduplication... I couldn't find the class that appears in the wiki,
org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory, anywhere,
so I downloaded the patch and plugged it into my Solr source (I use
org.apache.solr.update.processor.TextProfileSignature instead of the one
given in the wiki).

I would appreciate any advice about the tuning parameters of
TextProfileSignature.

Thank you for your time



markrmiller wrote:
> 
> 
>>>
>>> I have my own duplication system to detect that but I use String 
>>> comparison
>>> so it works really slow...
>>>  
> What are you doing for the String comparison? Not exact right?
> 
> 
-- 
View this message in context: 
http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20560828.html
Sent from the Solr - User mailing list archive at Nabble.com.
