>> I have my own duplication system to detect that but I use String comparison
>> so it works really slow...
>
> What are you doing for the String comparison? Not exact right?

Hey,

My comparison method looks for similar texts (not just exact matches). I compare two texts word by word and then decide on a similarity percentage. For example:

aaa sss ddd fff ggg hhh jjj kkk lll ooo
bbb rrr ddd fff ggg hhh jjj kkk lll ooo

With a similarity threshold of 80% and a word-by-word comparison, these two Strings would be considered similar. (I split the texts into tokens and count how many tokens match.) I use some stopwords and rules as well.

I am going to try more tuning of the TextProfileSignature parameters, as you say. I don't know if you remember, but I asked you about this at ApacheCon and you told me about this JIRA 799. If I make it work, it will definitely be much faster than my system...

About deduplication: I couldn't find anywhere the class that appears in the wiki, org.apache.solr.update.processor.DeduplicateUpdateProcessorFactory, so I downloaded the patch and plugged it into my Solr source (I use org.apache.solr.update.processor.TextProfileSignature instead of the one written in the wiki).

I would appreciate any advice about the tuning params of TextProfileSignature.

Thank you for your time

markrmiller wrote:
>
>> I have my own duplication system to detect that but I use String comparison
>> so it works really slow...
>
> What are you doing for the String comparison? Not exact right?
>

--
View this message in context: http://www.nabble.com/TextProfileSigature-using-deduplication-tp20559155p20560828.html
Sent from the Solr - User mailing list archive at Nabble.com.
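
For readers following the thread, the word-by-word comparison described in the message could look roughly like the sketch below. It is only an illustration, not the poster's actual code: the function names `word_similarity` and `is_near_duplicate` are hypothetical, the position-by-position matching is an assumption (the message does not say whether word order matters), and the extra "rules" mentioned are not modelled at all.

```python
# Hypothetical sketch of a word-by-word similarity check.
# Assumption: tokens are compared position by position; the original
# message does not specify how matches are counted.

def word_similarity(text_a, text_b, stopwords=frozenset()):
    """Return the fraction of token positions where both texts agree."""
    # Split on whitespace, lowercase, and drop stopwords.
    tokens_a = [w for w in text_a.lower().split() if w not in stopwords]
    tokens_b = [w for w in text_b.lower().split() if w not in stopwords]

    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        # Two empty texts are treated as identical.
        return 1.0

    # Count positions where the tokens are equal.
    matches = sum(1 for a, b in zip(tokens_a, tokens_b) if a == b)
    return matches / longest


def is_near_duplicate(text_a, text_b, threshold=0.8, stopwords=frozenset()):
    """True when the similarity reaches the chosen threshold (80% by default)."""
    return word_similarity(text_a, text_b, stopwords) >= threshold


if __name__ == "__main__":
    first = "aaa sss ddd fff ggg hhh jjj kkk lll ooo"
    second = "bbb rrr ddd fff ggg hhh jjj kkk lll ooo"
    print(word_similarity(first, second))
    print(is_near_duplicate(first, second))
```

With the two example strings from the message, 8 of the 10 token positions agree, so the pair meets the 80% threshold. Comparing every document against every other this way grows quadratically with the number of documents, which is consistent with the slowness the poster reports and is the reason a per-document signature is attractive.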