Re: TextProfileSignature using deduplication

Andrzej Bialecki Tue, 18 Nov 2008 06:40:19 -0800

Marc Sturlese wrote:

Hey there, I've been testing and checking the source of
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge texts, where the frequency of
the tokens (the lowercased words, made of just numbers and letters in that case)
is important. If you want to detect duplicates in texts that are not huge, without
giving a lot of importance to the frequencies, it doesn't work...
The hash is built only from the terms whose frequency is at least a
QUANTUM (whose value is derived from the max frequency among all the
terms). So it will say that:


aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates because the quantum here would be 2 and the frequency of aaa
would be 2 as well. So only the term aaa would be used to make the hash.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here the quantum would be 1 and the frequencies of all terms would be 1, so all
terms would be used for the hash. It will consider these two strings not
similar.

As I understand the algorithm, there's no way to make it recognize that in
my second case both strings are similar. I wish I were wrong...

I have my own duplicate-detection system for that, but it uses String comparison
so it works really slowly... I would like to know if there is any tuning
possibility to do that with TextProfileSignature.

Don't know if I should post this here or in the developers forum...


Hi Marc,

TextProfileSignature is a rather crude implementation of approximate similarity, and as you pointed out it's best suited for large texts. The original purpose of this Signature was to deduplicate web pages in large amounts of crawled pages (in Nutch), where it worked reasonably well. Its advantage is also that it's easy to compute and doesn't require multiple passes over the corpus.
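
To make the behaviour discussed above concrete, here is a minimal sketch of a quantized term-frequency profile. It is not the actual TextProfileSignature source: the class name ProfileSketch, the QUANT_RATE value, the alphabetical ordering of the profile and the use of MD5 are assumptions made for illustration. It only follows the description in this thread: lowercase alphanumeric tokens, a quantum derived from the maximum frequency, and a hash built from the terms that reach the quantum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ProfileSketch {
    // Assumed rate: the quantum is taken as a small fraction of the max frequency.
    static final double QUANT_RATE = 0.01;

    /** Returns the quantized "term frequency" entries that feed the hash. */
    static List<String> profile(String text) {
        Map<String, Integer> freq = new HashMap<>();
        int maxFreq = 0;
        // Tokens are lowercase runs of letters and digits.
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.isEmpty()) continue;
            int f = freq.merge(tok, 1, Integer::sum);
            maxFreq = Math.max(maxFreq, f);
        }
        // Quantum: 1 when no term repeats, otherwise at least 2.
        int quant = (int) Math.round(maxFreq * QUANT_RATE);
        if (quant < 2) quant = (maxFreq > 1) ? 2 : 1;

        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int q = (e.getValue() / quant) * quant; // round down to a multiple of quant
            if (q >= quant) kept.add(e.getKey() + " " + q);
        }
        Collections.sort(kept); // deterministic order (a simplification of this sketch)
        return kept;
    }

    /** Hex MD5 digest of the profile; MD5 is just a convenient digest here. */
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String entry : profile(text)) {
                md.update((entry + "\n").getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // First pair from the question: only "aaa" reaches the quantum of 2.
        System.out.println(signature("aaa sss ddd fff ggg hhh aaa kkk lll ooo")
                .equals(signature("aaa xxx iii www qqq aaa jjj eee zzz nnn"))); // true
        // Second pair: quantum is 1, every term is kept, one term differs.
        System.out.println(signature("aaa sss ddd fff ggg hhh kkk lll ooo")
                .equals(signature("apa sss ddd fff ggg hhh kkk lll ooo"))); // false
    }
}
```

Under these assumptions the first pair from your message collapses to the single entry "aaa 2" and gets the same signature, while the second pair differs in one entry and therefore hashes differently.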

As it is implemented now, it breaks badly in the case you describe. You could modify this implementation to include also word-level ngrams, i.e. sequences of more than 1 word, up to N (e.g. 5) - this should work in your case.
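
As a rough illustration of what "word-level ngrams up to N" could look like, the sketch below emits every run of 1 to maxN consecutive words as one extra token. The class name WordNgrams, the method name and the "_" separator are hypothetical, and how these tokens would then be fed into the frequency profile is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordNgrams {
    /** Emits all word-level n-grams of length 1..maxN, in document order. */
    static List<String> ngrams(String text, int maxN) {
        // Same tokenization as discussed in the thread: lowercase letters and digits.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one word at a time, starting at position i.
            for (int n = 1; n <= maxN && i + n <= words.size(); n++) {
                if (n > 1) gram.append('_'); // hypothetical separator
                gram.append(words.get(i + n - 1));
                out.add(gram.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("aaa sss ddd", 2));
        // [aaa, aaa_sss, sss, sss_ddd, ddd]
    }
}
```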

Ultimately, what you are probably looking for is a shingle-based algorithm, but it's relatively costly and requires multiple passes.
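
For reference, a minimal sketch of the shingle idea: each text becomes the set of its overlapping k-word sequences, and two texts are compared by the Jaccard overlap of those sets. The class name ShingleSketch, the choice of k = 2 and any similarity threshold are assumptions. The sketch compares one pair only and does not address how to do this efficiently over a whole corpus.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ShingleSketch {
    /** Set of all overlapping k-word shingles in the text. */
    static Set<String> shingles(String text, int k) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) words.add(w);
        }
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + k)));
        }
        return out;
    }

    /** Jaccard similarity |A and B| / |A or B|; 0.0 when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Second pair from the question: 7 of 9 distinct 2-word shingles are shared.
        System.out.println(jaccard(
                shingles("aaa sss ddd fff ggg hhh kkk lll ooo", 2),
                shingles("apa sss ddd fff ggg hhh kkk lll ooo", 2)));
        // First pair from the question: no 2-word shingle is shared.
        System.out.println(jaccard(
                shingles("aaa sss ddd fff ggg hhh aaa kkk lll ooo", 2),
                shingles("aaa xxx iii www qqq aaa jjj eee zzz nnn", 2)));
    }
}
```

With 2-word shingles, the second pair from the question scores 7/9 (about 0.78) and the first pair scores 0, which is the opposite of what the frequency-profile hash reports for them.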


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
