Marc Sturlese wrote:
Hey there, I've been testing and checking the source of the
TextProfileSignature.java to avoid similar entries at indexing time.
What I understood is that it is useful for huge text where the frequency of
the tokens (the words in lowercase just with number and leters in taht case)
is important. If you want to detect duplicates in not huge text and not
giving a lot of importance to the frequencies it doesn't work...
The hash will be made just with the terms wich frequency is higher than a
QUANTUM (which value is given in function of the max freq between all the
terms). So it will say that:

aaa sss ddd fff ggg hhh aaa kkk lll ooo
aaa xxx iii www qqq aaa jjj eee zzz nnn

are duplicates because quantum here wolud be 2 and the frequency of aaa
would be 2 aswell. So, to make the hash just the term aaa would be used.

In this case:
aaa sss ddd fff ggg hhh kkk lll ooo
apa sss ddd fff ggg hhh kkk lll ooo

Here quantum would be 1 and the frequencies of all terms would be 1 so all
terms would be use for the hash. It will consider this two strings not
similar.

As I understood the algorithm there's no way to make it understand that in
my second case both strings are similar. I wish i were wrong...

I have my own duplication system to detect that but I use String comparison
so it works really slow... Would like to know if there is any tuning
possibility to do that with TextProfileSignature
Don't know if I should pot this here or in the developers forum...

Hi Marc,

TextProfileSignature is a rather crude implementation of approximate similarity, and as you pointed out it's best suited for large texts. The original purpose of this Signature was to deduplicate web pages in large amounts of crawled pages (in Nutch), where it worked reasonably well. Its advantage is also that it's easy to compute and doesn't require multiple passes over the corpus.

As it is implemented now, it breaks badly in the case you describe. You could modify this implementation to include also word-level ngrams, i.e. sequences of more than 1 word, up to N (e.g. 5) - this should work in your case.

Ultimately, what you are probably looking for is a shingle-based algorithm, but it's relatively costly and requires multiple passes.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Reply via email to