Andy Dills wrote:

> For what it's worth, the fuzzyocr hashing is of very limited value, and in
> many cases is a severe performance hit. I found that scanning the hashes,
> due to the "fuzzy" nature, is more costly than just rescanning the file
> with OCR, as *each* *and* *every* hash must be checked iteratively.
>
> Because of the "fuzzy" nature, you can't just check the db to "see if this
> hash exists." You have to go through and compare the generated hash to
> every hash in the db, and it considers it a match if it's "close enough".
>
> It's far less computationally expensive to just rescan the damn
> image. It won't matter if you only get a couple hundred emails per day,
> but once the number of stored hashes reaches a reasonably low number, it
> becomes faster to rescan the image than to go through every single stored
> hash to see if you've already scanned a similar image.

I fully agree. When a fuzzyocr caching database grows beyond a certain
(small) size, it becomes a severe penalty, costlier than rescanning the images.
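
To illustrate why the lookup cost grows with the database, here is a
minimal sketch. It is not FuzzyOcr's actual code or hash format: it assumes
hashes are small integers compared by Hamming distance, and the names
hamming, fuzzy_lookup and max_distance are made up for the example. An exact
hash could be found with a single keyed lookup, whereas a "close enough"
match has to walk the stored hashes one by one.

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def fuzzy_lookup(query, stored, max_distance=2):
    """Linear scan for a stored hash within max_distance of query.

    Returns (matching_hash_or_None, number_of_comparisons_made).
    The threshold and the distance measure are assumptions.
    """
    comparisons = 0
    for h in stored:
        comparisons += 1
        if hamming(query, h) <= max_distance:
            return h, comparisons
    # A miss is the worst case: every stored hash was compared.
    return None, comparisons


stored = [0b11110000, 0b00001111, 0b10101010]
hit = fuzzy_lookup(0b00001110, stored)    # near the second stored hash
miss = fuzzy_lookup(0b11111111, stored)   # near none of them
```

A miss costs one comparison per stored hash, so the price of every new
image rises linearly with the size of the database.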


snowcrash wrote:
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
> is it performance, and it just missed the mark for the reasons you
> state?  or is it something else?

The intended goal was no doubt a performance increase,
but the implementation turned it into a performance drag.

A possible compromise is to ditch the fuzzyocr database every couple of days
and let it start anew. This does bring some (limited) benefits.
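
Such a reset could be sketched roughly as follows, to be run from a
scheduler (cron, for instance) every couple of days. The function name
reset_db and the path passed to it are hypothetical, and the sketch assumes
that the hash database is a single file which gets recreated when it is
missing; check that this holds for your setup before relying on it.

```python
import os


def reset_db(db_path):
    """Remove the hash database file so that it is started anew.

    Returns True if a file was removed, False if there was nothing to
    remove. db_path is a hypothetical location, not a known default.
    """
    try:
        os.remove(db_path)
        return True
    except FileNotFoundError:
        return False
```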

  Mark
