Andy Dills wrote:
> For what it's worth, the fuzzyocr hashing is of very limited value, and in
> many cases is a severe performance hit. I found that scanning the hashes,
> due to the "fuzzy" nature, is more costly than just rescanning the file
> with OCR, as *each* *and* *every* hash must be checked iteratively.
>
> Because of the "fuzzy" nature, you can't just check the db to "see if this
> hash exists." You have to go through and compare the generated hash to
> every hash in the db, and it considers it a match if it's "close enough".
>
> It's severely less computationally expensive to just rescan the damn
> image. It won't matter if you only get a couple hundred emails per day,
> but once the number of stored hashes reaches a reasonably low number, it
> becomes faster to rescan the image than to go through every single stored
> hash to see if you've already scanned a similar image.

I fully agree. When a fuzzyocr caching database grows beyond a certain
(small) size, it becomes a severe penalty, costlier than rescanning the
images.

snowcrash wrote:
> i'd be interested in what, then, the 'goal' of the hashing/comparison *is*?
> is it performance, and it just missed the mark for the reasons you
> state? or is it something else?

The desired goal was no doubt a performance increase, but the
implementation turned it into a performance drag.

A possible compromise is to drop the fuzzyocr database every couple of
days and let it start anew. This does bring some (limited) benefits.

Mark
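
To illustrate the cost argument above, here is a rough Python sketch. It is not FuzzyOcr's actual code: the class `FuzzyCache`, the integer-encoded hashes, the Hamming-distance test for "close enough", and the two-day `max_age_seconds` default are all assumptions made for illustration. The point is only that a fuzzy lookup has to compare against every stored hash on a miss, and that a periodic reset keeps that list short.

```python
import time


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


class FuzzyCache:
    """Hypothetical fuzzy-hash cache (illustrative only, not FuzzyOcr's design).

    A lookup matches any stored hash within max_distance bits, so it
    cannot use a plain key lookup and must scan the stored hashes.
    """

    def __init__(self, max_distance: int = 2,
                 max_age_seconds: float = 2 * 24 * 3600):
        self.max_distance = max_distance
        self.max_age_seconds = max_age_seconds  # assumed "couple of days"
        self.entries = []                       # list of (hash, verdict)
        self.created = time.time()
        self.comparisons = 0                    # counts work done by lookups

    def store(self, h: int, verdict: str) -> None:
        self.entries.append((h, verdict))

    def lookup(self, h: int):
        # Linear scan: stops at the first "close enough" hash, but a
        # miss costs one comparison per stored entry.
        for stored, verdict in self.entries:
            self.comparisons += 1
            if hamming(stored, h) <= self.max_distance:
                return verdict
        return None

    def reset_if_stale(self, now=None) -> bool:
        """Drop all stored hashes once the cache is older than max_age_seconds."""
        now = time.time() if now is None else now
        if now - self.created >= self.max_age_seconds:
            self.entries.clear()
            self.created = now
            return True
        return False


if __name__ == "__main__":
    cache = FuzzyCache()
    for i in range(1000):
        cache.store(i << 8, "spam")   # 1000 distinct stored hashes
    cache.lookup(0xFF)                # not close to any stored hash
    print("comparisons for one miss:", cache.comparisons)
```

Whether the scan actually loses to a rescan depends on how expensive each comparison is relative to one OCR pass, which this sketch does not measure; it only shows that the lookup cost grows with the number of stored hashes while the rescan cost does not.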