I'm aware that edit distance and the Bitap algorithm are used in phishing detection, but the challenge is that a low edit distance between two words doesn't guarantee visual similarity. For example, paypal and laypas are only two substitutions apart, yet they look nothing alike.
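To make the point concrete, here is a minimal sketch of the standard Levenshtein distance (insert, delete, substitute, each at cost 1). The function name `levenshtein` and the candidate strings other than laypas and paiipal are my own choices for illustration, not taken from any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


target = "paypal"
# "paypa1" is a hypothetical look-alike (digit 1 in place of letter l).
for candidate in ("laypas", "paypa1", "paiipal"):
    print(candidate, levenshtein(candidate, target))
```

With this definition laypas scores 2 and paypa1 scores 1, so the metric barely separates a string that looks nothing like paypal from one that is easy to mistake for it. It also gives 2 for paiipal; the figure of 3 in the message quoted below corresponds to counting only insertions and deletions.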
So, I've been considering converting the target text into an image, maybe applying a filter or two, and then running OCR on it to get all the candidate words along with their likelihoods. Is that possible with Tesseract?

On Wednesday, June 8, 2016 at 1:56:42 AM UTC-7, Bojidar Stanchev wrote:
> Tesseract is mostly used to recognize text from images.
>
> From what I understand you want to protect yourself from phishing.
> A very good way to do that is to familiarize yourself with Levenshtein
> distance algorithm.
> It's very simple - it calculates how many changes you need to make to a
> string to get to the desired string.
> For example if you have paiipal and compare it to paypal it will give you
> a distance of 3 - remove 2 letters and add 1.
>
> Why am I suggesting this - because your problem has already been solved in
> a slightly different situation - corporate world.
> Sometimes a bad employee in a company would try to switch the company name
> on a document with the same name but 2 letters are swapped for example,
> small alterations like this are hard to notice for a human, like you
> pointed out, but for a machine is very easy.
>
> I hope this helps, if not, maybe I did not fully understand your
> intentions and you would have to clarify why you need to use Tesseract so I
> can further help you.

