[tesseract-ocr] Re: Tesseract for Phishing detection

Jack D Wed, 06 Jul 2016 12:41:51 -0700

I'm aware that edit distance and the Bitap algorithm are used in phishing 
detection, but the challenge is that a low edit distance between two words 
doesn't guarantee visual similarity. For example: paypal and laypas.
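
To make the point concrete, here is a minimal sketch of the standard 
dynamic-programming Levenshtein distance. The function name and the sample 
strings other than paypal and laypas are my own illustrations, not taken from 
any particular phishing tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


# "paypa1" is visually close to "paypal"; "laypas" is not,
# yet both sit within a small edit distance of the target.
for candidate in ("paypa1", "laypas"):
    print(candidate, levenshtein("paypal", candidate))
```

Both candidates land within two edits of paypal, so a plain distance 
threshold cannot tell the visually convincing one from the other.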


So I've been considering converting the target text into an image, maybe 
applying a filter or two, and running OCR to get all possible words along with 
their likelihoods. Is that possible with Tesseract?
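
A rough sketch of the pipeline I have in mind, assuming the Pillow and 
pytesseract Python packages and a local Tesseract install. The helper names, 
canvas size, scale factor and blur radius are all hypothetical choices. As far 
as I can tell, image_to_data reports a confidence only for the best reading of 
each word, so this does not enumerate alternative readings.

```python
def render_word(text, blur_radius=0.0, scale=3):
    """Draw text on a white canvas, optionally blurred (hypothetical helper)."""
    # Imported lazily so the parsing helper below works without Pillow.
    from PIL import Image, ImageDraw, ImageFilter, ImageFont

    img = Image.new("L", (12 * len(text) + 20, 30), 255)
    ImageDraw.Draw(img).text((10, 8), text, fill=0,
                             font=ImageFont.load_default())
    # The default bitmap font is small, so enlarge before OCR.
    img = img.resize((img.width * scale, img.height * scale))
    if blur_radius > 0:
        img = img.filter(ImageFilter.GaussianBlur(blur_radius))
    return img


def words_with_confidence(data):
    """Pair each recognized word with its confidence, skipping empty boxes."""
    pairs = []
    for word, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some versions return strings, others numbers
        if word.strip() and conf >= 0:  # -1 marks non-word layout boxes
            pairs.append((word, conf))
    return pairs


def ocr_with_confidence(img):
    """Run Tesseract on an image and return (word, confidence) pairs."""
    import pytesseract  # requires the tesseract binary to be installed

    data = pytesseract.image_to_data(img,
                                     output_type=pytesseract.Output.DICT)
    return words_with_confidence(data)
```

The idea would be to call ocr_with_confidence(render_word("paypa1", 
blur_radius=1.0)) and check whether the OCR reading collapses to a protected 
name such as paypal.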

On Wednesday, June 8, 2016 at 1:56:42 AM UTC-7, Bojidar Stanchev wrote:
>
> Tesseract is mostly used to recognize text from images.
>
> From what I understand you want to protect yourself from phishing.
> A very good way to do that is to familiarize yourself with the Levenshtein 
> distance algorithm.
> It's very simple: it calculates how many single-character edits (insertions, 
> deletions, or substitutions) you need to turn one string into another.
> For example, comparing paiipal to paypal gives a distance of 2: substitute 
> one letter and delete one.
>
> Why am I suggesting this? Because your problem has already been solved in 
> a slightly different setting: the corporate world.
> Sometimes a bad employee in a company will try to replace the company name 
> on a document with a near-identical name, for example with two letters swapped.
> Small alterations like this are hard for a human to notice, as you 
> pointed out, but very easy for a machine.
>
> I hope this helps. If not, maybe I did not fully understand your 
> intentions, and you would have to clarify why you need to use Tesseract so I 
> can help you further.
>
