[tesseract-ocr] Whitelisting apostrophes problem

Chris H Mon, 03 Apr 2017 23:59:16 -0700

I am having trouble whitelisting and OCRing apostrophes (English single 
right quotes).
With an image like the attached one and no whitelist specified, an 
apostrophe is output:


$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout
Doctor‘s Mask

Because of noise (not necessarily in that test image), I tried adding a 
whitelist containing letters and digits, plus a hyphen, a comma, and 
quotes (you can see my several attempts at the apostrophe):

$ cat .config 
tessedit_char_whitelist -",'\'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\u0027\u2019

With that whitelist, the apostrophe is dropped from the output:
$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout ./.config 
Doctors Mask
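
One thing that may be worth checking is which code point the unfiltered run actually emits. The sketch below is only an illustration, not a confirmed diagnosis: it assumes the quote character is exactly as it appears in this archived copy of the first output, and it assumes the config file is read literally, so that `\u0027` and `\u2019` are plain backslash-and-letter sequences rather than escapes (how Tesseract treats them is not confirmed here). The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text):
    """List (char, code point, Unicode name) for each character that is
    neither alphanumeric nor whitespace."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum() and not ch.isspace()
    ]


# Output of the run without a whitelist, as it appears in this message.
unfiltered_output = "Doctor\u2018s Mask"

# The whitelist value taken literally (assumption: no escape processing),
# so the trailing \u0027\u2019 are ordinary characters here.
whitelist = (
    "-\",'\\'abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\\u0027\\u2019"
)

for ch, code_point, name in describe(unfiltered_output):
    print(ch, code_point, name, "in whitelist:", ch in whitelist)
```

Under those assumptions this prints `‘ U+2018 LEFT SINGLE QUOTATION MARK in whitelist: False`, i.e. the character seen in the unfiltered output is not one of the characters the whitelist contains as written.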

Arch Linux, up to date as of today
tesseract 3.05.00
 leptonica-1.74
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

Please suggest.

