I've been training OCR to recognize many characters scattered across the
Unicode standard.
I found this handy web app invaluable for understanding what some of the
"unprintable" Unicode characters are.

I can copy/paste the character into the top-left text area and hit
Convert.
I am mainly interested in the "UTF-16 code units" text area on the
lower right side of the page, since these are the codes I'm using with
Tesseract.
http://rishida.net/scripts/uniview/conversion.php
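
If you'd rather get the same UTF-16 code units offline, here is a minimal
Python sketch. The function name utf16_code_units is just something made up
for illustration, and I'm assuming you want the units as big-endian
four-digit hex, so adjust the formatting to whatever your training files
expect.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as uppercase hex strings."""
    # "utf-16-be" encodes big-endian and does not prepend a byte order mark.
    data = text.encode("utf-16-be")
    # Each UTF-16 code unit is exactly two bytes.
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


if __name__ == "__main__":
    # Zero width space, one of the "unprintable" characters.
    print(utf16_code_units("\u200b"))
    # A character outside the Basic Multilingual Plane takes two code units
    # (a surrogate pair).
    print(utf16_code_units("\U0001D11E"))
```

Note that characters above U+FFFF come out as two code units, which matches
what the web page shows for them.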

If I don't recognize the UTF-16 code units (which happens less often now
that I've stared at them so much), I can click the "View in Uniview" link
above the top-left text area. This pops up another web page which, 99% of
the time, gives me a printable view of the Unicode character.
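
For a quick check without a browser, Python's standard unicodedata module
can report a character's official name and general category. This is only a
rough sketch (describe is a hypothetical helper name, and the "<no name>"
placeholder is my own choice), and it won't show you the glyph the way
Uniview does.

```python
import unicodedata


def describe(ch):
    """Return the code point, general category, and name of one character."""
    code_point = ord(ch)
    # Some characters (e.g. control codes) have no name; use a placeholder.
    name = unicodedata.name(ch, "<no name>")
    category = unicodedata.category(ch)
    return f"U+{code_point:04X} {category} {name}"


if __name__ == "__main__":
    for ch in "A\u200b\t":
        print(describe(ch))
```

The category code is handy here: "Cf" (format) and "Cc" (control) are the
usual suspects when a character doesn't print.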

Hope it helps!


PS: Does anyone know of a single font that can draw ALL the characters
defined by unicode.org? Currently I'm using Arial Unicode MS, which does a
halfway decent job but isn't complete.