I've been training OCR to recognize many characters spread throughout the Unicode standard. I found this handy web app invaluable for working out what some of the "unprintable" Unicode characters are.
http://rishida.net/scripts/uniview/conversion.php

I can copy/paste the character into the top-left text area and hit convert. I am mainly interested in the "UTF-16 code units" text area on the lower-right side of the page, since these are the codes I'm using with Tesseract.

If I don't recognize the UTF-16 code (which happens less often now that I've stared at them so much), I can click "View in Uniview", which is above the top-left text area. This pops up another web page which 99% of the time gives me a printable view of the Unicode character.

Hope it helps!

PS: Does anyone know of a single font which is capable of drawing ALL Unicode characters defined by unicode.org? Currently, I'm using Arial Unicode MS, which does a halfway decent job, but it isn't complete.
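
For anyone who would rather do the same lookup offline, here is a minimal Python sketch that prints the code point, the Unicode name, and the UTF-16 code units of a character. The function names `utf16_code_units` and `describe` are my own, made up for illustration, and I'm assuming that 4-digit hex is a convenient form to compare against the converter's "UTF-16 code units" box; it is not anything Tesseract itself requires.

```python
import unicodedata


def utf16_code_units(text):
    """Return the UTF-16 code units of text as 4-digit uppercase hex strings."""
    # Big-endian encoding without a BOM, so every 2 bytes is one code unit.
    data = text.encode("utf-16-be")
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


def describe(ch):
    """Return (code point, Unicode name, UTF-16 code units) for one character."""
    # Some characters (e.g. control codes) have no name; use a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return "U+%04X" % ord(ch), name, utf16_code_units(ch)


if __name__ == "__main__":
    # An invisible character: one code unit.
    print(describe("\u200b"))
    # A character outside the Basic Multilingual Plane: a surrogate pair.
    print(describe("\U0001D11E"))
```

Characters above U+FFFF come out as two code units (a surrogate pair), which is worth remembering when one of the codes doesn't match what you expected.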

