Building new language with tesseract, characters touching

Matt Chan Mon, 01 Jun 2009 15:21:08 -0700

Hi,

I'm training tesseract to recognize only a small subset of English
letters (A, C, T, G, U) so that I can pull nucleic acid sequences out
of journal publications.


I'm having trouble with one paper because the font joins 'A's when
they are consecutive. I've tried creating boxes that split the joined
'AA' into two separate letters, but tesseract gives me the error "box
overlaps blob in labelled word".

I've managed to get around that by specifying 'AA' as a single letter
for those blobs, but I'm still running into an "Error: Illegal
malloc request size!" bug. I'm not sure whether it is related to my
training process or to something else altogether.
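
To make the workaround concrete, here is a rough Python sketch of what
merging the joined boxes looks like. It assumes the plain box file
layout of "char left bottom right top" (one entry per line, origin at
the bottom-left of the image). The helper name merge_touching_boxes and
the max_gap threshold are made up for illustration and are not part of
tesseract.

```python
def parse_box_line(line):
    """Parse one 'char left bottom right top' box-file entry."""
    parts = line.split()
    return parts[0], [int(v) for v in parts[1:5]]


def merge_touching_boxes(lines, glyph="A", max_gap=0):
    """Merge consecutive `glyph` boxes on the same text line.

    Hypothetical helper: two boxes are merged when both are labelled
    with `glyph`, they overlap vertically, and the horizontal gap
    between them is at most `max_gap` pixels. The merged entry gets a
    multi-character label (e.g. 'AA') and the union bounding box.
    """
    merged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, (left, bottom, right, top) = parse_box_line(line)
        if merged:
            prev_label, prev = merged[-1]
            same_glyph = label == glyph and set(prev_label) == {glyph}
            # Vertical overlap keeps boxes on different text lines apart.
            same_row = bottom < prev[3] and top > prev[1]
            if same_glyph and same_row and left - prev[2] <= max_gap:
                merged[-1] = (
                    prev_label + label,
                    [min(prev[0], left), min(prev[1], bottom),
                     max(prev[2], right), max(prev[3], top)],
                )
                continue
        merged.append((label, [left, bottom, right, top]))
    return ["{} {} {} {} {}".format(lab, *box) for lab, box in merged]


if __name__ == "__main__":
    sample = ["A 10 20 30 50", "A 30 20 50 50", "C 55 20 70 50"]
    for entry in merge_touching_boxes(sample):
        print(entry)
```

This only rewrites the box file. Whether the trainer accepts a
two-character label for a single blob is a separate question.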

I'm hesitant to recompile because I'm moving the data files to a
closed-source program which uses a tesseract back-end.

I can give more details if necessary.

Thanks in advance for any replies.
Matt
