I think I got around it. The problem was that I wasn't copying over the word-dawg and freq-dawg files from another language or generating them. I just touched empty files with the same names. Sorry for the trouble.
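For anyone hitting the same thing, here is a minimal sketch of that workaround in Python. It only creates empty placeholder files and leaves any existing ones alone. The `<lang>.word-dawg` / `<lang>.freq-dawg` naming, the language code `nuc`, and the directory passed in are assumptions for illustration, so adjust them to whatever your install actually expects.

```python
from pathlib import Path


def touch_placeholder_dawgs(tessdata_dir, lang, suffixes=("word-dawg", "freq-dawg")):
    """Create empty <lang>.<suffix> files in tessdata_dir.

    Existing files are left untouched. Returns the list of paths created.
    The file naming scheme is an assumption, not taken from the Tesseract docs.
    """
    tessdata = Path(tessdata_dir)
    tessdata.mkdir(parents=True, exist_ok=True)
    created = []
    for suffix in suffixes:
        path = tessdata / f"{lang}.{suffix}"
        if not path.exists():
            path.touch()  # same effect as the shell `touch` command
            created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    # Demo in a throwaway directory; "nuc" is a hypothetical language code.
    with tempfile.TemporaryDirectory() as tmp:
        for p in touch_placeholder_dawgs(tmp, "nuc"):
            print(p.name, p.stat().st_size)
```

Whether empty dictionary files help or hurt accuracy is a separate question; this only gets past the missing-file complaint.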
Ray, could the training document be clarified so that where it says "You must create inttemp, normproto, pffmtable and unicharset using the procedure described below", it also says that the *-dawg files need to be copied over or generated? I know it says that not using the *-dawg files can result in lower accuracy, but in my experience tess asks for them and won't run otherwise.

Also, since I'm reading in nucleic acid strings, which don't have word sequences, would not using a dictionary actually increase accuracy?

Thanks,
Matt

On Jun 1, 6:20 pm, Matt Chan <[email protected]> wrote:
> Hi,
>
> I'm training tesseract to recognize only a small subset of English
> letters (A, C, T, G, U) for pulling nucleic acid sequences out of
> journal publications.
>
> I'm having trouble with one paper because the font joins 'A's when
> they are consecutive. I've tried creating boxes which break the joined
> 'AA' apart, but tesseract gives me an error about having "box
> overlaps blob in labelled word".
>
> I've managed to get around that by specifying 'AA' as a single letter
> for those blobs, but I'm still having issues with an "Error: Illegal
> malloc request size!" bug. I'm not sure if these are related to my
> training process, or something else altogether.
>
> I'm hesitant to recompile because I'm moving the data files to a
> closed-source program which uses a tesseract back-end.
>
> I can give more details if necessary.
>
> Thanks in advance for any replies.
> Matt

