My original image file was 12600 × 6670 pixels, which is within the maximum of 7FFF × 7FFF, and contains about 7000 different characters. That is rather small according to the wiki, which says you should have tens of thousands of characters for large character sets. Anyway, I chopped this big image into smaller ones of 500 characters per image and trained Tesseract with all of these pieces, and the above-mentioned error came up for just one piece. So I continued, thinking that losing 500 characters is better than not being able to train anything. I merged the remaining .tr files and .box files into one big .tr file and one big .box file.
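For anyone trying to reproduce the merge step, here is a minimal sketch, assuming that "merging" means plain concatenation of the per-piece files in order. The function name `merge_files` and the file names in the usage note are hypothetical; nothing here is part of Tesseract itself.

```python
from pathlib import Path


def merge_files(parts, merged_path):
    """Concatenate the given files, in order, into merged_path.

    Returns the number of bytes written. This is plain concatenation;
    it does not validate the contents of the .tr or .box files.
    """
    total = 0
    with Path(merged_path).open("wb") as out:
        for part in parts:
            data = Path(part).read_bytes()
            # Ensure each piece ends with a newline so that the last line
            # of one file does not run into the first line of the next.
            if data and not data.endswith(b"\n"):
                data += b"\n"
            out.write(data)
            total += len(data)
    return total
```

With hypothetical piece names, a call would look like `merge_files(["piece01.tr", "piece02.tr"], "jpn.fontname.exp0.tr")`, and the same function can be used for the .box files.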
Everything goes fine until I need to run mftraining. The command: mftraining -F font_properties -U unicharset -O jpn.unicharset jpn.fontname.exp0.tr causes "Error: Illegal short name for a feature!" I tried this step with many different training images, and the error always appears, regardless of the size of the .tr file. I should mention that until now I was working on Mac OS X Lion. I tried the whole thing again on my Windows system, where I also get the "Assertion failed" error but not "Error: Illegal short name for a feature!"; instead, mftraining just crashes when the .tr file is too big (in my case 45 MB). It crashes in SetUpForFloat2Int(unicharset_training, ClassList). I also noticed that the Microfeat file is written before the crash: when this file is below 32 MB the program continues normally, but when it is bigger than 32 MB the program crashes. It looks like a variable that is too small. Is this bug known? Have others encountered similar problems with large character sets? Is there actually an official character limit? I saw that the original jpn.tessdata can recognize about 4000 different characters; are 7000 too many?
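To check whether a given Microfeat or .tr file is past the size at which I saw the crash, a small sketch like the following can help. The 32 MB threshold is only my observation, not a documented Tesseract limit, and I am assuming it means 32 × 1024 × 1024 bytes; the function name is hypothetical.

```python
import os

# Observed crash threshold from my runs, NOT a documented Tesseract limit.
# Assumes "32 MB" means 32 * 1024 * 1024 bytes.
OBSERVED_LIMIT_BYTES = 32 * 1024 * 1024


def exceeds_observed_limit(path, limit=OBSERVED_LIMIT_BYTES):
    """Return True if the file at path is larger than limit bytes."""
    return os.path.getsize(path) > limit
```

For example, `exceeds_observed_limit("Microfeat")` reports whether the file written by mftraining is larger than the size at which the crash appeared for me.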

