Thanks for the information. My original process did in fact follow the wiki instructions closely. The example given above was only meant to illustrate the problem I faced. The original process I followed looks like this.

Training text (chi.ming.exp0.txt):
- taken from http://ash.jp/code/cn/big5tbl.htm
- removed the row and column headers and the symbols not needed
- added common punctuation
- joined lines
- repeated multiple times
- converted to UTF-8 without BOM using Notepad++
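The text preparation steps above (join lines, repeat, save as UTF-8 without BOM) could also be scripted instead of done by hand in Notepad++. The following is only a minimal sketch under assumptions: the function names, the repeat count, and the choice to separate repeats with a newline are hypothetical and not part of the original process.

```python
def build_training_text(raw_lines, repeat=3):
    """Join cleaned lines into one string and repeat it `repeat` times.

    Assumption: blank lines are dropped, lines are joined with no separator,
    and each repeated copy goes on its own line.
    """
    cleaned = [line.strip() for line in raw_lines if line.strip()]
    joined = "".join(cleaned)
    return "\n".join([joined] * repeat)


def write_utf8_no_bom(path, text):
    """Write `text` as UTF-8 without a byte order mark."""
    # Python's "utf-8" codec never writes a BOM ("utf-8-sig" would add one).
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```

For example, `write_utf8_no_bom("chi.ming.exp0.txt", build_training_text(lines))` would produce the training text file, where `lines` holds the cleaned table rows.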
Training TIF and Box generation (chi.ming.exp0.tif and chi.ming.exp0.box):
- use jTessBoxEditor
- "ming" font, regular, 24pt

Training process:
- use Tesseract-OCR 3.02 portable version for Windows
- command: ..\Tesseract-OCR\tesseract chi.ming.exp0.tif chi.ming.exp0 batch.nochop box.train

Output:
- long list of messages
- a partial list is attached in "partial messages from page 1 of 9.txt"

Files:
- chi.ming.exp0.txt [https://docs.google.com/file/d/0Bz99K1Qj2HQ_TkdUNmJYTDF1V00/edit]
- chi.ming.exp0.tif [https://docs.google.com/file/d/0Bz99K1Qj2HQ_SVZ3QlpDczRLVW8/edit]
- chi.ming.exp0.box [https://docs.google.com/file/d/0Bz99K1Qj2HQ_RnBWejJWUVdFUGc/edit]
- partial messages from page 1 of 10.txt [https://docs.google.com/file/d/0Bz99K1Qj2HQ_Z0gwVmY2OFJtTkk/edit]

Thanks a lot.

Regards,
W. K. Lo

On Sunday, February 24, 2013 4:37:09 AM UTC+8, zdenop wrote:
> Your input image do not follow training wiki[1] so your result is failure
> (yes, you can fail to train tesseract even you follow wiki ;-), but if you
> do not follow it, you can be sure you will fail especially if you have no
> experience with tesseract training)
>
> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images
>
> Zdenko
>
> On Tue, Feb 19, 2013 at 7:29 AM, W. K. LO <[email protected]> wrote:
>
>> I have problem using tesseract in training using character image.
>> Examples of the problem is described as follows.
>> Box and Tif files are attached.
>> Box: https://docs.google.com/file/d/0Bz99K1Qj2HQ_dkZKUW5RdDU1Tk0/edit
>> Tif: https://docs.google.com/file/d/0Bz99K1Qj2HQ_WkJqOHI0OHU3Nnc/edit
>>
>> Case 1:
>> ===command===
>> tesseract test.ming.24.tif test.ming.24 batch.nochop box.train
>>
>> ===output message===
>> Tesseract Open Source OCR Engine v3.02 with Leptonica
>> Empty page!!
>> Empty page!!
>>
>> Case 2: Telling Tesseract there is only one single character
>> ===command===
>> .\tesseract test.ming.24.tif test.ming.24 -psm 10 batch.nochop box.train
>>
>> ===output message===
>> Tesseract Open Source OCR Engine v3.02 with Leptonica
>> Bounding box=(16,23)->(28,32)
>> Bounding box=(16,15)->(28,24)
>> APPLY_BOXES: boxfile line 0/??((8,14),(36,41)): FAILURE! Couldn't find a matching blob
>> APPLY_BOXES:
>> Boxes read from boxfile: 1
>> Boxes failed resegmentation: 1
>> APPLY_BOXES: Unlabelled word at :Bounding box=(16,15)->(28,32)
>> APPLY_BOXES: Unlabelled word at :Bounding box=(8,14)->(36,41)
>> Found 0 good blobs.
>> 2 remaining unlabelled words deleted.
>> Generated training data for 0 words
>>
>> Any options needed to be specified to make it work?
>>
>> Thanks a lot.
>>
>> Regards,
>> W. K. Lo
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to [email protected]
>> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
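The APPLY_BOXES messages quoted above compare each box-file entry, such as ((8,14),(36,41)), with the blobs Tesseract finds in the image. Before running box.train it may help to sanity-check the box file itself. The sketch below assumes the usual Tesseract 3.x box line layout of "symbol left bottom right top page"; the function names are hypothetical, and the image width and height must be supplied by the caller. It only flags boxes that are empty or fall outside the image, so it is not a diagnosis of the failure reported here.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines) -> List[Box]:
    """Parse box-file lines, skipping any line that does not have 6 fields."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # not a "symbol left bottom right top page" line
        left, bottom, right, top, page = (int(p) for p in parts[1:])
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def suspicious_boxes(boxes, width, height) -> List[Box]:
    """Return boxes that are empty, inverted, or outside a width x height image."""
    bad = []
    for b in boxes:
        inside = (0 <= b.left < b.right <= width
                  and 0 <= b.bottom < b.top <= height)
        if not inside:
            bad.append(b)
    return bad
```

Reading the box file with `open(path, encoding="utf-8")` and passing the lines to `parse_box_lines` gives a list that can be checked against the TIF dimensions.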

