Dear all,

I would like to know whether there are options for controlling the segmentation process during OCR. I am experimenting with Chinese OCR and find that the segmentation of a character is affected by its neighbouring characters. I would like to try adjusting parameters/options to control this process. An example follows:

test01
======
test01.tif [https://docs.google.com/file/d/0Bz99K1Qj2HQ_anJKQXN4RTlmLXc/edit]
command: tesseract test01.tif test01 -l chi makebox
result: the 4th, 5th and 8th characters are broken apart
test01.box [https://docs.google.com/file/d/0Bz99K1Qj2HQ_aFQ2ekpVMy0wTWM/edit]
screenshot of test01 segmentation [https://docs.google.com/file/d/0Bz99K1Qj2HQ_UmlUVFNLT0paZjA/edit]

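For comparing runs without reading screenshots, broken-apart characters can be spotted roughly from the .box file itself. The sketch below is only an illustration, not part of tesseract: it assumes each box line has the form "glyph left bottom right top page", and the helper names parse_box_lines and flag_narrow_boxes, as well as the 0.6 width ratio, are made up for this example. It flags boxes that are much narrower than the median box, which is what a Chinese character split into left and right halves tends to produce. Punctuation marks will be flagged too, so the result needs a manual check.

```python
from statistics import median


def parse_box_lines(lines):
    """Parse box-file lines of the assumed form 'glyph left bottom right top page'."""
    boxes = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so the glyph field is kept whole.
        parts = line.rsplit(None, 5)
        if len(parts) != 6:
            continue  # skip lines that do not match the assumed layout
        glyph, left, bottom, right, top, _page = parts
        boxes.append((glyph, int(left), int(bottom), int(right), int(top)))
    return boxes


def flag_narrow_boxes(boxes, ratio=0.6):
    """Return 1-based positions of boxes narrower than ratio * median width.

    The ratio is an arbitrary starting point, not a tesseract setting.
    """
    if not boxes:
        return []
    typical = median(right - left for _, left, _, right, _ in boxes)
    return [
        position
        for position, (_, left, _, right, _) in enumerate(boxes, start=1)
        if (right - left) < ratio * typical
    ]
```

To try it on a real file, read the lines with open("test01.box", encoding="utf-8") and pass them to parse_box_lines, then pass the result to flag_narrow_boxes. The positions returned count boxes, not original characters, so one split character shows up as two adjacent narrow boxes.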
test02
======
test02.tif [https://docs.google.com/file/d/0Bz99K1Qj2HQ_NVF6ZzhvSXBQZnc/edit]
edit: removed the first 3 characters of test01.tif
command: tesseract test02.tif test02 -l chi makebox
result: all characters are correctly segmented (only a punctuation mark is mixed up)
test02.box [https://docs.google.com/file/d/0Bz99K1Qj2HQ_N04zM1V2T2xvNWs/edit]
screenshot of test02 segmentation [https://docs.google.com/file/d/0Bz99K1Qj2HQ_QXNKcGNzU3NxMDg/edit]

test03
======
test03.tif [https://docs.google.com/file/d/0Bz99K1Qj2HQ_cjZCbVE3ZVNWOUU/edit]
edit: replaced the second-to-last character of test02.tif
command: tesseract test03.tif test03 -l chi makebox
result: the 1st, 2nd and 5th characters are broken apart
test03.box [https://docs.google.com/file/d/0Bz99K1Qj2HQ_TVRzMXpmTDlwR00/edit]
screenshot of test03 segmentation [https://docs.google.com/file/d/0Bz99K1Qj2HQ_UDlyOGIxU091SnM/edit]

It seems that the particular combination of characters in test02 happens to suit tesseract's default settings. Are there parameters/options that I can adjust to control the segmentation process?

Thanks.

Regards,
W. K. Lo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

