Dear all,
 
I would like to know whether there are any options for controlling the 
segmentation process during OCR.
 
I am experimenting with Chinese OCR and find that the segmentation of a 
character is affected by its neighbouring characters. I would like to adjust 
the relevant parameters/options to control this process. An example is given 
as follows:
 
test01
======
test01.tif 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_anJKQXN4RTlmLXc/edit]
command: tesseract test01.tif test01 -l chi makebox
result: 4th, 5th, 8th characters are broken apart
test01.box 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_aFQ2ekpVMy0wTWM/edit]
screenshot of test01 segmentation 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_UmlUVFNLT0paZjA/edit]

test02
======
test02.tif 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_NVF6ZzhvSXBQZnc/edit]
edit: remove first 3 characters of test01.tif
command: tesseract test02.tif test02 -l chi makebox
result: all characters are correctly segmented (only mixed up a punctuation 
mark)
test02.box 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_N04zM1V2T2xvNWs/edit]
screenshot of test02 segmentation 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_QXNKcGNzU3NxMDg/edit]

test03
======
test03.tif 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_cjZCbVE3ZVNWOUU/edit]
edit: replace the 2nd last character of test02.tif
command: tesseract test03.tif test03 -l chi makebox
result: 1st, 2nd and 5th characters are broken apart
test03.box 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_TVRzMXpmTDlwR00/edit]
screenshot of test03 segmentation 
[https://docs.google.com/file/d/0Bz99K1Qj2HQ_UDlyOGIxU091SnM/edit]
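
To compare the three runs without inspecting the screenshots by eye, the box 
files can be checked for suspiciously narrow boxes. The following is only a 
rough sketch: it assumes the usual box-file line layout 
"<symbol> <left> <bottom> <right> <top> <page>", and the function names and 
the 0.6 width ratio are my own arbitrary choices, not anything tesseract 
provides. Narrow punctuation marks will be flagged as well, so the output 
still needs a manual look.

```python
from statistics import median


def parse_box_lines(lines):
    """Parse box-file lines into (symbol, left, bottom, right, top) tuples.

    Assumes each line looks like: <symbol> <left> <bottom> <right> <top> <page>
    """
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append((symbol, left, bottom, right, top))
    return boxes


def flag_narrow_boxes(boxes, ratio=0.6):
    """Return indices of boxes narrower than ratio * median box width.

    The ratio is a hypothetical threshold; a full-width Chinese character
    that was split in two tends to produce boxes well below the median.
    """
    widths = [right - left for _, left, _, right, _ in boxes]
    if not widths:
        return []
    typical = median(widths)
    return [i for i, w in enumerate(widths) if w < ratio * typical]
```

For example, the lines of test01.box could be read with 
open("test01.box", encoding="utf-8") and passed to parse_box_lines(); the 
indices returned by flag_narrow_boxes() are 0-based positions in the file.
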

It seems that the character combination in test02 happens to suit tesseract's 
default settings. Are there parameters/options that I can adjust to control 
the segmentation process?
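
One option that may be worth trying is the page segmentation mode, set with 
"-psm" in tesseract 3.x (newer releases spell it "--psm"). Mode 7 treats the 
image as a single text line and mode 6 as a single uniform block of text. 
Whether this changes the character-level splitting in the images above is 
untested; the sketch below only builds the command lines so that several 
modes can be compared. The helper name and its arguments are hypothetical.

```python
def build_makebox_command(image, output_base, lang="chi", psm=None,
                          legacy_flag=True):
    """Build a tesseract makebox command as an argument list.

    psm: page segmentation mode (e.g. 6 or 7), or None for the default.
    legacy_flag: True uses "-psm" (tesseract 3.x), False uses "--psm".
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        cmd += ["-psm" if legacy_flag else "--psm", str(psm)]
    cmd.append("makebox")
    return cmd


def commands_for_modes(image, stem, modes=(6, 7)):
    """Return one command per mode, writing to <stem>_psm<mode>.box."""
    return [
        build_makebox_command(image, "%s_psm%d" % (stem, mode), psm=mode)
        for mode in modes
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True), and the 
resulting box files compared against the default run.
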
 
 
Thanks.
 
Regards,
W. K. Lo

 

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
