I am creating an automatic trainer for Tesseract. However, Tesseract has 
trouble with tall, thin characters when they are placed at the end of a 
word.

For instance, given "XLSAjasLi", Tesseract fails on the "i" at the end. I am 
using Tesseract 3.02.

The following are the box coordinates for the error:
2 73 29876 101 29926 0
g 105 29876 133 29926 0
K 137 29876 167 29926 0
8 171 29876 199 29926 0
f 203 29876 225 29926 0
s 229 29876 257 29926 0
K 261 29876 291 29926 0
5 295 29876 323 29926 0
l 327 29876 344 29926 0
I 348 29876 365 29926 0 -- Error
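
For anyone wanting to check boxes like these programmatically, here is a 
minimal sketch that parses the lines above and flags unusually narrow boxes. 
It assumes the usual Tesseract 3.x box layout of "glyph left bottom right top 
page". The names Box, parse_box_line and narrow_boxes, and the 0.4 
width-to-height cutoff, are made up for illustration and are not part of 
Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One box-file entry (hypothetical helper type, not a Tesseract API)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.top - self.bottom


def parse_box_line(line: str) -> Box:
    # Assumed layout: glyph left bottom right top page.
    # Anything after the sixth field (e.g. a "-- Error" note) is ignored.
    glyph, *nums = line.split()[:6]
    left, bottom, right, top, page = (int(n) for n in nums)
    return Box(glyph, left, bottom, right, top, page)


def narrow_boxes(boxes: List[Box], max_ratio: float = 0.4) -> List[Box]:
    # Flag boxes whose width is a small fraction of their height.
    # The 0.4 cutoff is an arbitrary illustrative value.
    return [b for b in boxes if b.height > 0 and b.width / b.height < max_ratio]


SAMPLE = """\
2 73 29876 101 29926 0
g 105 29876 133 29926 0
K 137 29876 167 29926 0
8 171 29876 199 29926 0
f 203 29876 225 29926 0
s 229 29876 257 29926 0
K 261 29876 291 29926 0
5 295 29876 323 29926 0
l 327 29876 344 29926 0
I 348 29876 365 29926 0 -- Error
"""

boxes = [parse_box_line(line) for line in SAMPLE.splitlines() if line.strip()]
for b in narrow_boxes(boxes):
    print(b.glyph, b.width, b.height)
```

On the boxes above this flags only the final "l" and "I", which are 17 
pixels wide against 50 pixels tall.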

I've also attached a crop of the multipage TIFF file that was created. 
Note: the rectangular boxes are not in the original images; they were 
added to debug the image coordinates. I did not train Tesseract on the 
image with the rectangular boxes.
The multipage TIFF consists of two pages, each around 36000x449 pixels.

The specific error from tesseract training command: 
./tesseract-install/bin/tesseract OCR_Trainer_Output/tests/TestLargeImageBW.tiff test.arial.exp0 nobatch box.train
FAIL!
APPLY_BOXES: boxfile line 349/I ((348,29876),(365,29926)): FAILURE! Couldn't find a matching blob
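
Since there are many of these, a small script can collect the failing glyphs 
from the training output. This is only a sketch: the pattern is inferred from 
the single APPLY_BOXES message quoted above and assumes each message is 
printed on one line. The names FAILURE_RE and parse_failures are hypothetical.

```python
import re
from typing import List, Tuple

# Pattern inferred from the one message quoted above; other Tesseract
# versions may word it differently.
FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)


def parse_failures(log: str) -> List[Tuple[int, str, Tuple[int, int, int, int], str]]:
    """Return (line number, glyph, (left, bottom, right, top), reason) per failure."""
    failures = []
    for m in FAILURE_RE.finditer(log):
        line_no = int(m.group(1))
        glyph = m.group(2)
        box = tuple(int(m.group(i)) for i in range(3, 7))
        failures.append((line_no, glyph, box, m.group(7).strip()))
    return failures


log = (
    "FAIL!\n"
    "APPLY_BOXES: boxfile line 349/I ((348,29876),(365,29926)): "
    "FAILURE! Couldn't find a matching blob\n"
)

for line_no, glyph, box, reason in parse_failures(log):
    print(line_no, glyph, box, reason)
```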

This is just one of many such failures. Tesseract deterministically fails 
whenever "I", "i", "j", or "l" is at the very end of a text.

