Hello everyone,
My goal is to generate trained data for one person's handwriting (which 
looks consistent across all the documents I have from him) in order to OCR 
large amounts of image text (translated books written in his 
handwriting).

I followed the training process described in the Tesseract wiki. I used 
jTessBoxEditor to fix the box files generated by tesseract, and then 
bootstrapped: the new language created from just 1 image sample was used to 
generate the subsequent box files, with relative success (the number of 
mistaken characters in the box files seemed to be greatly reduced).
 - I used 4 TIFF image files of real text taken from notes (about 800 
characters per file)
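
For readers who want the workflow above spelled out, here is a minimal sketch of the command sequence as I understand it from the Tesseract 3.0x training instructions. It only builds and prints the commands; it does not run them. The names `hw`, `notes` and `hw0` are hypothetical placeholders (language code, font name, and the earlier language built from the first sample), and the exact flags may differ between Tesseract versions.

```python
def training_commands(lang, font, pages, bootstrap_lang=None):
    """Return Tesseract 3.0x-style training commands as argv lists.

    Nothing is executed. File names follow the lang.font.expN convention.
    """
    bases = [f"{lang}.{font}.exp{i}" for i in range(pages)]
    cmds = []

    # Step 1: draft box files. With bootstrap_lang set, an earlier
    # traineddata file is used so fewer boxes need manual correction.
    for base in bases:
        makebox = ["tesseract", f"{base}.tif", base]
        if bootstrap_lang:
            makebox += ["-l", bootstrap_lang]
        makebox += ["batch.nochop", "makebox"]
        cmds.append(makebox)

    # (The box files are corrected by hand at this point, e.g. in jTessBoxEditor.)

    # Step 2: produce one .tr feature file per corrected page.
    for base in bases:
        cmds.append(["tesseract", f"{base}.tif", base, "box.train"])

    boxes = [f"{b}.box" for b in bases]
    trs = [f"{b}.tr" for b in bases]

    # Step 3: character set, then shape and normalization prototypes.
    # Assumes a font_properties file already exists in the working directory.
    cmds.append(["unicharset_extractor"] + boxes)
    cmds.append(["mftraining", "-F", "font_properties", "-U", "unicharset",
                 "-O", f"{lang}.unicharset"] + trs)
    cmds.append(["cntraining"] + trs)

    # Step 4: the output files must be renamed with the "lang." prefix
    # before they are combined into lang.traineddata.
    cmds.append(["combine_tessdata", f"{lang}."])
    return cmds


if __name__ == "__main__":
    for argv in training_commands("hw", "notes", 4, bootstrap_lang="hw0"):
        print(" ".join(argv))
```

All pages are passed together to `unicharset_extractor`, `mftraining` and `cntraining`, so the 4 separate files end up in a single traineddata file.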

However, even after this very slow training process (correcting the box 
files was painfully slow), the results were poor, even when I tried to OCR 
the same images that were used for training. So, I have some questions:

1- Why is it that when I generate the box files with tesseract and check 
the results in jTessBoxEditor, the accuracy seems very good (just a few 
mistaken characters), yet when I use the same trained data to OCR the same 
image file, the accuracy of the resulting text seems very poor compared to 
the recognition done during box generation?
2- Is there a chance of getting better results by training with more 
samples? I have already trained with more than 3200 character samples, and 
I'm not sure whether training on more samples will increase recognition 
accuracy.
3- I trained the data with 4 individual TIFF files. Is there any chance of 
getting better results by using a multi-page TIFF file to train tesseract 
instead of several separate image, box and *.tr files?
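
To put a number on "very good" versus "very poor" in question 1, one option is a character error rate: the edit distance between a hand-checked transcription and the OCR output, divided by the length of the transcription. The following is only a sketch of that idea in plain Python; the function names are hypothetical and it is not part of Tesseract or jTessBoxEditor.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(truth, ocr):
    """Edit distance normalized by the length of the ground-truth text."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(truth, ocr) / len(truth)
```

Running this once on the text implied by the corrected box file and once on the normal OCR output of the same page would show how large the gap between the two actually is.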

Thanks in advance

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
