I remember that part of the training wiki... and I wondered how it would affect such a small subset of characters. I only have 10 different digits... what kind of text am I supposed to write in the sample files, considering that my only valid inputs are sequences of numbers? The samples also contain handwriting from many different people... should I separate the different handwriting styles into different sample files instead of merging them all together, i.e. treat them like different fonts for the same language? (That would be extremely limiting, though, considering the current limit of 64 fonts per language.)
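For reference, here is a minimal sketch of the box-file concatenation suggested in the quoted reply below. It assumes the Tesseract 3 box format, where each line is `symbol left bottom right top page` with a 0-based page number, and it assumes the single-page box files are supplied in the same order as the pages of the merged TIFF. The function name `merge_box_texts` and the sample data are hypothetical, for illustration only.

```python
def merge_box_texts(box_texts):
    """Concatenate single-page box file contents into one multi-page box file.

    box_texts: list of strings, one per single-page box file, ordered like
    the pages of the merged multi-page TIFF (assumption).
    """
    merged = []
    for page, text in enumerate(box_texts):
        for line in text.splitlines():
            line = line.rstrip()
            if not line:
                continue  # skip blank lines
            # Split from the right so only the five numeric fields are separated.
            symbol, left, bottom, right, top, _old_page = line.rsplit(" ", 5)
            # Rewrite the page field to match the page's position in the TIFF.
            merged.append(f"{symbol} {left} {bottom} {right} {top} {page}")
    return "\n".join(merged) + "\n"


# Hypothetical sample data: two single-page box files, both numbered page 0.
page_zero = "0 10 20 30 40 0\n0 50 20 70 40 0\n"
page_one = "1 10 20 30 40 0\n"
print(merge_box_texts([page_zero, page_one]), end="")
```

Only the last field of each line is rewritten, so the per-digit regexp corrections already made to the individual box files carry over unchanged.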
On Saturday, March 1, 2014 9:02:41 PM UTC+8, Quan Nguyen wrote:
>
> I would go by what is suggested by the training
> wiki<https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3>:
>
> *Don't make the mistake of grouping all the non-letters together. Make the
> text more realistic.*
>
> I think you can improve the result a little bit by merging your images
> into a multi-page TIFF and concatenating your box files (make sure the page
> numbers are correct). However, that still does not meet the suggestion
> stated above.
>
> On Friday, February 28, 2014 10:20:11 PM UTC-6, Frederico Ferro Schuh
> wrote:
>>
>> Do you think training one character per file is affecting my results?
>>
>> I was doing it because I have thousands of samples, and makebox always
>> makes too many wrong guesses. If I have all the digits on the same image,
>> fixing the resulting 10k chars box file manually would take forever. On the
>> other hand, fixing a single digit box file only takes a simple regexp
>> replace operation on the resulting box file (one replace for digit 1,
>> another replace for digit 2, and so on).
>>
>> Also, the goal of my application is for online OCR, to recognize single
>> lines of handwritten digits as the user draws them. Would this affect the
>> format of my sample image(s) as well?
>>
>> Thanks,
>> Fred
>>
>>
>> On Friday, February 28, 2014 10:58:05 PM UTC+8, Quan Nguyen wrote:
>>>
>>> I'm not sure having only samples of one character in a file is a good
>>> idea. I normally train with all the characters in the same image(s).
>>>
>>> Check
>>> http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-2.01.eng.tar.gz
>>> for an example.
>>>
>>> On Tuesday, February 25, 2014 10:51:39 AM UTC-6, Frederico Ferro Schuh
>>> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'm training Tesseract to recognize handwritten digits, and I have
>>>> provided it about 6000 samples of each digit, in 10 different box files,
>>>> one for each digit. Each box file is a 2152x2152 TIF file. However, the
>>>> resulting traineddata file I get after completing the training procedure
>>>> is only 137 kb.
>>>> I went through the process again, providing smaller sample files (1000
>>>> samples of each digit), and ended up with the same traineddata size of 137
>>>> kb.
>>>> Is this size reasonable or am I doing something wrong?
>>>> I assume something is wrong because my results are pretty bad so far.
>>>>
>>>> I've attached the sample image I am using for the digit 0.
>>>>
>>>> Thanks in advance,
>>>> Fred
>>>>
>>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

