For your purposes, a simple approach will give the best results. Repeating letters is recommended because tesseract neither trains nor reads well from small samples, owing to its approximation/heuristic methods. As tesseract processes an image it improves upon itself and then takes a second pass. Those benefits are lost as soon as you start scanning a separate example.

I have had the best results by making more than one scan. How? By repeating the same image on subsequent pages of the same TIFF, and then looking only at the data for the last page.
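A minimal sketch of the repeated-page idea, under some assumptions of mine: the multi-page TIFF is built with the third-party Pillow library, and the "last page data" is read from box-format output, where each line is `glyph left bottom right top page`. The function names `repeat_pages` and `last_page_boxes` are hypothetical, and the message above does not say which tool or output format was actually used.

```python
def repeat_pages(src_path, dst_path, copies=3):
    """Write a TIFF holding `copies` identical pages of the image at src_path.

    Assumption: Pillow is installed. It is imported lazily so the parsing
    helper below still works without it.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        page = img.copy()
    extra = [page.copy() for _ in range(copies - 1)]
    page.save(dst_path, format="TIFF", save_all=True, append_images=extra)


def last_page_boxes(box_text):
    """Return (glyph, left, bottom, right, top) tuples for the last page only.

    Assumes box-file lines of the form: glyph left bottom right top page
    """
    rows = []
    for line in box_text.splitlines():
        # Split from the right so the glyph field is kept whole.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        glyph, *coords, page = parts
        try:
            rows.append((int(page), glyph, *map(int, coords)))
        except ValueError:
            continue  # skip lines with non-numeric fields
    if not rows:
        return []
    last = max(row[0] for row in rows)
    return [row[1:] for row in rows if row[0] == last]
```

With something like this, one would call `repeat_pages("sample.tif", "sample_x3.tif")`, run tesseract on the result, and pass the box text to `last_page_boxes` so that only the final pass over the image is kept. The file names here are placeholders.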

On Mon, Oct 22, 2012 at 12:06 AM, Adam Chapam <[email protected]> wrote:

> @ Andres
> I am afraid i do not know the answer to your question, having only looked
> into the internals of tesseract since last week. My followup email was
> purely based on an afternoon of unscientific trial and error, but i am
> interested enough to do further research and will post anything useful that
> i find.
>
> @ Nick
>
> I am sure more windows based tools can only be a good thing. I wrote mine
> from scratch as a learning process as much as anything, and also so i can
> easily compare training results
> (generate text > render > train > do OCR > compare output.txt to generated).
> If i get the time i will clean it up and comment the source, so it can be
> released for others.
>
> I imagine the increasing demand for windows based tools is in part due to
> the success of the various .net wrappers that make integrating tesseract
> so trivial.
>
> As a side project i will work on my text generation algorithm to produce
> more realistic text (capitals at the start of sentences, punctuation etc)
>
> Your point about monospace font is interesting. In order to avoid bounding
> box overlaps, i am artificially creating mono spaced output regardless of
> font. I wonder if relative spacing would be better.
>
> On Sunday, 21 October 2012 10:55:51 UTC+1, Nick White wrote:
>>
>> Hi Adam,
>>
>> Thanks for writing with so much detail. Was interesting to read.
>>
>> On Fri, Oct 19, 2012 at 02:22:44AM -0700, Adam Chapam wrote:
>> > I can follow the training wiki and produce working traineddata files, and have
>> > written a .net app to automate creating tif/box pairs from a font file, (i know
>> > there are plenty of other tools out there, but i have no desire to boot into
>> > linux or learn python just for this)
>>
>> OK. Someday I'll get my C program cross-compiling with Windows, and
>> then it will be usable there too.
>>
>> > The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would be a
>> > terrible training text, and i presume this is because it needs to learn
>> > baseline metrics and other such things
>>
>> Yes, I think the metrics etc are the main reason for having a
>> 'realistic' training image. In your case going for semi-random
>> strings of the type you expect to see (as you explained in your
>> followup email) sounds like a sensible solution, and I can't see any
>> potential issues with it.
>>
>> > The other thing that confused me was the need to have x many representations of
>> > a character in the training text. If using scanned images
>> > with inevitable small variances between the same characters, that makes sense,
>> > but using digitally rendered tiffs, they will all be exactly the same, so what
>> > benefit is there of repeating a character? Is the frequency used to determine
>> > between similar characters later on, eg :
>> > This letter could be an O or a D. The letter D occurred 20 times in training,
>> > but O only appeared 7 times, so therefore D is the most likely outcome?
>>
>> As far as I'm aware the character frequency isn't used this way. I
>> actually think it would be interesting to be able to specify how
>> common a character is generally, but I don't think frequency in a
>> training text would be a sensible way to specify it.
>>
>> As for the need to have multiple representations of a character, you
>> are right that you gain less from this when using straight digitally
>> generated characters. There is probably still some benefit to be had
>> in using several samples, to get more accurate metrics for its
>> position relative to the baseline and other letters. Less relevant
>> for a monospace font, though.
>>
>> Hopefully I've answered all your questions somewhat. Let me know if
>> I missed anything.
>>
>> Nick
>>
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

