@ Andres I am afraid I do not know the answer to your question, having only looked into the internals of Tesseract since last week. My follow-up email was purely based on an afternoon of unscientific trial and error, but I am interested enough to do further research and will post anything useful that I find.
@ Nick I am sure more Windows-based tools can only be a good thing. I wrote mine from scratch as a learning process as much as anything, and also so I can easily compare training results (generate text > render > train > do OCR > compare output.txt to the generated text). If I get the time I will clean it up and comment the source so it can be released for others. I imagine the increasing demand for Windows-based tools is in part due to the success of the various .NET wrappers that make integrating Tesseract so trivial.

As a side project I will work on my text generation algorithm to produce more realistic text (capitals at the start of sentences, punctuation, etc.).

Your point about monospace fonts is interesting. In order to avoid bounding box overlaps I am artificially creating monospaced output regardless of the font, and I wonder if relative spacing would be better. (Rough sketches of the render/box step and of the output comparison are at the end of this message, below Nick's quoted mail.)

On Sunday, 21 October 2012 10:55:51 UTC+1, Nick White wrote:
>
> Hi Adam,
>
> Thanks for writing with so much detail. It was interesting to read.
>
> On Fri, Oct 19, 2012 at 02:22:44AM -0700, Adam Chapam wrote:
> > I can follow the training wiki and produce working traineddata files, and have
> > written a .net app to automate creating tif/box pairs from a font file (I know
> > there are plenty of other tools out there, but I have no desire to boot into
> > Linux or learn Python just for this).
>
> OK. Someday I'll get my C program cross-compiling with Windows, and
> then it will be usable there too.
>
> > The training wiki suggests that abcdefghijklmnopqrstuvwxyz1234567890 would be a
> > terrible training text, and I presume this is because it needs to learn
> > baseline metrics and other such things.
>
> Yes, I think the metrics etc. are the main reason for having a
> 'realistic' training image. In your case going for semi-random
> strings of the type you expect to see (as you explained in your
> followup email) sounds like a sensible solution, and I can't see any
> potential issues with it.
>
> > The other thing that confused me was the need to have x many representations of
> > a character in the training text. If using scanned images
> > with inevitable small variances between the same characters, that makes sense,
> > but using digitally rendered tiffs, they will all be exactly the same, so what
> > benefit is there of repeating a character? Is the frequency used to determine
> > between similar characters later on, e.g.:
> > This letter could be an O or a D. The letter D occurred 20 times in training,
> > but O only appeared 7 times, so therefore D is the most likely outcome?
>
> As far as I'm aware the character frequency isn't used this way. I
> actually think it would be interesting to be able to specify how
> common a character is generally, but I don't think frequency in a
> training text would be a sensible way to specify it.
>
> As for the need to have multiple representations of a character, you
> are right that you gain less from this when using straight digitally
> generated characters. There is probably still some benefit to be had
> in using several samples, to get more accurate metrics for its
> position relative to the baseline and other letters. Less relevant
> for a monospace font, though.
>
> Hopefully I've answered all your questions somewhat. Let me know if
> I missed anything.
>
> Nick
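For anyone curious, the render/box step in my app boils down to something like the C# sketch below. This is not the actual source, just a simplified illustration: the font name, sizes, training text, and file names are placeholders, and MeasureString pads its results, so real bounding boxes would need tightening. It also shows the artificial fixed advance I mentioned, with the relative-spacing alternative left as a comment.

    using System;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using System.Text;

    class TiffBoxSketch
    {
        // Render one line of training text to a TIFF and write the matching .box file.
        // Tesseract box format: one line per glyph,
        //   <glyph> <left> <bottom> <right> <top> <page>
        // with coordinates measured from the bottom-left corner of the image.
        static void Main()
        {
            string text = "The quick brown fox 0123456789";   // placeholder training text
            using (var font = new Font("Arial", 32))           // placeholder font
            using (var bmp = new Bitmap(2000, 120))
            using (var g = Graphics.FromImage(bmp))
            {
                g.Clear(Color.White);
                // No anti-aliasing, so the glyphs stay sharp black-on-white.
                g.TextRenderingHint = System.Drawing.Text.TextRenderingHint.SingleBitPerPixelGridFit;

                float cell = 40f;        // fixed advance = artificial monospacing, prevents box overlap
                float x = 10f, y = 20f;
                var boxes = new StringBuilder();

                foreach (char c in text)
                {
                    if (c != ' ')
                    {
                        g.DrawString(c.ToString(), font, Brushes.Black, x, y);
                        SizeF size = g.MeasureString(c.ToString(), font);

                        // Flip to box coordinates (origin bottom-left, not top-left).
                        int left = (int)x;
                        int right = (int)(x + size.Width);
                        int top = bmp.Height - (int)y;
                        int bottom = bmp.Height - (int)(y + size.Height);
                        boxes.AppendLine(string.Format("{0} {1} {2} {3} {4} 0",
                                                       c, left, bottom, right, top));
                    }
                    x += cell;
                    // Relative spacing would instead be something like:
                    // x += g.MeasureString(c.ToString(), font).Width;
                }

                bmp.Save("eng.myfont.exp0.tif", ImageFormat.Tiff);
                File.WriteAllText("eng.myfont.exp0.box", boxes.ToString());
            }
        }
    }

The resulting tif/box pair then goes through the usual run from the training wiki (tesseract ... box.train, unicharset_extractor, mftraining, cntraining, combine_tessdata).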
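The comparison step is nothing clever either; roughly this (again only a sketch, not the real code; generated.txt stands for whatever file holds the text the image was rendered from, and a proper measure would use an edit distance rather than position-by-position matching, which falls apart as soon as a character is dropped or inserted):

    using System;
    using System.IO;

    class AccuracySketch
    {
        // Crude per-character accuracy: compare Tesseract's output.txt against the
        // text the image was generated from. Whitespace is stripped first so that
        // line-wrapping differences don't count as errors.
        static void Main()
        {
            string expected = Strip(File.ReadAllText("generated.txt"));
            string actual = Strip(File.ReadAllText("output.txt"));

            int len = Math.Min(expected.Length, actual.Length);
            int matches = 0;
            for (int i = 0; i < len; i++)
                if (expected[i] == actual[i]) matches++;

            double accuracy = 100.0 * matches / Math.Max(expected.Length, 1);
            Console.WriteLine("Character accuracy: {0:F1}% ({1}/{2})",
                              accuracy, matches, expected.Length);
        }

        static string Strip(string s)
        {
            return string.Concat(s.Split(new[] { ' ', '\t', '\r', '\n' },
                                         StringSplitOptions.RemoveEmptyEntries));
        }
    }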

