This would be very helpful for us too.

Max

On Apr 8, 2011, at 3:33 AM, Mike Sandford wrote:

> I don't know if it's strictly necessary for my application, but I am
> trying to analyze anywhere from a few characters up to a few lines of
> text rapidly. Tesseract is a portion of my application pipeline.
> I've got my own document layout engine since there's a lot of really
> specialized, mostly useless domain knowledge.
>
> OCR is currently taking up over half the total analysis time. I
> managed to reduce it from about 60% to about 20% (2 sec per document
> to 0.8 sec per document) by using multiprocessing: I launch several
> jobs from the command line in parallel. That's roughly a 4x speedup
> on a quad-core, so that's good. But I'm still interested in pushing
> further. In the ideal world I'd have four tesseract daemons on all
> the time, and when I need OCR done I pipe a filename in (or perhaps
> the image data) and get a string out. Or something like that.
>
> My thought is that it takes a certain amount of time to load up the
> binary and the training data and get organized in memory. Right now
> this whole process happens every time I need to process a file,
> perhaps 10-20 times per document. That could be a substantial amount
> of overhead. I fed tesseract a 1x1 white tiff 10 times and it took
> between 30ms and 44ms to load and tell me that there was no output.
> Let's just assume for a moment that those numbers aren't totally
> bogus; that means out of 0.8 sec per document I'm spending 10 x (30ms
> to 40ms) = 300ms to 400ms of time just loading up the binary. That
> could be half of my total document processing time.
>
> I haven't gone looking at the guts to try and figure out if this is
> possible yet. I was hoping to get some feedback as to how dumb (or
> perhaps not!) of an idea this is before I really launched into it. So
> what does everyone think? Would this be helpful to anyone else? Does
> tesseract's architecture lend itself to staying in RAM for an extended
> period of time, for multiple images?
>
> I do realize that I could potentially just write out a single image
> with all the regions of interest contained within it, but my guess is
> that tesseract does some learning about what the font is as it
> processes characters. And since each document might have different
> fonts, font sizes, etc., I think that may be more harmful than
> beneficial.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
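
For anyone who wants to repeat the startup-overhead estimate from the quoted message, here is a minimal Python sketch. It is not taken from Mike's pipeline; the function names (time_invocations, startup_share, run_parallel) and DEMO_CMD are hypothetical, and DEMO_CMD is only a stand-in so the sketch runs on a machine without Tesseract. To measure Tesseract itself you would pass something like ["tesseract", "blank.tif", "out"], where blank.tif is an assumed 1x1 white image.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in command (an assumption, not Tesseract): starts a process that
# does nothing, so only launch overhead is measured.
DEMO_CMD = [sys.executable, "-c", "pass"]


def time_invocations(cmd, runs=10):
    """Launch `cmd` `runs` times in sequence; return wall-clock seconds per launch."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def startup_share(per_call_s, calls_per_doc, doc_total_s):
    """Fraction of per-document time spent just starting the process."""
    return per_call_s * calls_per_doc / doc_total_s


def run_parallel(cmds, workers=4):
    """Run each command as its own process, `workers` at a time.

    Threads are enough here because each one only waits on a child process.
    Returns the exit codes in the same order as `cmds`.
    """
    def run(cmd):
        done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return done.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, cmds))
```

With the figures from the message, startup_share(0.030, 10, 0.8) and startup_share(0.040, 10, 0.8) come out at roughly 0.375 and 0.5, which matches the "could be half" estimate. Note that run_parallel still pays the startup cost on every call; it only spreads that cost across cores, which is why a long-lived worker process is the more interesting question.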

