I don't know if it's strictly necessary for my application, but I am trying to analyze anywhere from a few characters up to a few lines of text rapidly. Tesseract is one stage of my application pipeline. I've got my own document layout engine, since there's a lot of really specialized, mostly useless domain knowledge involved.
OCR is currently taking up over half the total analysis time. I managed to cut it by about 60% (2sec per document to 0.8sec per document) using multiprocessing: I launch several jobs from the command line in parallel. That's about a 2.5x speedup on a quad-core, so that's good. But I'm still interested in pushing further.

In the ideal world I'd have four tesseract daemons running all the time, and when I need OCR done I'd pipe a filename in (or perhaps the image data) and get a string out. Or something like that. My thought is that it takes a certain amount of time to load up the binary and the training data and get organized in memory. Right now this whole process happens every time I need to process a file, perhaps 10-20 times per document. That could be a substantial amount of overhead.

I fed tesseract a 1x1 white tiff 10 times and it took between 30ms and 44ms each time to load and tell me that there was no output. Let's just assume for a moment that those numbers aren't totally bogus. At 10 invocations per document, that means out of 0.8sec per document I'm spending 10 x (30ms to 44ms) = 300ms to 440ms just loading up the binary, and more if it's closer to 20 invocations. That could be half of my total document processing time.

I haven't gone looking at the guts yet to try and figure out if this is possible. I was hoping to get some feedback as to how dumb (or perhaps not!) of an idea this is before I really launched into it. So what does everyone think? Would this be helpful to anyone else? Does tesseract's architecture lend itself to staying in RAM for an extended period of time, for multiple images?

I do realize that I could potentially just write out a single image with all the regions of interest contained within it, but my guess is that tesseract does some learning about what the font is as it processes characters. And since each document might have different fonts, font sizes, etc., I think that may be more harmful than beneficial.
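To make the idea a bit more concrete, here is a rough Python sketch of the two pieces: timing cold starts of a command, and keeping one worker process alive that takes a filename per line on stdin and returns a string per line on stdout. Everything named here (`time_cold_starts`, `OcrDaemon`, `WORKER_SOURCE`) is made up for illustration. The worker is only a stand-in that echoes a fake result; it does not call tesseract at all, and it assumes each result fits on a single line. Whether tesseract itself can be kept resident this way is exactly the open question.

```python
import subprocess
import sys
import time


def time_cold_starts(cmd, runs=10):
    """Launch cmd from scratch `runs` times; return per-run wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in worker (hypothetical): it would pay the expensive startup cost once,
# then answer one request per input line. A real worker would load the OCR
# engine where `engine` is set and run recognition inside the loop.
WORKER_SOURCE = r"""
import sys
engine = "ready"  # placeholder for the one-time load of binary + training data
for line in sys.stdin:
    name = line.rstrip("\n")
    sys.stdout.write("text-for:" + name + "\n")  # fake result, one line per request
    sys.stdout.flush()
"""


class OcrDaemon:
    """Keeps one child process alive and sends it one filename per request."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-u", "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on the parent side
        )

    def recognize(self, filename):
        # One line in, one line out; multi-line results would need framing.
        self.proc.stdin.write(filename + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin ends the worker's read loop so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait()
```

With something like this, four `OcrDaemon` instances could be started once and fed regions round-robin, and `time_cold_starts` run on the real command line would give a per-invocation figure to compare against the per-request time of a resident worker.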

