I don't know if it's strictly necessary for my application, but I am
trying to analyze anywhere from a few characters up to a few lines of
text rapidly.  Tesseract is a portion of my application pipeline.
I've got my own document layout engine since there's a lot of really
specialized, mostly useless domain knowledge.

OCR is currently taking up over half the total analysis time.  I
managed to reduce it from about 60% to about 20% (2sec per document
to 0.8sec per document) by using multiprocessing: I launch several
jobs from the command line in parallel.  That's roughly a 4x speedup on a
quad-core so that's good.  But I'm still interested in pushing
further.  In the ideal world I'd have four tesseract daemons on all
the time and when I need OCR done I pipe a filename in (or perhaps the
image data) and get a string out.  Or something like that.
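
To make the daemon idea concrete, here is a minimal sketch of the
"pipe a filename in, get a string out" pattern.  Everything in it is
an assumption for illustration: the LineWorker class, the
one-line-in/one-line-out protocol, and the STAND_IN child, which is a
trivial Python script standing in for an OCR worker that stays
resident.  It does not show an existing tesseract mode.

```python
import subprocess
import sys


class LineWorker:
    """Keep one child process alive and exchange one line per request."""

    def __init__(self, argv):
        # The child is started once; any expensive setup it does is
        # paid a single time rather than once per image.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on the parent side
        )

    def request(self, line):
        """Send one line (e.g. a filename) and wait for one line back."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        reply = self.proc.stdout.readline()
        if not reply:
            raise RuntimeError("worker exited unexpectedly")
        return reply.rstrip("\n")

    def close(self):
        """Closing stdin signals end of input; then wait for the child."""
        self.proc.stdin.close()
        self.proc.wait()


# Hypothetical stand-in for a resident OCR process: it answers every
# input line with a single output line and flushes immediately.
STAND_IN = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    print('text for ' + line.strip(), flush=True)\n",
]

if __name__ == "__main__":
    worker = LineWorker(STAND_IN)
    for name in ("region1.tif", "region2.tif"):  # hypothetical filenames
        print(worker.request(name))
    worker.close()
```

Four of these, one per core, would match the setup described above.
Since real OCR output can span several lines, an actual worker would
need some framing (escaped newlines or a length prefix) instead of the
bare one-line reply assumed here.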

My thought is that it takes a certain amount of time to load up the
binary and the training data and get organized in memory.  Right now
this whole process happens every time I need to process a file,
perhaps 10-20 times per document.  That could be a substantial amount
of overhead.  I fed tesseract a 1x1 white tiff 10 times and it took
between 30ms and 44ms to load and tell me that there was no output.
Let's just assume for a moment that those numbers aren't totally
bogus.  That means that out of 0.8sec per document I'm spending
10 x (30ms to 44ms) = 300ms to 440ms just loading up the binary.  That could
be half of my total document processing time.
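
The measurement and the arithmetic above can be reproduced with a
short script.  This is only a sketch: time_command and
overhead_fraction are names made up here, the file names in the
comment are hypothetical, and the timing includes process creation as
well as whatever tesseract does at startup.

```python
import subprocess
import sys
import time


def time_command(argv, runs=10):
    """Run argv `runs` times; return each wall-clock duration in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit status still counts as a run
        )
        timings.append(time.perf_counter() - start)
    return timings


def overhead_fraction(per_call_s, calls_per_doc, doc_total_s):
    """Share of per-document time spent on fixed per-call startup cost."""
    return per_call_s * calls_per_doc / doc_total_s


if __name__ == "__main__":
    # Replace with the real command, for example
    #   ["tesseract", "blank.tif", "out"]   (hypothetical file names)
    # The Python no-op below only keeps the sketch runnable anywhere.
    timings = time_command([sys.executable, "-c", "pass"], runs=10)
    print("min %.1f ms, max %.1f ms"
          % (min(timings) * 1000, max(timings) * 1000))

    # Figures from this message: 30ms to 44ms per call, 10 calls per
    # document, 0.8sec per document.
    for per_call in (0.030, 0.044):
        print(overhead_fraction(per_call, 10, 0.8))
```

With the figures from this message the fraction comes out at roughly
0.375 for 30ms and 0.55 for 44ms.  At 20 calls per document the
startup cost alone would use up the whole 0.8sec, so the 10-call case
is the one these numbers can describe.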

I haven't gone looking at the guts to try and figure out if this is
possible yet.  I was hoping to get some feedback as to how dumb (or
perhaps not!) of an idea this is before I really launched into it.  So
what does everyone think?  Would this be helpful to anyone else?  Does
tesseract's architecture lend itself to staying in RAM for an extended
period of time, for multiple images?

I do realize that I could potentially just write out a single image
with all the regions of interest contained within it, but my guess is
that tesseract does some learning about what the font is as it
processes characters.  And since each document might have different
fonts, font sizes, etc., I think that may be more harmful than
beneficial.
