This would be very helpful for us too.

Max

On Apr 8, 2011, at 3:33 AM, Mike Sandford wrote:

> I don't know if it's strictly necessary for my application, but I am
> trying to analyze anywhere from a few characters up to a few lines of
> text rapidly.  Tesseract is a portion of my application pipeline.
> I've got my own document layout engine since there's a lot of really
> specialized, mostly useless domain knowledge.
> 
> OCR is currently taking up over half the total analysis time.  I
> managed to reduce it from about 60% to about 20% (2sec per document
> to 0.8sec per document) by using multiprocessing: I launch several
> jobs from the command line in parallel.  That's roughly 4x speedup on a
> quad-core so that's good.  But I'm still interested in pushing
> further.  In the ideal world I'd have four tesseract daemons on all
> the time and when I need OCR done I pipe a filename in (or perhaps the
> image data) and get a string out.  Or something like that.
> 
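
A minimal sketch of the "pipe a filename in, get a string out" loop
described above, in Python. It only shows the shape of a long-lived
worker: do the expensive setup once, then answer one request per line.
The names serve and fake_recognize are hypothetical, and fake_recognize
is a stand-in for whatever actually runs the OCR engine; nothing here
calls tesseract itself.

```python
from typing import Callable, TextIO


def serve(recognize: Callable[[str], str], infile: TextIO, outfile: TextIO) -> int:
    """Read one image path per line and write one line of text per path.

    `recognize` is assumed to hold any already-loaded engine state, so the
    startup cost is paid once per worker instead of once per image.
    Returns the number of requests handled.
    """
    handled = 0
    for line in infile:
        path = line.strip()
        if not path:
            continue  # ignore blank lines between requests
        text = recognize(path)
        # Collapse whitespace so every reply is exactly one line.
        outfile.write(" ".join(text.split()) + "\n")
        outfile.flush()  # the caller is waiting on the pipe
        handled += 1
    return handled


def fake_recognize(path: str) -> str:
    # Hypothetical placeholder: a real worker would run OCR on `path` here.
    return f"text from {path}"
```

Calling serve(fake_recognize, sys.stdin, sys.stdout) would turn this
into a filter process; four of them, one per core, would match the
"four daemons" idea.
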
> My thought is that it takes a certain amount of time to load up the
> binary and the training data and get organized in memory.  Right now
> this whole process happens every time I need to process a file,
> perhaps 10-20 times per document.  That could be a substantial amount
> of overhead.  I fed tesseract a 1x1 white tiff 10 times and it took
> between 30ms and 44ms to load and tell me that there was no output.
> Let's just assume for a moment that those numbers aren't totally
> bogus; that means out of 0.8sec per document I'm spending 10 x (30ms to
> 40ms) = 300ms to 400ms of time just loading up the binary.  That could
> be half of my total document processing time.
> 
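
The estimate above can be written out as a small calculation. This is
only the arithmetic from the message, using its own figures (10 launches
per document, 30ms to 40ms per launch, 0.8sec per document); the
function name startup_overhead is made up for illustration.

```python
def startup_overhead(launches: int, startup_ms: float, total_ms: float):
    """Return (overhead in ms, overhead as a fraction of total_ms)."""
    overhead_ms = launches * startup_ms
    return overhead_ms, overhead_ms / total_ms


# Figures taken from the message: 10 launches, 800 ms per document.
for startup_ms in (30, 40):
    overhead_ms, share = startup_overhead(10, startup_ms, 800)
    print(f"{startup_ms} ms per launch: {overhead_ms} ms total, "
          f"{share:.1%} of the document time")
```

So if the per-launch figures hold, somewhere between a bit over a third
and a half of the 0.8sec is startup cost, which is the motivation for
keeping the process resident.
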
> I haven't gone looking at the guts to try and figure out if this is
> possible yet.  I was hoping to get some feedback as to how dumb (or
> perhaps not!) of an idea this is before I really launched into it.  So
> what does everyone think?  Would this be helpful to anyone else?  Does
> tesseract's architecture lend itself to staying in RAM for an extended
> period of time, for multiple images?
> 
> I do realize that I could potentially just write out a single image
> with all the regions of interest contained within it, but my guess is
> that tesseract does some learning about what the font is as it
> processes characters.  And since each document might have different
> fonts, font sizes, etc I think that may be more harmful than
> beneficial.
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/tesseract-ocr?hl=en.
> 
