Re: Using Tesseract from a C++ application.

Robert Komar Wed, 07 Apr 2010 20:31:47 -0700

On Thu, 8 Apr 2010, MARTIN Pierre wrote:

Maybe _you_ could be the resource that helps them port to Windows.

I'll offer them my skills, yes. But why not work on Tesseract?

It would be less work than writing another OCR engine from scratch,
and you would get results a lot sooner.

Yes, it seems so. But I'm not very fond of Python, and as far as I understand 
from what I've read, their core is in C++ but their whole API will be extended 
with Python.

Or is the problem that it is probably harder to keep a commercial product based 
on scripts proprietary?

I don't understand this one. Can you explain it further, please?


I was thinking of an earlier post of yours where you were asking if
your source code would need to be re-distributed if you used
Tesseract.  I thought that a program based on scripts would probably
be more difficult to keep proprietary (i.e. not fully open source)
than a program based on compiled code, so maybe that's why you
didn't want to work with OCRopus.

I have no problem with anyone trying to make a commercial product
that uses Tesseract.  I was just trying to figure out why you seemed
to reject OCRopus so quickly.  Now I see it was because you don't
like Python, so the mystery is solved ;).

Anyway: do you think it may be possible for me to gather motivated people to 
continue the Tess project? A good reverse engineer, then a modeler using what 
the previous one did, then a coder. I would be a bit of all of them, mostly the 
modeler / coder. Unless there is at least one person in each domain, it's 
useless to try; I don't have the required skills alone (I'm clearly lacking in 
mathematics).


I suspect that OCR is not a simple problem that can be solved with a
clean design.  Tesseract is probably filled with small kludges and
workarounds to improve performance.  To throw out the code and begin
again based on Tesseract's general design probably means hitting and
working around all the same small problems they already dealt with.

Also, there may be problems in Tesseract's general design that would
be better to avoid in a new project.  For example, italics never seem
to be recognized correctly, and someone on this list pointed out a while
ago that the problem is that the bounding boxes for the italic characters
overlap, and this is not handled properly by Tesseract.  I'm sure there
are other fundamental problems.

For these reasons, I personally think it would be a mistake to start a
new project by reverse engineering Tesseract.  I do think that tweaking
the existing code to fix memory leaks and such (maybe introducing
doxygen comments to improve documentation) would be a good thing.

Hmm, the Tesseract page shows that two of the people with commit
privileges work on OCRopus now.  Maybe helping with OCRopus would be a
roundabout way of getting small fixes pushed upstream to Tesseract.  At
the least, they are probably in better contact with Ray Smith than
anyone here.

Cheers,
Rob Komar

