Hi Laird, hi Jimmy,
Thanks for your answers.
lab wrote:
Tesseract cannot read display formulas; its fundamental model is only
linear.
Unless and until that changes, the best you can hope for is
recognizing symbols
in text, and you will have to watch out for problems with superscripts
and subscripts.
That is what I thought.
There is a project which claims to have that capability (here:
http://www.inftyproject.org/en/software.html#InftyReader) but it isn't
Free Software and only runs on Windows machines,
and I haven't any personal experience with it. Caveat emptor.
Several people have linked that project now. However, I don't have
Windows, so I cannot even test it, and I am looking specifically for a
free/open, cross-platform solution that I can use in my own projects.
Jimmy O'Regan wrote:
[I don't know what e-mail client you're using, but it's completely
useless at quoting text]
[Yeah, I know, that's Thunderbird...]
As it is, I pointed you to the enhancement request, which, as you seem
to not have read it, has some - admittedly, not much - extra
information on the topic.
Ah sorry, I misunderstood the request. Its description is only about
symbols, which is why I thought the request was about symbol recognition.
The original poster only added in an additional comment that this
request may be extended to full formula recognition -- which is in my
eyes a very different request, so it would have fit better into a
separate request.
Also, apart from the link to the Inftyproject and a comment that it is not
open source, the rest of the discussion is just about symbol recognition
(and esp. about how Detexify works).
Is it possible to extend Tesseract to be able to do this, or would it need
a heavy redesign of the whole engine (and fundamentally different
techniques)?
The only current system available for maths recognition - the link is
in the enhancement request - contains its maths recognition as a
separate engine. I don't think that's strictly necessary, but maths
would need to be processed in an entirely different way, and a formula
detection mechanism would be required to ensure it is handled in a
different way. At the very least, the formula would need to be
segmented into a grid, because relative position and size are much more
significant than in text - not just in detecting
superscripts/subscripts, but also in determining if pi means pi or
product, etc.
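
To make the point about relative position and size concrete, here is a
minimal sketch in Python. It is not how Tesseract or InftyReader works;
the Box type, the classify_relative function and both thresholds are
made up for illustration, and it assumes symbol bounding boxes are
already available from some earlier segmentation step. It only guesses
whether a symbol is a superscript, a subscript or inline relative to the
symbol before it.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def center_y(self) -> float:
        return (self.top + self.bottom) / 2.0


def classify_relative(base: Box, sym: Box,
                      size_ratio: float = 0.8,
                      shift: float = 0.25) -> str:
    """Guess how `sym` relates to the preceding symbol `base`.

    size_ratio and shift are illustrative guesses, not tuned values.
    """
    if base.height <= 0:
        raise ValueError("base box must have positive height")

    # A script is usually set noticeably smaller than its base symbol.
    smaller = sym.height < size_ratio * base.height

    # Vertical offset of the centers, normalized by the base height.
    # Negative means `sym` sits higher on the page than `base`.
    offset = (sym.center_y - base.center_y) / base.height

    if smaller and offset < -shift:
        return "superscript"
    if smaller and offset > shift:
        return "subscript"
    return "inline"


base = Box(left=0, top=10, right=10, bottom=30)
raised = Box(left=11, top=5, right=17, bottom=15)
lowered = Box(left=11, top=25, right=17, bottom=35)
print(classify_relative(base, raised))
print(classify_relative(base, lowered))
```

A real formula recognizer would need far more than this (fractions,
radicals, limits above and below a product sign, nested scripts), which
is why a plain left-to-right line model is not enough.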
Thanks for this evaluation.
I will see what I can do. Maybe I will try to play around with this a
bit myself. It was only for a small side project anyway, so I am not
sure yet how much time I want to invest in it. I will let you know if I
come up with something interesting.
Cya,
Albert