[tesseract-ocr] Re: Mathematical Formulae recognition

Leopold Hamminger Thu, 31 Oct 2019 06:49:46 -0700

Hi,

I came across this conversation regarding formulae. May I ask whether you 
have made any progress?


I need a solution for this as well, and I am glad to help with testing etc.

Greetings,
Leo Hamminger

On Tuesday, 16 December 2008 at 23:34:21 UTC+1, Ray Smith wrote:
>
> You would need to cut out most of the code in the textord directory, and 
> just run the classifier directly on the blobs, with the baseline correction 
> feature disabled.
>
> This means:
> bypass filter_blobs and textord_page in edges_and_textord, making fake 
> words and text lines from individual blobs. The code in applybox.cpp might 
> give you some idea of how to do this.
> Set IntegerMatcherMultiplier to zero.
>
> Ray.
>
> On Fri, Dec 12, 2008 at 1:12 AM, lab <la...@lbreyer.com> wrote:
>
>>
>> Ray,
>>
>> can you explain what you mean by skipping text line and word finding,
>> i.e. how to enable or disable this correctly in tesseract?
>>
>> I've had mixed results with the standard tesseract 2.03 (debian,
>> default options) on mathematical documents. Most sentences with simple
>> formulas or isolated mathematical symbols can be read reasonably well
>> after training some sample pages, but displayed equations and formulas
>> (i.e. on their own line(s)) are usually completely garbled.  Moderately
>> simple symbols with both a superscript and a subscript cannot usually
>> be recognized at all. Also, having both superscripts and subscripts
>> somewhere in a single formula can confuse tesseract so that it thinks
>> the superscript belongs to the previous line or an "extra" line in
>> between. I've also observed that sometimes, the same symbol can be
>> recognized easily when it occurs in a subscript position, but is often
>> mistaken when it occurs in a superscript position.
>>
>> lab.
>>
>> On Dec 12, 8:51 am, "Ray Smith" <theraysm...@gmail.com> wrote:
>> > This problem has not been attempted before with tesseract.
>> > The biggest thing to watch out for is to skip the text line and word
>> > finding. You might have significant success just running the classifier
>> > on the connected components.
>> > Training might be a bit tricky too, since it relies on the text line
>> > finder.
>> > Ray.
>> >
>> > Sent from my G1 Android Phone.
>> >
>> > On Dec 10, 2008 10:45 PM, "jean" <jean.f...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm interested in developing an OCR to read math formulas using
>> > tesseract as my platform. I have been trying to use tesseract to read
>> > LaTeX image files. I have tried reading the square root of x+2, and
>> > tesseract read it as vx+2. For the sqrt(x+ sqrt(2)), tesseract sees
>> > J@. No big surprise since tesseract wasn't made for understanding the
>> > recursive nature of math formulas.
>> >
>> > So my question is what progress has been made on a tesseract-based
>> > math-OCR? And would there be any things I need to watch out for?
>> >
>> > --Jean
>>
>>
>
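
For anyone picking this thread up: Ray's suggestion of skipping text line and
word finding and running the classifier directly on the connected components
might be sketched roughly as below. This is plain Python for illustration only,
not Tesseract code. The names connected_components, classify_blobs and
classify_blob are hypothetical; classify_blob stands in for whatever
per-symbol classifier is actually used, and the input is assumed to be an
already binarized image given as a list of rows of 0/1 values.

```python
from collections import deque


def connected_components(bitmap):
    """Return bounding boxes (top, left, bottom, right) of 8-connected blobs.

    bitmap is a list of equal-length rows; truthy cells are foreground (ink).
    Boxes are returned in row-major order of each blob's first pixel.
    """
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not bitmap[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and bitmap[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify_blobs(bitmap, classify_blob):
    """Classify each blob on its own, with no line or word grouping.

    classify_blob is a hypothetical callable that takes the cropped
    bounding-box region (a list of rows) and returns a label.
    Note: the crop is the whole bounding box, so it may contain pixels of
    a neighbouring blob when bounding boxes overlap.
    """
    results = []
    for top, left, bottom, right in connected_components(bitmap):
        crop = [row[left:right + 1] for row in bitmap[top:bottom + 1]]
        results.append(((top, left, bottom, right), classify_blob(crop)))
    return results
```

Each blob is handed to the classifier with only its bounding box, so the
superscript/subscript position information lab describes is not used for line
assignment at all; reassembling the formula structure from the box coordinates
would be a separate step.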

