Hello Art,

> If your process to identify musical objects gives coordinates, you might be 
> able to leverage those to divide the image into smaller sections and then 
> apply tesseract to those. I tried removing lines from the image with 
> leptonica and then using olena to identify text sections on the page (olena 
> will think the staves designate text without removing the lines). The 
> attachment shows how close olena could get to identifying text sections, I 
> suspect the trick is an approach like this where you extract the text 
> regions and then use tesseract on them individually.
>

The results you obtained with Scribo look promising! It sounds like Scribo 
could help overcome shortcomings in the Tesseract layout analysis that 
Audiveris currently relies on.
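
For reference, the region-by-region approach could be sketched like this. 
It's only a sketch: the (x, y, width, height) box format, the list-of-rows 
image representation, and the injected OCR callable are all assumptions to 
be adapted to the actual toolchain (with real images you'd pass PIL crops 
to pytesseract):

```python
# Sketch: OCR each detected text region individually instead of the whole
# page.  The box format, the list-of-rows image, and the injected 'ocr'
# callable are assumptions for illustration.
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

def crop(img: List[list], box: Box) -> List[list]:
    """Extract the rectangular sub-image covered by 'box'."""
    x, y, w, h = box
    return [row[x:x + w] for row in img[y:y + h]]

def ocr_regions(img: List[list], boxes: Sequence[Box],
                ocr: Optional[Callable] = None):
    """Run OCR separately on every text region found by layout analysis."""
    if ocr is None:
        import pytesseract  # assumed dependency
        ocr = pytesseract.image_to_string
    return [(box, ocr(crop(img, box))) for box in boxes]
```

Feeding Tesseract one region at a time would also let us vary the page 
segmentation mode per region (e.g. "single line" for lyrics).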

There are still several difficult cases we need to address, among them:

   - lyric syllables consisting of a single character (mostly a vowel). I 
   doubt Scribo/Tesseract would ever be able to recognize those automatically
   - dynamics written in italics (*p mf ff fff*)
   - certain character sequences being misinterpreted as text (tuplet 
   symbols involving brackets)

It looks like we need to adopt a more sophisticated approach instead of the 
current "single pass" one. Here is a sketch:

1) image preprocessing and binarization
2) labeling of staves and long-and-thin symbols (beams, slurs, etc.), since 
those will likely confuse the OCR layout analysis
3) temporary removal of the symbols labeled in step 2
4) OCR layout analysis (without actual text recognition)
5) recognition of fixed-shape musical symbols
6) recognition of textual items
7) putting everything into a graph and trying to find a feasible 
interpretation of the data gathered during 1-6
8) interactive refinement involving human operator

Because fully automatic text identification isn't possible (as opposed to 
handling the most common cases), a simple UI letting the user 
verify/correct the results of the layout analysis could be incorporated 
after step 4.
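
The "temporary removal" in step 3 could work roughly like this: blank out 
the pixels labeled as staves or long-and-thin symbols, but remember the 
originals so they can be restored once layout analysis is done. A minimal 
sketch on a list-of-rows grayscale image (the representation and names are 
assumptions):

```python
# Sketch of step 3: temporarily blank out labeled pixels, remembering the
# originals so the image can be restored after layout analysis (step 4).
# 'img' is a list of rows of grayscale values; 'mask' has the same shape
# with True where a labeled symbol lies.
def remove_symbols(img, mask, background=255):
    saved = {}
    cleaned = [row[:] for row in img]
    for y, row in enumerate(mask):
        for x, flagged in enumerate(row):
            if flagged:
                saved[(y, x)] = cleaned[y][x]  # remember original pixel
                cleaned[y][x] = background     # blank it out
    return cleaned, saved

def restore_symbols(img, saved):
    """Undo remove_symbols() once layout analysis is done."""
    restored = [row[:] for row in img]
    for (y, x), value in saved.items():
        restored[y][x] = value
    return restored
```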

Let's assume we've successfully identified all text items. Now we need to 
recognize them properly, which raises another challenge.

Chords, for example, use a very restricted symbol set and can also contain 
musical symbols like ♯ and ♭, as well as superscript characters. I'm afraid 
we will have to train Tesseract to recognize musical symbols first and then 
experiment with external grammars, disabled dictionaries, and character 
"whitelists". Otherwise, Tesseract will most likely spit out garbage 
instead of properly recognized chords.
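
As a starting point, restricting the character set per region might look 
like this. The helper name and the exact alphabet below are hypothetical 
guesses to be tuned; sharps and flats (♯, ♭) would additionally require 
training data that actually contains those glyphs:

```python
# Hypothetical helper: build a Tesseract configuration for chord symbols.
# '--psm 7' treats the region as a single text line, and
# 'tessedit_char_whitelist' restricts the recognizable characters.
# The alphabet below (note names, accidentals, digits, letters for
# maj/min/dim/sus/aug/add) is a guess to be tuned.
CHORD_WHITELIST = "ABCDEFG#b0123456789adgijmnsu/+-"

def chord_config(whitelist: str = CHORD_WHITELIST) -> str:
    return "--psm 7 -c tessedit_char_whitelist=" + whitelist

# Usage with pytesseract (assumed):
#   text = pytesseract.image_to_string(chord_img, config=chord_config())
```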

Has anyone been able to successfully recognize unusual character sequences 
(math formulas, special codes, etc.) with Tesseract? Which tricks were 
involved? Real-world examples would be great...

For lyrics, we'll need to tell Tesseract to consider standalone syllables 
as parts of longer words so that ambiguities can be resolved automatically. 
One possibility is to remove the whitespace between syllables based on some 
heuristics. I'm afraid we would end up with a fragile system...
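
One such heuristic could be purely geometric: merge syllables whose 
horizontal gap is below a threshold before handing the merged word to the 
recognizer. A fragile sketch, as noted; the (text, x_start, x_end) tuple 
format and the pixel threshold are assumptions:

```python
# Heuristic sketch for lyrics: merge syllables whose horizontal gap falls
# below a threshold, so the recognizer sees whole words instead of
# isolated fragments.  Syllables are (text, x_start, x_end) tuples sorted
# left to right; 'max_gap' is in pixels.
def join_syllables(syllables, max_gap):
    if not syllables:
        return []
    words = []
    current, last_end = syllables[0][0], syllables[0][2]
    for text, x_start, x_end in syllables[1:]:
        if x_start - last_end <= max_gap:
            current += text           # small gap: same word
        else:
            words.append(current)     # large gap: word boundary
            current = text
        last_end = x_end
    words.append(current)
    return words
```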

Further ideas?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.