Hello Art,

> If your process to identify musical objects gives coordinates, you might be 
> able to leverage those to divide the image into smaller sections and then 
> apply tesseract to those. I tried removing lines from the image with 
> leptonica and then using olena to identify text sections on the page (olena 
> will think the staves designate text without removing the lines). The 
> attachment shows how close olena could get to identifying text sections, I 
> suspect the trick is an approach like this where you extract the text 
> regions and then use tesseract on them individually.
>

The results you obtained with Scribo look promising! It sounds like Scribo 
could help overcome shortcomings in the Tesseract layout analysis that 
Audiveris currently relies on.
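
For reference, the region-by-region approach could be sketched like this. 
It's only a sketch: the (x, y, width, height) box format, the list-of-rows 
image representation, and the injected OCR callable are all assumptions to 
be adapted to the actual toolchain (with real images you'd pass PIL crops 
to pytesseract):

```python
# Sketch: OCR each detected text region individually instead of the whole
# page.  The box format, the list-of-rows image, and the injected 'ocr'
# callable are assumptions for illustration.
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

def crop(img: List[list], box: Box) -> List[list]:
    """Extract the rectangular sub-image covered by 'box'."""
    x, y, w, h = box
    return [row[x:x + w] for row in img[y:y + h]]

def ocr_regions(img: List[list], boxes: Sequence[Box],
                ocr: Optional[Callable] = None):
    """Run OCR separately on every text region found by layout analysis."""
    if ocr is None:
        import pytesseract  # assumed dependency
        ocr = pytesseract.image_to_string
    return [(box, ocr(crop(img, box))) for box in boxes]
```

Feeding Tesseract one region at a time would also let us vary the page 
segmentation mode per region (e.g. "single line" for lyrics).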

There are still several difficult cases we need to address, among them:

   - lyric syllables consisting of a single character (mostly a vowel). I 
   doubt Scribo/Tesseract would ever be able to recognize those automatically
   - dynamics written in italics (*p mf ff fff*)
   - certain character sequences being misinterpreted as text (tuplet 
   symbols involving brackets)

It looks like we need to adopt a more sophisticated approach instead of the 
current "single pass" one. Here is a sketch:

1) image preprocessing and binarization
2) labeling of staves and long-and-thin symbols (beams, slurs, etc.), since 
those will likely confuse the OCR layout analysis
3) temporary removal of the symbols labeled in step 2
4) OCR layout analysis (without actual text recognition)
5) recognition of fixed-shape musical symbols
6) recognition of textual items
7) putting everything into a graph and trying to find a feasible 
interpretation of the data gathered during 1-6
8) interactive refinement involving human operator

Because fully automatic text identification isn't possible (as opposed to 
handling the most common cases), a simple UI letting the user 
verify/correct the results of the layout analysis could be incorporated 
after step 4.
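
The "temporary removal" in step 3 could work roughly like this: blank out 
the pixels labeled as staves or long-and-thin symbols, but remember the 
originals so they can be restored once layout analysis is done. A minimal 
sketch on a list-of-rows grayscale image (the representation and names are 
assumptions):

```python
# Sketch of step 3: temporarily blank out labeled pixels, remembering the
# originals so the image can be restored after layout analysis (step 4).
# 'img' is a list of rows of grayscale values; 'mask' has the same shape
# with True where a labeled symbol lies.
def remove_symbols(img, mask, background=255):
    saved = {}
    cleaned = [row[:] for row in img]
    for y, row in enumerate(mask):
        for x, flagged in enumerate(row):
            if flagged:
                saved[(y, x)] = cleaned[y][x]  # remember original pixel
                cleaned[y][x] = background     # blank it out
    return cleaned, saved

def restore_symbols(img, saved):
    """Undo remove_symbols() once layout analysis is done."""
    restored = [row[:] for row in img]
    for (y, x), value in saved.items():
        restored[y][x] = value
    return restored
```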

Let's assume we've successfully identified all text items. Now we need to 
recognize them properly, which raises another challenge.

Chords, for example, use a very restricted symbol set and can also contain 
musical symbols like ♯ and ♭, as well as superscript characters. I'm afraid 
we will have to train Tesseract to recognize musical symbols first and then 
experiment with external grammars, disabled dictionaries, and character 
"whitelists". Otherwise, Tesseract will most likely spit out garbage 
instead of properly recognized chords.
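
As a starting point, restricting the character set per region might look 
like this. The helper name and the exact alphabet below are hypothetical 
guesses to be tuned; sharps and flats (♯, ♭) would additionally require 
training data that actually contains those glyphs:

```python
# Hypothetical helper: build a Tesseract configuration for chord symbols.
# '--psm 7' treats the region as a single text line, and
# 'tessedit_char_whitelist' restricts the recognizable characters.
# The alphabet below (note names, accidentals, digits, letters for
# maj/min/dim/sus/aug/add) is a guess to be tuned.
CHORD_WHITELIST = "ABCDEFG#b0123456789adgijmnsu/+-"

def chord_config(whitelist: str = CHORD_WHITELIST) -> str:
    return "--psm 7 -c tessedit_char_whitelist=" + whitelist

# Usage with pytesseract (assumed):
#   text = pytesseract.image_to_string(chord_img, config=chord_config())
```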

Has anyone been able to successfully recognize unusual character sequences 
(math formulas, special codes, etc.) with Tesseract? Which tricks were 
involved? Real-world examples would be great...

For lyrics, we'll need to tell Tesseract to consider standalone syllables 
as parts of longer words so that ambiguities can be resolved automatically. 
One possibility is to remove the whitespace between syllables based on some 
heuristics. I'm afraid we would end up with a fragile system...
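
One such heuristic could be purely geometric: merge syllables whose 
horizontal gap is below a threshold before handing the merged word to the 
recognizer. A fragile sketch, as noted; the (text, x_start, x_end) tuple 
format and the pixel threshold are assumptions:

```python
# Heuristic sketch for lyrics: merge syllables whose horizontal gap falls
# below a threshold, so the recognizer sees whole words instead of
# isolated fragments.  Syllables are (text, x_start, x_end) tuples sorted
# left to right; 'max_gap' is in pixels.
def join_syllables(syllables, max_gap):
    if not syllables:
        return []
    words = []
    current, last_end = syllables[0][0], syllables[0][2]
    for text, x_start, x_end in syllables[1:]:
        if x_start - last_end <= max_gap:
            current += text           # small gap: same word
        else:
            words.append(current)     # large gap: word boundary
            current = text
        last_end = x_end
    words.append(current)
    return words
```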

Further ideas?

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.