> Hello Art,
>
> If your process to identify musical objects gives coordinates, you might be
> able to leverage those to divide the image into smaller sections and then
> apply tesseract to those. I tried removing lines from the image with
> leptonica and then using olena to identify text sections on the page (olena
> will think the staves designate text without removing the lines). The
> attachment shows how close olena could get to identifying text sections.
> I suspect the trick is an approach like this where you extract the text
> regions and then use tesseract on them individually.
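To make the divide-and-OCR idea concrete, here is a minimal sketch, assuming the music-object detector yields (x, y, w, h) pixel boxes for candidate text regions. The function names and box format are my assumptions, and the actual Tesseract call is only indicated in a comment:

```python
# Sketch: pad each detected text region slightly and clip it to the image,
# so Tesseract sees a little whitespace around each snippet before OCR.
# (pad_and_clip / crop_regions and the (x, y, w, h) box format are assumptions.)

def pad_and_clip(box, img_w, img_h, margin=4):
    """Expand a region by a small margin, clipped to the image bounds."""
    x, y, w, h = box
    x0 = max(0, x - margin)
    y0 = max(0, y - margin)
    x1 = min(img_w, x + w + margin)
    y1 = min(img_h, y + h + margin)
    return (x0, y0, x1 - x0, y1 - y0)

def crop_regions(boxes, img_w, img_h):
    """Return padded crops; each crop would then be OCRed separately,
    e.g. pytesseract.image_to_string(image.crop(...), config='--psm 7')."""
    return [pad_and_clip(b, img_w, img_h) for b in boxes]

print(crop_regions([(100, 50, 40, 12)], img_w=800, img_h=600))
# [(96, 46, 48, 20)]
```

Running Tesseract with `--psm 7` (treat the crop as a single text line) per region tends to behave better than page-level layout analysis on music scores.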
The results you obtained with Scribo look promising! It sounds like Scribo could help overcome shortcomings in Tesseract's layout analysis, which Audiveris currently relies on.

There are still several difficult cases we need to address, among them:

- lyric syllables consisting of a single character (mostly a vowel) - I doubt Scribo/Tesseract would ever be able to recognize those automatically
- dynamics written in italics (*p mf ff fff*)
- certain character sequences being misinterpreted as text (tuplet symbols involving brackets)

It looks like we need to adopt a more sophisticated approach instead of the current "single pass" one. Here is a sketch:

1) image preprocessing and binarization
2) labeling of staves and long-and-thin symbols (beams, slurs, etc.), because those will likely confuse OCR layout analysis
3) temporary removal of the symbols labeled in step 2
4) OCR layout analysis (without actual text recognition)
5) recognition of fixed-shape musical symbols
6) recognition of textual items
7) putting everything into a graph and trying to find a feasible interpretation of the data gathered during steps 1-6
8) interactive refinement involving a human operator

Because fully automatic text identification isn't possible (as opposed to merely handling the most common cases), a simple UI letting the user verify/correct the result of the layout analysis could be incorporated after step 4.

Let's assume we've successfully identified all text items. Now we need to properly recognize them, which raises another challenge. Chord symbols, for example, use a very restricted character set and can also contain musical symbols like ♯ and ♭, as well as superscript characters. I'm afraid we'll have to train Tesseract to recognize musical symbols first and then experiment with specifying external grammars, disabling dictionaries, and using "whitelists". Otherwise, Tesseract will most likely spit out garbage instead of properly recognized chords.
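For the chord case, a whitelist plus disabled dictionaries is probably the cheapest first experiment before any retraining. A minimal sketch: the character inventory below is my assumption (plain ASCII `#`/`b` stand in for ♯/♭, which would indeed need training); `tessedit_char_whitelist`, `load_system_dawg` and `load_freq_dawg` are standard Tesseract config variables, though whitelist support has historically varied between Tesseract's engine modes:

```python
# Build a restricted character set for chord-symbol OCR.
# The inventory is an assumption, not a complete chord grammar.
ROOTS = "ABCDEFG"
ACCIDENTALS = "#b"          # ASCII stand-ins; real ♯/♭ glyphs would need training
QUALITIES = "majindsug+o"   # letters covering maj, min, dim, sus, aug, etc.
DIGITS = "0123456789"
EXTRAS = "/()-"             # slash chords and parenthesized alterations

whitelist = ROOTS + ACCIDENTALS + QUALITIES + DIGITS + EXTRAS

# With pytesseract this would be passed per region, e.g.:
#   config = ("--psm 7 -c tessedit_char_whitelist=" + whitelist +
#             " -c load_system_dawg=0 -c load_freq_dawg=0")
print(whitelist)
```

Disabling the word/frequency dictionaries (`load_system_dawg`, `load_freq_dawg`) matters here because chord symbols are exactly the kind of "non-words" the dictionaries would otherwise "correct" into garbage.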
Has anyone been able to successfully recognize unusual character sequences (math formulas, special codes, etc.) with Tesseract? Which tricks were involved? Real-world examples would be great...

For lyrics, we'll need to tell Tesseract to consider standalone syllables as parts of longer words so that ambiguities can be resolved automatically. One possibility is to remove the whitespace between syllables, relying on some heuristics. I'm afraid we'll end up with a fragile system...

Further ideas?
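The syllable-merging heuristic could start as simply as joining hyphen-terminated tokens back into candidate words (engraved lyrics usually hyphenate mid-word syllables), and only then consulting a dictionary. A sketch under that assumption; the token format and function name are hypothetical:

```python
# Heuristic sketch: merge OCRed lyric syllables into candidate words.
# A trailing hyphen on a token means the word continues in the next token.
def join_syllables(tokens):
    words, current = [], ""
    for tok in tokens:
        if tok.endswith("-"):
            current += tok[:-1]       # strip the hyphen, keep accumulating
        else:
            words.append(current + tok)
            current = ""
    if current:                       # dangling continuation at end of line
        words.append(current)
    return words

print(join_syllables(["Hal-", "le-", "lu-", "jah", "a-", "men"]))
# ['Hallelujah', 'amen']
```

The fragile part is exactly what you note: this only works if the hyphens survive OCR, and a continuation that crosses a system break needs the next line's tokens as well.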

