Hi Max,

Gosh, I am out of my depth on most of this. You might have an odd advantage 
with some of the unique symbols since they might lend themselves to something 
like template matching. Best of luck,

art

From: 'Max Poliakovski' via tesseract-ocr 
[mailto:[email protected]]
Sent: Tuesday, January 23, 2018 7:44 PM
To: tesseract-ocr <[email protected]>
Subject: Re: [tesseract-ocr] Improving text recognition in musical scores

Hello Art,
If your process to identify musical objects gives coordinates, you might be 
able to leverage those to divide the image into smaller sections and then apply 
tesseract to those. I tried removing lines from the image with leptonica and 
then using olena to identify text sections on the page (if the lines are left 
in, olena takes the staves for text). The attachment shows how close olena 
could get to identifying text sections. I suspect the trick is an approach 
like this, where you extract the text regions and then run tesseract on them 
individually.
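
A minimal sketch of the region-by-region idea in Python, assuming the page is 
a list of pixel rows and the regions arrive as (left, top, right, bottom) 
boxes. The names crop and ocr_regions and the ocr callable are hypothetical; 
in practice the callable would wrap a Tesseract call on the cropped image.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]          # rows of pixels, 1 = ink, 0 = background


def crop(image: Image, box: Box, pad: int = 0) -> Image:
    """Return the sub-image covered by box, grown by pad pixels and clamped to the page."""
    height = len(image)
    width = len(image[0]) if image else 0
    left, top, right, bottom = box
    left, top = max(0, left - pad), max(0, top - pad)
    right, bottom = min(width, right + pad), min(height, bottom + pad)
    return [row[left:right] for row in image[top:bottom]]


def ocr_regions(
    image: Image,
    boxes: Sequence[Box],
    ocr: Callable[[Image], object],  # assumption: any function that reads one cropped region
    pad: int = 2,
) -> List[Tuple[Box, object]]:
    """Run ocr on each region separately, ordered top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [(box, ocr(crop(image, box, pad))) for box in ordered]
```

A small amount of padding around each box tends to help, since a crop that 
touches the glyphs leaves the recognizer no margin to work with.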

The results you obtained with Scribo look promising! It sounds like Scribo 
could help to overcome shortcomings in the Tesseract layout analysis that 
Audiveris currently relies on.

There are still several difficult cases we need to address, among those:

  *   lyric syllables consisting of a single character (mostly a vowel). I 
doubt Scribo/Tesseract would ever be able to recognize those automatically
  *   dynamics written in italic (p mf ff fff)
  *   certain character sequences being misinterpreted as text (tuplet 
symbols involving brackets)

It looks like we need to adopt a more sophisticated approach instead of the 
current "single pass" one. Here is a sketch:

1) image preprocessing and binarization
2) labeling of staves and long-and-thin symbols (beams, slurs etc.) because 
those will likely confuse OCR layout analysis
3) temporary removal of symbols labeled in step 2
4) OCR layout analysis (without actual text recognition)
5) recognition of fixed-shape musical symbols
6) recognition of textual items
7) putting everything into a graph and trying to find a feasible interpretation 
of the data gathered during 1-6
8) interactive refinement involving a human operator
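
Steps 2 and 3 might look roughly like the Python sketch below, reduced to 
staff lines only. It assumes a binarized page stored as rows of 0/1 pixels and 
uses a plain row-projection heuristic (a row that is mostly ink is labeled a 
staff line). The function names and the min_fill threshold are hypothetical, 
and a real score would need something more robust for skewed or curved 
staves. The erased pixels are returned so that the removal stays temporary.

```python
from typing import Iterable, List, Set, Tuple

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def find_staff_rows(image: Image, min_fill: float = 0.8) -> List[int]:
    """Step 2: label rows whose ink fraction is at least min_fill as staff-line rows."""
    return [y for y, row in enumerate(image) if row and sum(row) / len(row) >= min_fill]


def remove_staff_rows(image: Image, rows: List[int]) -> Tuple[Image, Set[Tuple[int, int]]]:
    """Step 3: erase staff-line pixels that have no ink directly above or below.

    Pixels with ink above or below are kept, so stems and text strokes that
    cross a line are not cut. Returns a cleaned copy and the erased (x, y) pixels.
    """
    staff = set(rows)
    cleaned = [row[:] for row in image]
    erased: Set[Tuple[int, int]] = set()
    for y in rows:
        for x, pixel in enumerate(image[y]):
            if not pixel:
                continue
            above = y > 0 and (y - 1) not in staff and image[y - 1][x]
            below = y + 1 < len(image) and (y + 1) not in staff and image[y + 1][x]
            if not (above or below):
                cleaned[y][x] = 0
                erased.add((x, y))
    return cleaned, erased


def restore(image: Image, erased: Iterable[Tuple[int, int]]) -> Image:
    """Undo the temporary removal once layout analysis is done."""
    restored = [row[:] for row in image]
    for x, y in erased:
        restored[y][x] = 1
    return restored
```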

Because fully automatic text identification isn't possible (as opposed to 
handling the most common cases), a simple UI letting the user verify/correct 
the result of the layout analysis could be incorporated after step 4.

Let's assume we've successfully identified all text items. Now we need to 
recognize them properly, which raises another challenge.

Chords, for example, utilize a very restricted symbol set and can also contain 
musical symbols like ♯ and ♭ as well as superscript characters. I'm afraid that 
we have to train Tesseract to recognize musical symbols first and then 
experiment with specifying external grammars, disabling dictionaries and using 
"whitelists". Otherwise, Tesseract will most likely spit out garbage instead of 
properly recognized chords.
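
To make the "external grammar" idea concrete, here is a small Python sketch 
that checks OCR output against a chord pattern after recognition, rather than 
inside Tesseract. The grammar is an illustrative assumption covering common 
chord symbols (root, accidental, quality, extension, altered tones, slash 
bass); it is not complete and is not taken from Audiveris or Tesseract.

```python
import re
from typing import Iterable, List

ACCIDENTAL = "[#b♯♭]"  # ASCII stand-ins and the real musical symbols

# Assumed, deliberately small chord grammar; extend it for the scores at hand.
CHORD_RE = re.compile(
    "[A-G]" + ACCIDENTAL + "?"                  # root note, optional accidental
    + "(?:maj|min|dim|aug|sus|m)?"              # quality
    + "(?:11|13|2|4|5|6|7|9)?"                  # extension
    + "(?:" + ACCIDENTAL + "(?:11|13|5|9))*"    # altered tones such as b5 or ♯9
    + "(?:/[A-G]" + ACCIDENTAL + "?)?"          # slash bass note
)


def is_chord(token: str) -> bool:
    """True if the whole token fits the assumed chord grammar."""
    return CHORD_RE.fullmatch(token.strip()) is not None


def split_chords(tokens: Iterable[str]) -> tuple:
    """Separate tokens into (accepted chords, rejects to re-run or show to the user)."""
    accepted: List[str] = []
    rejected: List[str] = []
    for token in tokens:
        (accepted if is_chord(token) else rejected).append(token)
    return accepted, rejected
```

Rejected tokens are the interesting ones: they could be re-recognized with 
different settings or queued for the interactive correction step.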

Has anyone been able to successfully recognize unusual character sequences 
(math formulas, special codes etc.) with Tesseract? Which tricks were 
involved? Real-world examples would be great...

For lyrics, we'll need to tell Tesseract to treat standalone syllables as 
parts of longer words so that ambiguities can be resolved automatically. One 
possibility is to remove the whitespace between syllables based on some 
heuristics. I'm afraid we will end up with a fragile system...
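
One such heuristic, sketched in Python: scores usually print a hyphen between 
the syllables of one word, so syllables linked by a hyphen can be glued 
together before any dictionary lookup. The function name and token format are 
hypothetical, and the sketch assumes the hyphens survive recognition, which 
will not always hold. Extension lines for melismas are not handled.

```python
from typing import Iterable, List

HYPHENS = "-\u2010\u2013"  # hyphen-minus, Unicode hyphen, en dash


def join_syllables(tokens: Iterable[str]) -> List[str]:
    """Merge hyphen-linked lyric syllables into whole words.

    Accepts hyphens as separate tokens ("Glo", "-", "ri") or attached to a
    syllable ("Glo-", "ri-"). Hyphens inside a token are left untouched.
    """
    words: List[str] = []
    current = ""      # syllables collected so far for the word being built
    joining = False   # True when the last thing seen was a linking hyphen
    for token in tokens:
        text = token.strip()
        if not text:
            continue
        core = text.strip(HYPHENS)
        if not core:
            # A standalone hyphen links the previous syllable to the next one.
            joining = bool(current)
            continue
        if current and not joining and text[0] not in HYPHENS:
            words.append(current)  # no link to the previous syllable: new word
            current = ""
        current += core
        joining = text[-1] in HYPHENS
    if current:
        words.append(current)
    return words
```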

Further ideas?
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to 
[email protected]<mailto:[email protected]>.
To post to this group, send email to 
[email protected]<mailto:[email protected]>.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f0d34df5-9bd8-4ed6-9c27-06a8eeecfa64%40googlegroups.com<https://groups.google.com/d/msgid/tesseract-ocr/f0d34df5-9bd8-4ed6-9c27-06a8eeecfa64%40googlegroups.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout.

