On Tue, 7 Feb 2017 12:22:44 -0800 Manish Goregaokar <[email protected]> wrote:
> I found things like this[1] on wikisource which seems like an OCR of
> some really garbled text. The text does indeed seem like it has
> additional vowel diacritics, but that could also be a scanning glitch.
> The same word appears twice in the document, but once in the text.

In particular, the two sequences look like misinterpreted U+09CB BENGALI
VOWEL SIGN O and U+09CC BENGALI VOWEL SIGN AU, which would account for
their high frequency. The OCRed texts cited by Manish seem to be in
acute need of manual correction.

Richard.

