Two thoughts:

- Keep track of the baseline and size of each character. If the baseline is slightly shifted (upwards -> superscript, downwards -> subscript) and the size is smaller than that of the surrounding characters, it is probably a superscript or subscript character.
- Be aware that some fonts contain dedicated glyphs for superscripts. In that case the baseline and text size are the same as for the surrounding text, and you would have to check the Unicode code point to see whether you have encountered a superscript.

Olaf

On 28 Mar 2014 at 19:23, Siva Kumar Ch <[email protected]> wrote:

> Hi,
>
> I am trying to extract text from a PDF and process it. The extraction
> itself works, but I could not get much benefit out of it because the
> extracted text treats superscripts, usually numbers, as normal text.
>
> A superscript attached to a word that is the last word of a sentence
> is placed after the period (.).
>
> Example: the word "test" with superscript "super".
> When it appears at the end of a sentence, it is extracted as
> "test.super"
>
> Is there any way I can get rid of superscripts?
>
> --
> Br,
> Siva.
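
To make the two thoughts from the reply concrete, here is a minimal sketch in Python. It assumes your extraction library can report, for every character, its text, its baseline position and its font size. The Glyph class, the functions classify and strip_scripts, and both threshold values are hypothetical and chosen only for illustration; they are not part of any PDF library. The sketch also assumes that y grows upwards, as in PDF user space.

```python
import unicodedata
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """One extracted character (hypothetical container, not a library type)."""
    text: str
    baseline: float  # y of the baseline; assumed to grow upwards
    size: float      # font size in points


def unicode_script_kind(ch):
    """Return 'super' or 'sub' if the code point itself is a super/subscript."""
    # Characters such as U+00B2 carry a compatibility decomposition
    # tagged <super> or <sub> in the Unicode character database.
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<super>"):
        return "super"
    if decomp.startswith("<sub>"):
        return "sub"
    return None


def classify(glyphs, size_ratio=0.85, shift_ratio=0.15):
    """Label each glyph of one text line as 'normal', 'super' or 'sub'.

    size_ratio and shift_ratio are guessed thresholds, relative to the
    median font size of the line; tune them for your documents.
    """
    if not glyphs:
        return []
    ref_size = median(g.size for g in glyphs)
    ref_base = median(g.baseline for g in glyphs)
    labels = []
    for g in glyphs:
        # Second thought: the font may have a dedicated superscript glyph.
        kind = unicode_script_kind(g.text) if len(g.text) == 1 else None
        # First thought: smaller size combined with a shifted baseline.
        if kind is None and g.size < size_ratio * ref_size:
            shift = g.baseline - ref_base
            if shift > shift_ratio * ref_size:
                kind = "super"
            elif shift < -shift_ratio * ref_size:
                kind = "sub"
        labels.append(kind or "normal")
    return labels


def strip_scripts(glyphs):
    """Rebuild the line's text without superscript and subscript glyphs."""
    labels = classify(glyphs)
    return "".join(g.text for g, lab in zip(glyphs, labels) if lab == "normal")
```

Using the median of a line as the reference keeps a few small, raised characters from skewing the comparison, but it will fail on lines that consist mostly of superscripts or that mix several font sizes, so treat the result as a hint rather than a decision.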

