Re: Eliminating super scripts while extracting text from pdf

Olaf Drümmer Fri, 28 Mar 2014 14:48:12 -0700

Two thoughts:

- Keep track of the baseline and size of each character. If the baseline is 
slightly shifted (upwards -> superscript, downwards -> subscript) and the size 
is smaller than that of the surrounding characters, it is possibly a 
superscript or subscript character.


- Be aware that some fonts contain dedicated glyphs for superscripts. In that 
case the baseline and text size are the same as for the surrounding text, so 
you have to check via the Unicode code point whether you have encountered a 
superscript.
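
A rough sketch of how the two checks could be combined, in Python rather than 
any particular PDF library. The Glyph record, the function names and the 
thresholds (0.15 and 0.85) are made up for illustration and would need tuning; 
it also assumes y grows upwards, as in PDF user space, and that your extractor 
reports a baseline and font size per character.

```python
import unicodedata
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """Hypothetical per-character record; not the API of any real library."""
    text: str
    baseline: float  # y of the baseline, assumed to grow upwards
    size: float      # font size in points


def unicode_script_kind(ch):
    """Report 'super' or 'sub' when the code point itself is a raised/lowered form."""
    decomposition = unicodedata.decomposition(ch)
    if decomposition.startswith("<super>"):
        return "super"
    if decomposition.startswith("<sub>"):
        return "sub"
    return "normal"


def classify(glyph, line_baseline, line_size, shift_ratio=0.15, size_ratio=0.85):
    """Return 'super', 'sub' or 'normal' for one glyph.

    shift_ratio and size_ratio are assumed thresholds, not established values.
    """
    # Second thought: the font may use dedicated superscript glyphs, so the
    # geometry looks normal and only the code point gives it away.
    for kind in ("super", "sub"):
        if glyph.text and all(unicode_script_kind(c) == kind for c in glyph.text):
            return kind
    # First thought: smaller than the surrounding text and shifted baseline.
    shift = glyph.baseline - line_baseline
    if glyph.size < size_ratio * line_size:
        if shift > shift_ratio * line_size:
            return "super"
        if shift < -shift_ratio * line_size:
            return "sub"
    return "normal"


def strip_superscripts(glyphs):
    """Join the glyphs of one line, leaving out those classified as superscripts."""
    if not glyphs:
        return ""
    # The median stands in for "the surrounding characters"; this only works
    # when most glyphs on the line are ordinary text.
    line_baseline = median(g.baseline for g in glyphs)
    line_size = median(g.size for g in glyphs)
    return "".join(
        g.text for g in glyphs
        if classify(g, line_baseline, line_size) != "super"
    )
```

For example, a line made of the glyphs of "This is a test" at baseline 100 and 
size 10, followed by a "1" at baseline 104 and size 6 and a "." at baseline 
100 and size 10, would come out of strip_superscripts as "This is a test.".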

Olaf

On 28 Mar 2014, at 19:23, Siva Kumar Ch <[email protected]> wrote:

> Hi,
> 
> I am trying to extract text from a PDF and process it. The extraction itself
> works, but I could not get much benefit from it because the extracted text
> treats superscripts, usually numbers, as normal text.
> 
> A superscript attached to the last word of a sentence is placed after the
> period (.).
> 
> Example: the word "test" with the superscript "super".
> When it appears at the end of a sentence, it is extracted as
> "test.super"
> 
> Is there any way I can get rid of superscripts?
> 
> -- 
> Br,
> Siva.
