Re: Extracting symbols from Text

Jukka Zitting Wed, 25 Aug 2010 04:52:41 -0700

Hi,

On Tue, Aug 24, 2010 at 9:50 PM, Yogesh <[email protected]> wrote:
> I have PDFs for scientific literature. I want to extract all the notations
> like alpha, beta, gamma, delta and some other symbols along with the text.
> The PDFTextStripper works fine and gives me the text.
> How can I get these symbols along with the text the way it occurs in the
> PDF?


Those symbols are probably coming from a special font for which there
isn't a mapping to Unicode. Without such a mapping PDFBox can't tell
what character to output for each symbol.

You can try to inspect the PDF document for the font that's used, and
look for existing CMap files for that font. In the worst case you may
need to construct such a character map yourself to teach PDFBox how to
interpret the symbols used in your document.
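
As a rough illustration of what such a hand-built character map amounts
to, here is a minimal sketch in plain Java. It assumes the symbols come
from a font that uses the Adobe Symbol encoding, where the codes for
'a', 'b', 'g' and 'd' stand for alpha, beta, gamma and delta. The class
SymbolMapSketch and its decode method are made-up names for this
example and are not part of PDFBox. How to plug such a table into
PDFBox depends on the version you use, so check the fonts in your
document before relying on any of these codes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a PDFBox class: a hand-built table from
// font character codes to Unicode strings.
public class SymbolMapSketch {

    private static final Map<Integer, String> SYMBOL_TO_UNICODE = new HashMap<>();

    static {
        // Assumed Adobe Symbol encoding; verify against the actual font.
        SYMBOL_TO_UNICODE.put(0x61, "\u03B1"); // 'a' -> alpha
        SYMBOL_TO_UNICODE.put(0x62, "\u03B2"); // 'b' -> beta
        SYMBOL_TO_UNICODE.put(0x67, "\u03B3"); // 'g' -> gamma
        SYMBOL_TO_UNICODE.put(0x64, "\u03B4"); // 'd' -> delta
    }

    // Translate a run of character codes; codes without a mapping
    // become U+FFFD so that gaps in the table stay visible.
    static String decode(int[] codes) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(SYMBOL_TO_UNICODE.getOrDefault(code, "\uFFFD"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(new int[] {0x61, 0x62, 0x67, 0x64}));
    }
}
```

Unmapped codes are deliberately kept as a replacement character rather
than dropped, which makes it easier to spot which entries are still
missing from the table.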

BR,

Jukka Zitting
