Hi,

On Tue, Aug 24, 2010 at 9:50 PM, Yogesh <[email protected]> wrote:
> I have PDFs for scientific literature. I want to extract all the notations
> like alpha, beta, gamma, delta and some other symbols along with the text.
> The PDFTextStripper works fine and gives me the text.
> How can I get these symbols along with the text the way it occurs in the
> PDF?

Those symbols are probably coming from a special font for which there
isn't a mapping to Unicode. Without such a mapping, PDFBox can't tell
what character to output for each symbol.

You can try to inspect the PDF document for the font that's used, and
look for existing CMap files for that font. In the worst case you may
need to construct such a character map yourself to teach PDFBox how to
interpret the symbols used in your document.

BR,

Jukka Zitting
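
To illustrate what a hand-built character map amounts to, here is a
minimal sketch in plain Java. It is not PDFBox API: the class
SymbolCharMap and its toUnicode method are hypothetical names, and the
four entries assume the font follows the Adobe Symbol layout, where the
codes for a, b, g and d are drawn as alpha, beta, gamma and delta. A
different font would need its own table, and you would still have to
work out which runs of extracted text were set in that font.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: a hand-built table from the
// character codes of a symbol font to the Unicode characters they depict.
class SymbolCharMap {
    private static final Map<Character, String> TO_UNICODE = new HashMap<>();
    static {
        // Assumption: the font uses the Adobe Symbol layout.
        // Check these entries against the font in your own PDF.
        TO_UNICODE.put('a', "\u03B1"); // alpha
        TO_UNICODE.put('b', "\u03B2"); // beta
        TO_UNICODE.put('g', "\u03B3"); // gamma
        TO_UNICODE.put('d', "\u03B4"); // delta
    }

    // Translates a run of text known to be set in the symbol font.
    // Codes without an entry pass through unchanged.
    static String toUnicode(String symbolRun) {
        StringBuilder out = new StringBuilder();
        for (char c : symbolRun.toCharArray()) {
            String mapped = TO_UNICODE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the four Greek letters for the codes a, b, g, d.
        System.out.println(toUnicode("abgd"));
    }
}
```

Unmapped codes are passed through rather than dropped, so gaps in the
table stay visible in the output and can be filled in as you find them.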

