Re: Fwd: Junk Characters while Extracting text from pdf file.

Andreas Lehmkuehler Wed, 06 Feb 2013 22:36:06 -0800

Hi,

On 05.02.2013 22:06, Peter Murray-Rust wrote:

On Tue, Feb 5, 2013 at 6:36 PM, Andreas Lehmkuehler <[email protected]> wrote:

Hi,

On 05.02.2013 15:01, kulbhushan singh wrote:

  Hi,


I am trying to extract text from a PDF file with custom fonts, but the
result is junk characters. The fonts used are ArialMT (embedded subset) and
Arial-BoldMT (embedded subset). The producer of the PDF file is GPL
Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can I
do this for custom fonts? Any reference or solution would be appreciated.

Did you do the "adobe" test? [1]


Does this require buying Adobe Acrobat? Or is there a free version?

No, just open the PDF in question using Adobe Reader, select the (whole)
text and try to copy and paste it into an editor. "File -> Save as text"
should do the same. If neither works, the text can't be extracted. In
most cases a mapping to readable text is missing; it is not required
to render/print the PDF, only to extract the text.
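
To illustrate why a missing mapping produces junk, here is a minimal toy
sketch in plain Java. It does not use PDFBox; the class name, the codes and
the table are all made up for illustration. In a real PDF the role of the
table is typically played by a ToUnicode CMap attached to the font.

```java
import java.util.HashMap;
import java.util.Map;

public class SubsetFontDemo {
    // Hypothetical subset font: the page content stores these small codes,
    // each of which selects an embedded glyph outline. They draw "Hello".
    static final int[] CONTENT_CODES = {1, 2, 3, 3, 4};

    /** Builds the (made-up) code -> text table for the codes above. */
    static Map<Integer, String> helloTable() {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "H");
        table.put(2, "e");
        table.put(3, "l");
        table.put(4, "o");
        return table;
    }

    /**
     * Turns glyph codes into text. If no table is given (or a code is not
     * in it), the raw code is emitted as a character, which is roughly what
     * shows up as "junk" in extracted text.
     */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String mapped = (toUnicode == null) ? null : toUnicode.get(code);
            sb.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(CONTENT_CODES, helloTable()));
        // Without the table the result consists of control characters.
        System.out.println(decode(CONTENT_CODES, null).equals("Hello"));
    }
}
```

Rendering only needs the glyph outlines selected by the codes, so the page
looks fine on screen and on paper either way; only the extraction step
depends on the table.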

I have created heuristics for about 100 of these non-conformant fonts (
http://bitbucket.org/petermr/pdf2svg which uses PDFBox). If you mail me a
sample file I can see whether these would help. I have done several TeX
fonts (CMM etc.) but haven't done a Ghostscript one and it would be useful

But as Andreas says, ultimately these are probably non-conformant. A mixture

No, a missing mapping doesn't lead to a non-conformant pdf. It is still
valid.

of heuristics and glyph analysis (OCR and/or heuristics) is required.
Again PDF2SVG is addressing these - any community involvement is valued.

Yes, that's the only workaround I know. Create an image for each page and
use some OCR software to get the text out of it.
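
If you want to decide automatically when to fall back to OCR, one possible
approach is to check how much of the extracted text looks like junk. The
following is only a rough sketch in plain Java; the class name, the method
names and the threshold are my own assumptions, not part of PDFBox. It
takes the already extracted string as input and counts control,
private-use, unassigned and replacement characters.

```java
public class JunkTextCheck {

    /**
     * Returns the fraction of non-whitespace code points that are unlikely
     * to be readable text: control characters, private-use code points,
     * unassigned code points and U+FFFD (replacement character).
     */
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces and line breaks are ignored
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    /** Hypothetical decision helper: true if the ratio exceeds the threshold. */
    static boolean needsOcrFallback(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(needsOcrFallback("Hello world", 0.3));
        System.out.println(needsOcrFallback("\u0001\u0002\uE000\uFFFD", 0.3));
    }
}
```

Note that this only catches junk made of unprintable characters. If the
font maps its glyphs to wrong but printable letters, the ratio stays low
and the check won't notice anything.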

BR
Andreas Lehmkühler
