Re: Junk Characters while Extracting text from pdf file.

2013-02-05 Thread Maruan Sahyoun
Hi Kulbhushan, is it possible to extract the text using Adobe Reader or Adobe Acrobat without the junk characters? If not, PDFBox can't help either. If so, could you open an issue in Jira (https://issues.apache.org/jira/browse/PDFBOX) and attach a sample PDF that lets us reproduce the problem?

Re: Printing a PDF doc with images

2013-02-05 Thread Eliot Kimber
I'm pretty sure I had to build from source. If that's not going to be easy for you, I can provide the jar I built offline, but there's probably a better source. Cheers, Eliot On 2/5/13 2:51 PM, "Alain" wrote: > Eliot, thanks for the reply! > > I am currently running 1.7.1, where did you find r

Re: Fwd: Junk Characters while Extracting text from pdf file.

2013-02-05 Thread Peter Murray-Rust
On Tue, Feb 5, 2013 at 6:36 PM, Andreas Lehmkuehler wrote: > Hi, > > On 05.02.2013 15:01, kulbhushan singh wrote: > > Hi, >> >> I am trying to extract text from a PDF file with custom fonts but it is >> giving me junk characters. The fonts used are ArialMT (embedded subset) & >> Arial-BoldMT (e

Re: Printing a PDF doc with images

2013-02-05 Thread Alain
Eliot, thanks for the reply! I am currently running 1.7.1. Where did you find release 1.8? Alain From: Eliot Kimber To: "users@pdfbox.apache.org" ; Alain Sent: Tuesday, February 5, 2013 3:46 PM Subject: Re: Printing a PDF doc with images Not sure if it'

Re: Printing a PDF doc with images

2013-02-05 Thread Eliot Kimber
Not sure if it's the same issue, but I ran into a problem with scanned images that used an overlay mask. Those images are not handled correctly by PDFBox 1.7.1, but the handling was corrected in 1.8 sometime last November. So you might try using the latest 1.8 build and see if it resolves your

Junk Characters while Extracting text from pdf file.

2013-02-05 Thread kulbhushan singh
Hi, I am trying to extract text from a PDF file with custom fonts, but it is giving me junk characters. The fonts used are ArialMT (embedded subset) & Arial-BoldMT (embedded subset). The producer of the PDF file is GPL Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can I do it for

Re: Getting Out of Memory Error when trying to parse and extract text of 8 MB PDF Document

2013-02-05 Thread Maruan Sahyoun
If you set the system property "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal" to true, page content is parsed only when you request the page. In addition, the default behavior also creates fewer objects. Although this is not the behavior you asked for, i.e. streaming like StAX
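As a rough illustration of the system property mentioned in this message, here is a minimal sketch that uses only the standard Java library to set the property before any parsing starts. The property name is quoted from the message; the class name `ParseMinimalProperty` and the method `enableParseMinimal` are hypothetical, and whether a given PDFBox build honors the property is an assumption to verify against that version's documentation.

```java
class ParseMinimalProperty {
    // Property name as quoted in the message above.
    static final String PARSE_MINIMAL =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal";

    // Sets the property for the current JVM and reports whether it now reads as true.
    // Call this before the parser is created (assumption: the parser reads the
    // property when it is constructed, not on every page access).
    static boolean enableParseMinimal() {
        System.setProperty(PARSE_MINIMAL, "true");
        return Boolean.getBoolean(PARSE_MINIMAL);
    }

    public static void main(String[] args) {
        System.out.println(PARSE_MINIMAL + " = " + enableParseMinimal());
    }
}
```

The same property can be supplied without code changes by passing `-Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal=true` on the `java` command line.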

Re: Getting Out of Memory Error when trying to parse and extract text of 8 MB PDF Document

2013-02-05 Thread Andreas Lehmkuehler
Hi, On 05.02.2013 15:20, VIGNESH S wrote: I think the non-sequential PDF parser also loads every object into the object pool. The difference, I think, is that the non-sequential parser reads the xref table in the trailer to learn the PDF structure instead of traversing the document linearly. Yes, it works different

Re: Fwd: Junk Characters while Extracting text from pdf file.

2013-02-05 Thread Andreas Lehmkuehler
Hi, On 05.02.2013 15:01, kulbhushan singh wrote: Hi, I am trying to extract text from a PDF file with custom fonts, but it is giving me junk characters. The fonts used are ArialMT (embedded subset) & Arial-BoldMT (embedded subset). The producer of the PDF file is GPL Ghostscript 8.15. I am using

Re: Getting Out of Memory Error when trying to parse and extract text of 8 MB PDF Document

2013-02-05 Thread VIGNESH S
I think the non-sequential PDF parser also loads every object into the object pool. The difference, I think, is that the non-sequential parser reads the xref table in the trailer to learn the PDF structure instead of traversing the document linearly. Correct me if I am wrong. On Sat, Feb 2, 2013 at 11:58 AM, Maruan Sah

Fwd: Junk Characters while Extracting text from pdf file.

2013-02-05 Thread kulbhushan singh
Hi, I am trying to extract text from a PDF file with custom fonts, but it is giving me junk characters. The fonts used are ArialMT (embedded subset) & Arial-BoldMT (embedded subset). The producer of the PDF file is GPL Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can I do it for