Hi,
So you're using a "special" http client...
Anyway, here's what I just did with the 1.8.9 version:
URL url = new URL("http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf");
InputStream is = url.openStream();
PDDocument doc = PDDocument.load(is);
System.out.println("pages: " + doc.getNumberOfPages());
All output I get is
pages: 2
Btw the two "errors" you mention are warnings about malformed PDFs.
However, the file really does contain a stream with length 66346, and I
don't get that warning. This means that somehow you're not getting the
exact file. Maybe save what you're downloading with your "http client" and
compare it with what you download with a browser. Or try what I did and
see if it works.
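A minimal sketch of that comparison, using only the JDK. The class name DownloadCheck and its helper methods are made up for illustration, and url.openStream() only stands in for whatever "http client" produces your stream; pass the URL and the path of the browser-downloaded copy as arguments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DownloadCheck {

    /** Writes the whole stream to target and returns the number of bytes written. */
    static long saveStream(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the SHA-256 digest of the file as a lowercase hex string. */
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** True if both files have the same size and the same SHA-256 digest. */
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return Files.size(a) == Files.size(b) && sha256(a).equals(sha256(b));
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]);             // the PDF to fetch
        Path browserCopy = Paths.get(args[1]);  // copy saved by hand from a browser
        Path downloaded = Files.createTempFile("download", ".pdf");

        // Replace url.openStream() with the stream from your own client.
        try (InputStream in = url.openStream()) {
            System.out.println("bytes downloaded: " + saveStream(in, downloaded));
        }
        System.out.println("bytes in browser copy: " + Files.size(browserCopy));
        System.out.println("identical: " + sameContent(downloaded, browserCopy));
    }
}
```

If the sizes or digests differ, the client is not delivering the same bytes as the browser, which would explain the stream length warning.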
What version are you using? 1.8.9 is the current one.
Tilman
On 13.12.2014 at 18:41, Walter Kehl wrote:
Hi John, Tilman,
thanks for the reply. Here is some additional information:
- the http client I am using to get the input stream already has a user
agent set. Also, I have already downloaded lots of PDF files with PDFBox
without ever having a problem.
- when I try to load the document remotely from the URL, I get the following
error messages:
18:34:32 WARN BaseParser :: Specified stream length 66346 is
wrong. Fall back to reading stream until 'endstream'.
18:34:35 WARN XrefTrailerResolver :: Did not found XRef object at
specified startxref position 0
- I have written the input stream directly to a file and it was a valid PDF.
I could load it both with an external tool and with PDFBox.
Yes, of course I could always download a file first to a temp file and then
load it into PDFBox. But I think the direct way is more elegant and faster.
I have also debugged a little bit into the code and to me it doesn't look
like PDFBox uses a temporary file, but rather reads directly from the input
stream.... but I might be wrong.
Anyway, thanks for providing such good free software!
Best
Walter
-----Original Message-----
From: John Hewson [mailto:[email protected]]
Sent: Friday, 12 December 2014 18:57
To: [email protected]
Subject: Re: Downloading a pdf file doesn't work
Good point Tilman. Walter, try writing the InputStream to a File and
check that it's a valid PDF.
-- John
On 12 Dec 2014, at 09:50, Tilman Hausherr <[email protected]> wrote:
This sounds more like an HTTP problem. Try setting a user agent like a
browser.
https://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
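For a plain java.net connection, setting the header could look roughly like the sketch below. The class name UserAgentFetch and the user agent string are invented for illustration; any browser-like value can be substituted, and other http clients will have their own way of setting the header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentFetch {

    // Hypothetical browser-like value; not taken from any real browser.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    /** Opens a connection object with the User-Agent header set; nothing is sent yet. */
    static URLConnection openWithUserAgent(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Request properties must be set before the connection is actually made.
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL(args[0]);
        // getInputStream() performs the request with the header set above.
        try (InputStream in = openWithUserAgent(url).getInputStream()) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
            System.out.println("bytes received: " + total);
        }
    }
}
```

The stream returned by getInputStream() can be passed to PDDocument.load(...) in place of the counting loop.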
Tilman
On 12.12.2014 at 11:53, Walter Kehl wrote:
Hi all,
I have the following situation:
I am loading files from the internet with PdfBox using the call
PDDocument document = PDDocument.load( inputStream );
So far it has worked nicely, but I have problems with this file :
http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf
After I load it, it is empty, and the call
document.getNumberOfPages() returns 0.
However when I download the file manually and then load it into
PdfBox, then everything is fine.
Any idea what could be happening? I am currently using PdfBox 1.8.5.
Thanks and Best Regards
Walter