Re: Problem reading a PDF file

Maruan Sahyoun Fri, 24 Apr 2020 07:15:39 -0700

Dear Glen,

PDFStreamParser is only for parsing PDF content streams (that is, specific parts
of a PDF), not the complete file. As a starting point, take a look at
CustomGraphicsStreamEngine and/or CustomPageDrawer in the examples package.
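
On why the tokens turn into garbage after PDFOperator{stream}: the dictionary in
your output has Filter FlateDecode, so the bytes that follow are zlib-compressed
data rather than readable operators, and PDFStreamParser is not meant to be run
over them raw. A minimal sketch using only java.util.zip (no PDFBox involved; the
class name FlateDemo and the sample content stream are made up for illustration)
shows the effect:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of PDFBox.
class FlateDemo {

    // Compress bytes the way a FlateDecode stream is stored (zlib/deflate).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Undo the compression; this is the step that must happen before parsing.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A tiny, made-up page content stream: show "Hello" in font F1.
        byte[] content = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] stored = deflate(content);
        // The stored form is binary, much like the tokens after "stream" in your log.
        System.out.println(new String(stored, StandardCharsets.ISO_8859_1));
        // After inflating, the operators are readable again.
        System.out.println(new String(inflate(stored), StandardCharsets.US_ASCII));
    }
}
```

PDFBox does this decoding for you when you load the file as a PDDocument and work
with a page's content stream, which is what the examples mentioned above do.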


PDFTextStripper will also give you some ideas on how to process a PDF.
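
For the HTML goal, one possible starting point is to subclass PDFTextStripper and
record the font size of each chunk of text it writes out, then map the larger
sizes to headings. This is only a sketch against the 2.0.x API: FontSizeStripper
and describe are hypothetical names, and how sizes map to H1, H2 and so on is
left open.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical helper: collects "<size>pt: <text>" for each chunk of text.
class FontSizeStripper extends PDFTextStripper {

    private final List<String> runs = new ArrayList<>();

    FontSizeStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions)
            throws IOException {
        if (!textPositions.isEmpty()) {
            // Use the first glyph's size as a stand-in for the whole chunk.
            float size = textPositions.get(0).getFontSizeInPt();
            runs.add(size + "pt: " + text);
        }
        super.writeString(text, textPositions);
    }

    // Runs the stripper over an already loaded document and returns the chunks.
    static List<String> describe(PDDocument document) throws IOException {
        FontSizeStripper stripper = new FontSizeStripper();
        stripper.setSortByPosition(true);
        stripper.getText(document); // triggers the writeString callbacks
        return stripper.runs;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF; the document is closed automatically.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (String run : describe(document)) {
                System.out.println(run);
            }
        }
    }
}
```

Since you already have the upload as a byte array, PDDocument.load(fileItem.get())
should work in place of the File variant.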

BR
Maruan
  
> I'm trying to examine an existing PDF file.  Initially to extract text and
> maybe images, but ultimately to apply some logic to the formatting of the
> text to turn it into valid HTML with H1, H2, ul, li, etc.  I thought I
> would start like this:
> 
> PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
> Object token = sParse.parseNextToken();
> while (token != null) {
>     logger.info("token: " + token);
>     token = sParse.parseNextToken();
> }
> 
> That yields:
> 
> file size: 5289793
> token: COSInt{6066}
> token: COSInt{0}
> token: PDFOperator{obj}
> token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
> token: PDFOperator{stream}
> token: PDFOperator{hÞìÛ}
> token: COSNull{}
> token: PDFOperator{ ·½'à¯R—» '"Y¬}
> token: COSInt{7}
> token: PDFOperator{àà}
> Error trying to process request
> java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)
> 
> I'm using PDFBox 2.0.19.
> 
> I'm probably doing this wrong at many levels.  When I went to look at the
> samples on the web site, the classes in the 1.8 samples don't exist any
> more.  The link to the sources for 2.0 samples actually has 3.0 samples,
> whose classes don't exist yet.  So I just kind of bumbled along looking at
> the source code and guessing.
> 
> If I had to guess what I'm seeing, everything looks good up
> until PDFOperator{stream}, after which, it looks like all garbage until it
> blows up.  What do I do now?
> 
> Is there an example somewhere of how I should be doing this that you could
> just point me to?  My sample file opens well in the Ubuntu 18.04 PDF viewer.
> 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827


