I'm trying to examine an existing PDF file.  Initially I want to extract
text and maybe images, but ultimately I want to apply some logic to the
formatting of the text to turn it into valid HTML with H1, H2, ul, li, etc.
I thought I would start like this:

PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
Object token = sParse.parseNextToken();
while (token != null) {
    logger.info("token: " + token);
    token = sParse.parseNextToken();
}

That yields:

file size: 5289793
token: COSInt{6066}
token: COSInt{0}
token: PDFOperator{obj}
token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
token: PDFOperator{stream}
token: PDFOperator{hÞìÛ}
token: COSNull{}
token: PDFOperator{ ·½'à¯R—» '"Y¬}
token: COSInt{7}
token: PDFOperator{àà}
Error trying to process request
java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)

I'm using PDFBox 2.0.19.

I'm probably doing this wrong on many levels.  When I went to look at the
samples on the web site, the classes in the 1.8 samples don't exist any
more.  The link to the sources for the 2.0 samples actually leads to the 3.0
samples, whose classes don't exist yet.  So I just kind of bumbled along,
looking at the source code and guessing.

If I had to guess what I'm seeing, everything looks good up until
PDFOperator{stream}, after which it all looks like garbage until it
blows up.  What do I do now?
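
One possible reading of that, going only by the dictionary in the output:
it says Filter is FlateDecode, so the bytes after the stream keyword would
be zlib-compressed data rather than tokens, and "hÞ" (0x68 0xDE) happens to
be a valid zlib header.  Below is a minimal sketch of that effect using only
java.util.zip, not PDFBox.  The class name FlateDemo and the sample payload
are made up for illustration, and this is an assumption about the cause,
not a confirmed diagnosis:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: shows that zlib ("Flate") data looks like noise
// until it is inflated. It does not use or imitate PDFBox internals.
class FlateDemo {

    // Compress raw bytes into zlib format, the format FlateDecode refers to.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data; bad input is reported as an unchecked exception.
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported zlib data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up payload that resembles a tiny PDF content stream.
        byte[] raw = "BT /F1 12 Tf (Hello World) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = deflate(raw);
        // Printed as Latin-1, the compressed bytes look like the "garbage" tokens.
        System.out.println("compressed: " + new String(packed, StandardCharsets.ISO_8859_1));
        System.out.println("inflated:   " + new String(inflate(packed), StandardCharsets.ISO_8859_1));
    }
}
```

If that reading is right, a token-level parser fed the raw file would run
into the same kind of noise at every compressed stream, which would fit the
output above.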

Is there an example somewhere of how I should be doing this that you could
just point me to?  My sample file opens well in the Ubuntu 18.04 PDF viewer.

-- 
Glen K. Peterson
(828) 393-0081
