I'm trying to examine an existing PDF file. Initially I want to extract text and maybe images, but ultimately I want to apply some logic to the formatting of the text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I would start like this:

    PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
    Object token = sParse.parseNextToken();
    while (token != null) {
        logger.info("token: " + token);
        token = sParse.parseNextToken();
    }

That yields:

    file size: 5289793
    token: COSInt{6066}
    token: COSInt{0}
    token: PDFOperator{obj}
    token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
    token: PDFOperator{stream}
    token: PDFOperator{hÞìÛ}
    token: COSNull{}
    token: PDFOperator{ ·½'à¯R—» '"Y¬}
    token: COSInt{7}
    token: PDFOperator{àà}
    Error trying to process request
    java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
        at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)

I'm using PDFBox 2.0.19. I'm probably doing this wrong at many levels. When I went to look at the samples on the web site, the classes in the 1.8 samples don't exist any more. The link to the sources for 2.0 samples actually has 3.0 samples, whose classes don't exist yet. So I just kind of bumbled along looking at the source code and guessing.

If I had to guess what I'm seeing, everything looks good up until PDFOperator{stream}, after which, it looks like all garbage until it blows up.

What do I do now? Is there an example somewhere of how I should be doing this that you could just point me to? My sample file opens well in the Ubuntu 18.04 PDF viewer.

-- 
Glen K. Peterson
(828) 393-0081
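
P.S. A note on the "garbage" after PDFOperator{stream}: the dictionary in the output lists Filter as FlateDecode, which means the bytes that follow are zlib-compressed, so they would not be readable as tokens until they are inflated. The sketch below only illustrates that point with java.util.zip from the JDK; it does not use PDFBox and is not a fix for the parsing problem. The class name FlateDemo, its two helper methods, and the sample string are all made up for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, for illustration only (not part of PDFBox).
class FlateDemo {

    // Compress bytes into zlib format, the format a FlateDecode stream stores.
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        deflater.setInput(plain);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse the compression; fails on truncated or non-zlib input.
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                inflater.end();
                throw new DataFormatException("truncated or invalid zlib data");
            }
            out.write(buf, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        // Made-up sample text standing in for the uncompressed stream body.
        byte[] plain = "<< /Type /Catalog /Pages 2 0 R >>".getBytes(StandardCharsets.US_ASCII);
        byte[] stored = deflate(plain);

        // The stored form prints as unreadable characters, much like the tokens above.
        System.out.println("stored:   " + new String(stored, StandardCharsets.ISO_8859_1));
        // After inflating, the original text is back.
        System.out.println("inflated: " + new String(inflate(stored), StandardCharsets.US_ASCII));
    }
}
```

Running main prints the compressed bytes as unreadable characters and then the original text, which matches the pattern in my output: readable tokens up to the stream keyword, then noise.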