Dear Glen,

PDFStreamParser is only for parsing PDF content streams (that is, specific parts of a PDF), not the complete PDF file. As a starting point, take a look at CustomGraphicsStreamEngine and/or CustomPageDrawer in the examples package.
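To make the distinction concrete: a page content stream is a flat sequence of operands, each group followed by the operator that consumes it, whereas the file as a whole is a container of objects (often compressed). The following is only a toy sketch in plain Java of that operand/operator shape. It is not PDFBox's parser; the class name ContentStreamSketch and its methods are made up for illustration, and it only understands numbers, names and simple literal strings (no arrays, dictionaries, hex strings, escapes or inline images).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of PDFBox.
public class ContentStreamSketch {

    /** One instruction: the operands that precede an operator, plus the operator itself. */
    public static final class Instruction {
        public final List<String> operands;
        public final String operator;

        Instruction(List<String> operands, String operator) {
            this.operands = operands;
            this.operator = operator;
        }

        @Override
        public String toString() {
            return operator + " " + operands;
        }
    }

    /** Splits on whitespace, keeping a simple "(...)" literal string as one token. */
    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int start = i;
            if (c == '(') {
                // Read up to the first ')'; nested parentheses and escapes are not handled.
                i = s.indexOf(')', i);
                if (i < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                i++;
            } else {
                while (i < s.length()
                        && !Character.isWhitespace(s.charAt(i))
                        && s.charAt(i) != '(') {
                    i++;
                }
            }
            tokens.add(s.substring(start, i));
        }
        return tokens;
    }

    /** Simplified test: strings, names and numbers are operands; anything else is an operator. */
    static boolean isOperand(String t) {
        char c = t.charAt(0);
        return c == '(' || c == '/' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    /** Groups tokens into instructions: operands accumulate until an operator closes the group. */
    public static List<Instruction> parse(String s) {
        List<Instruction> out = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String t : tokenize(s)) {
            if (isOperand(t)) {
                operands.add(t);
            } else {
                out.add(new Instruction(operands, t));
                operands = new ArrayList<>();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A tiny hand-written content stream: begin text, set font, move, show text, end text.
        String stream = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET";
        for (Instruction ins : parse(stream)) {
            System.out.println(ins);
        }
    }
}
```

With PDFBox 2.0 itself you would not write a tokenizer. As far as I can tell, the corresponding route is to open the file with PDDocument.load(...), and then for each PDPage create a new PDFStreamParser(page), call parse() and read the result from getTokens().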
Also, PDFTextStripper will give you some ideas on how to process a PDF.

BR
Maruan

> I'm trying to examine an existing PDF file. Initially to extract text and
> maybe images, but ultimately to apply some logic to the formatting of the
> text to turn it into valid HTML with H1, H2, ul, li, etc. I thought I
> would start like this:
>
> PDFStreamParser sParse = new PDFStreamParser(fileItem.get());
> Object token = sParse.parseNextToken();
> while (token != null) {
>     logger.info("token: " + token);
>     token = sParse.parseNextToken();
> }
>
> That yields:
>
> file size: 5289793
> token: COSInt{6066}
> token: COSInt{0}
> token: PDFOperator{obj}
> token: COSDictionary{COSName{Filter}:COSName{FlateDecode};COSName{First}:COSInt{1193};COSName{Length}:COSInt{12594};COSName{N}:COSInt{98};COSName{Type}:COSName{ObjStm};}
> token: PDFOperator{stream}
> token: PDFOperator{hÞìÛ}
> token: COSNull{}
> token: PDFOperator{ ·½'à¯R—» '"Y¬}
> token: COSInt{7}
> token: PDFOperator{àà}
> Error trying to process request
> java.io.IOException: Error: Expected operator 'ID' actual='I6' at stream offset 125
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:311)
>
> I'm using PDFBox 2.0.19.
>
> I'm probably doing this wrong at many levels. When I went to look at the
> samples on the web site, the classes in the 1.8 samples don't exist any
> more. The link to the sources for 2.0 samples actually has 3.0 samples,
> whose classes don't exist yet. So I just kind of bumbled along looking at
> the source code and guessing.
>
> If I had to guess what I'm seeing, everything looks good up
> until PDFOperator{stream}, after which, it looks like all garbage until it
> blows up. What do I do now?
>
> Is there an example somewhere of how I should be doing this that you could
> just point me to? My sample file opens well in the Ubuntu 18.04 PDF viewer.
--
Maruan Sahyoun
FileAffairs GmbH