Hi,

when parsing a PDF file with PDFBox 0.8.0-incubating using the 'ExtractText' driver,
I'm seeing these errors:

Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: rg
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: n
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: re
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: h
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: f
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: J
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: j
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: RG
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: m
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: l
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: S
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BI
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: W

The PDF in question contains commands inside compressed stream sections such
as:

1 1 1 rg

According to Table 74 in the PDF spec [1], 'rg' is a perfectly legal color
operator; I haven't looked up the others.

The resulting .txt file, btw, contains:

9slashtwothreeslashtwozerozero8

where 'pdftotext' produces:

9/23/2008
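
The garbled string looks like glyph names ('slash', 'two', 'three', 'zero')
written out in place of the characters they denote. To make that guess
concrete, here is a minimal sketch; the class name GlyphNameDecoder and its
four-entry table are my own illustration, not anything taken from PDFBox, and
the table covers only the names seen in the output above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GlyphNameDecoder {
    // Hypothetical, partial table: only the glyph names that appear in the
    // garbled output above. A real table would be far larger.
    private static final Map<String, Character> GLYPHS = new LinkedHashMap<>();
    static {
        GLYPHS.put("slash", '/');
        GLYPHS.put("zero", '0');
        GLYPHS.put("two", '2');
        GLYPHS.put("three", '3');
    }

    // Replaces each known glyph name with its character, preferring the
    // longest name that matches at the current position; all other
    // characters are copied through unchanged.
    public static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String matched = null;
            for (String name : GLYPHS.keySet()) {
                if (text.startsWith(name, i)
                        && (matched == null || name.length() > matched.length())) {
                    matched = name;
                }
            }
            if (matched != null) {
                out.append(GLYPHS.get(matched));
                i += matched.length();
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints 9/23/2008
        System.out.println(decode("9slashtwothreeslashtwozerozero8"));
    }
}
```

Running it on the PDFBox output gives the same string that 'pdftotext'
produces, which is at least consistent with the glyph-name explanation.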

It appears that PDFBox is not complete; my question is, how incomplete is
it? I'm interested in using the PDFBox plug-in in Nutch. If it's this
incomplete, however, I'm wondering whether I'd be better off writing my own
pdftotext-based plug-in for Nutch.

 - Godmar

[1] http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
