Hi, when parsing a PDF file with PDFBox 0.8.0-incubating using the 'ExtractText' driver, I'm seeing these errors:

Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: rg
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: n
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: re
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: h
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: f
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: J
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: j
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: RG
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: m
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: l
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: S
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: BI
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: EI
Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: W

The PDF in question contains commands inside compressed stream sections such as:

1 1 1 rg

According to table 74 in the PDF spec [1], 'rg' is a perfectly legal color operator; I haven't looked up the others.
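
Going through the remaining operators could be done with a small lookup like the sketch below. The class name OperatorLookup and its methods are hypothetical and not part of PDFBox; the descriptions are paraphrased from the operator summary in the same spec [1], so treat them as a reading aid rather than an authoritative table. The sketch also assumes a simple stream line where the operator is the last whitespace-separated token.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OperatorLookup {
    // Operators reported in the log, with paraphrased meanings (assumption:
    // wording is my own summary of the spec, not quoted from it).
    static final Map<String, String> OPERATORS = new LinkedHashMap<>();
    static {
        OPERATORS.put("rg", "set RGB color for nonstroking operations");
        OPERATORS.put("RG", "set RGB color for stroking operations");
        OPERATORS.put("n", "end path without filling or stroking");
        OPERATORS.put("re", "append rectangle to path");
        OPERATORS.put("h", "close subpath");
        OPERATORS.put("f", "fill path (nonzero winding number rule)");
        OPERATORS.put("J", "set line cap style");
        OPERATORS.put("j", "set line join style");
        OPERATORS.put("m", "begin new subpath");
        OPERATORS.put("l", "append straight line segment to path");
        OPERATORS.put("S", "stroke path");
        OPERATORS.put("BI", "begin inline image object");
        OPERATORS.put("EI", "end inline image object");
        OPERATORS.put("W", "set clipping path (nonzero winding number rule)");
    }

    // Returns the paraphrased meaning, or a marker if the operator is not listed here.
    static String describe(String op) {
        String description = OPERATORS.get(op);
        return description != null ? description : "not in this table";
    }

    // Assumes the operator is the last whitespace-separated token on the line.
    static String operatorOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    public static void main(String[] args) {
        String line = "1 1 1 rg";
        String op = operatorOf(line);
        System.out.println(op + ": " + describe(op));
    }
}
```

All fourteen entries are color, line-style, path, clipping, or inline-image operators; whether skipping them has anything to do with the text output below is something I have not verified.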

The resulting .txt file, btw, contains:

9slashtwothreeslashtwozerozero8

where 'pdftotext' produces:

9/23/2008

It appears that PDFBox is not complete; my question is, how incomplete is it? I'm interested in using the PDFBox plug-in in Nutch. If it's this incomplete, however, I'm wondering whether I'd be better off writing my own pdftotext-based plug-in for Nutch.

- Godmar

[1] http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

