Hi,

On Thu, Jan 7, 2010 at 7:38 AM, Godmar Back <[email protected]> wrote:
> when parsing an PDF file with 0.8.0incubator using the 'ExtractText' driver,
> I'm seeing these errors:
>
> Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine
> processOperator
> INFO: unsupported/disabled operation: rg
> [...]
> According to table 74 in the PDF spec [1], 'rg' is a perfectly legal color
> operator; I haven't looked up the others.

The text extractor in PDFBox has explicitly been instructed to ignore all
color-related operators as they don't affect text extraction. So in this
case the operation is just "disabled", not "unsupported".

Since these log messages are a bit misleading we recently got rid of them
for text extraction. See https://issues.apache.org/jira/browse/PDFBOX-581
for the details.

> The resulting .txt file, btw, contains:
>
> 9slashtwothreeslashtwozerozero8
>
> where 'pdftotext' produces:
>
> 9/23/2008

Hmm, that's interesting. Would you mind filing an issue in
https://issues.apache.org/jira/browse/PDFBOX about this?

BR,

Jukka Zitting
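
To illustrate the "disabled" versus "unsupported" distinction described above, here is a rough, self-contained sketch. It is not PDFBox code: the class `OperatorDispatchSketch`, the `Outcome` enum, and the operator sets are hypothetical names made up for illustration. The only facts taken from elsewhere are the operator names themselves, which are the standard PDF text and color operators. The idea is that a text extractor can keep an explicit list of operators it ignores on purpose, and only complain about operators it has never heard of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the PDFBox implementation.
class OperatorDispatchSketch {

    enum Outcome { HANDLED, DISABLED, UNSUPPORTED }

    // A few text-related operators that a text extractor would act on
    // (assumed subset, for illustration only).
    private static final Set<String> HANDLED =
            Set.of("BT", "ET", "Tf", "Td", "Tm", "Tj", "TJ");

    // Color operators from the PDF spec, ignored on purpose because
    // they do not affect the extracted text.
    private static final Set<String> DISABLED =
            Set.of("CS", "cs", "SC", "SCN", "sc", "scn",
                   "G", "g", "RG", "rg", "K", "k");

    static Outcome classify(String operator) {
        if (HANDLED.contains(operator)) {
            return Outcome.HANDLED;
        }
        if (DISABLED.contains(operator)) {
            return Outcome.DISABLED;
        }
        return Outcome.UNSUPPORTED;
    }

    // Collects one message per truly unknown operator; deliberately
    // ignored operators stay silent.
    static List<String> logMessages(List<String> operators) {
        List<String> messages = new ArrayList<>();
        for (String op : operators) {
            if (classify(op) == Outcome.UNSUPPORTED) {
                messages.add("unsupported operation: " + op);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("BT", "rg", "Tf", "Tj", "ET");
        for (String op : stream) {
            System.out.println(op + " -> " + classify(op));
        }
        System.out.println(logMessages(stream));
    }
}
```

With this split, 'rg' falls into the deliberately ignored group and produces no message, so the log stays quiet unless an operator is genuinely unknown.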

