On Thu, Jan 7, 2010 at 5:43 AM, Jukka Zitting <[email protected]> wrote:
> Hi,
>
> On Thu, Jan 7, 2010 at 7:38 AM, Godmar Back <[email protected]> wrote:
> > when parsing a PDF file with 0.8.0incubator using the 'ExtractText' driver,
> > I'm seeing these errors:
> >
> > Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> > INFO: unsupported/disabled operation: rg
> > [...]
> > According to table 74 in the PDF spec [1], 'rg' is a perfectly legal color
> > operator; I haven't looked up the others.
>
> The text extractor in PDFBox has explicitly been instructed to ignore
> all color-related operators as they don't affect text extraction. So
> in this case the operation is just "disabled", not "unsupported".
>

I looked at Resources/PDFTextStripper.properties and saw that after I sent
the email. My bad, sorry for not looking more closely.

> Since these log messages are a bit misleading we recently got rid of
> them for text extraction. See
> https://issues.apache.org/jira/browse/PDFBOX-581 for the details.
>
> > The resulting .txt file, btw, contains:
> >
> > 9slashtwothreeslashtwozerozero8
> >
> > where 'pdftotext' produces:
> >
> > 9/23/2008
>
> Hmm, that's interesting. Would you mind filing an issue in
> https://issues.apache.org/jira/browse/PDFBOX about this?
>

Sure, see https://issues.apache.org/jira/browse/PDFBOX-595

I also filed an issue about the parser problem at
https://issues.apache.org/jira/browse/PDFBOX-592

But I'd much rather help fix this. Could you point me to where to look? It
seems that somewhere the name of a character is printed when its code should
be; where would that happen?

 - Godmar
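P.S. To make the symptom concrete, here is a minimal sketch of what I suspect is going wrong. It is not PDFBox code: the class `GlyphNameSketch`, the method `toText`, and the tiny lookup table are all hypothetical, made up for illustration. The only facts assumed are that "slash", "zero", "two" and "three" are the standard glyph names for '/', '0', '2' and '3'. When a glyph name has no entry in the table, the sketch falls back to emitting the name itself, which reproduces the output I am seeing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from PDFBox.
public class GlyphNameSketch {

    // Small subset of a glyph-name-to-character table (assumed for this example).
    private static final Map<String, String> GLYPH_TO_TEXT = new HashMap<>();
    static {
        GLYPH_TO_TEXT.put("slash", "/");
        GLYPH_TO_TEXT.put("zero", "0");
        GLYPH_TO_TEXT.put("two", "2");
        GLYPH_TO_TEXT.put("three", "3");
    }

    // Converts a sequence of glyph names to text. A name without a table
    // entry is emitted unchanged, so a missing mapping leaks the glyph name
    // into the extracted text.
    public static String toText(List<String> glyphNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : glyphNames) {
            String mapped = GLYPH_TO_TEXT.get(name);
            sb.append(mapped != null ? mapped : name);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> glyphs = Arrays.asList(
                "9", "slash", "two", "three", "slash", "two", "zero", "zero", "8");
        System.out.println(toText(glyphs)); // prints 9/23/2008
    }
}
```

With an empty table the same input comes out as 9slashtwothreeslashtwozerozero8, so my guess is that some lookup of this kind is being skipped or is missing entries for this font's encoding.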

