On Thu, Jan 7, 2010 at 5:43 AM, Jukka Zitting <[email protected]> wrote:

> Hi,
>
> On Thu, Jan 7, 2010 at 7:38 AM, Godmar Back <[email protected]> wrote:
> > when parsing a PDF file with 0.8.0incubator using the 'ExtractText' driver,
> > I'm seeing these errors:
> >
> > Jan 7, 2010 12:32:27 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> > INFO: unsupported/disabled operation: rg
> > [...]
> > According to table 74 in the PDF spec [1], 'rg' is a perfectly legal color
> > operator; I haven't looked up the others.
>
> The text extractor in PDFBox has explicitly been instructed to ignore
> all color-related operators as they don't affect text extraction. So
> in this case the operation is just "disabled", not "unsupported".
>
>
I only saw that in Resources/PDFTextStripper.properties after I sent the
email. My bad, sorry for not looking more closely.


> Since these log messages are a bit misleading we recently got rid of
> them for text extraction. See
> https://issues.apache.org/jira/browse/PDFBOX-581 for the details.
>
> > The resulting .txt file, btw, contains:
> >
> > 9slashtwothreeslashtwozerozero8
> >
> > where 'pdftotext' produces:
> >
> > 9/23/2008
>
> Hmm, that's interesting. Would you mind filing an issue in
> https://issues.apache.org/jira/browse/PDFBOX about this?
>
>
Sure, see https://issues.apache.org/jira/browse/PDFBOX-595
I also filed an issue about the parser problem at
https://issues.apache.org/jira/browse/PDFBOX-592

But I'd much rather help fix this. Could you point me to where to look?
It seems that somewhere the name of a character is printed where its code
should be; where would that happen?
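
To make the symptom concrete, here is a minimal sketch of the lookup that
appears to be going wrong. It is not PDFBox code: the class GlyphNameSketch,
its methods, and the small hand-written table are hypothetical, and the
fallback to the raw name is only a guess at how "slash" and "two" could end
up in the output.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from PDFBox.
public class GlyphNameSketch {
    // Hand-written subset of glyph-name-to-character pairs (digits and slash).
    private static final Map<String, String> GLYPH_TO_CHAR = new HashMap<String, String>();
    static {
        String[] digits = {"zero", "one", "two", "three", "four",
                           "five", "six", "seven", "eight", "nine"};
        for (int i = 0; i < digits.length; i++) {
            GLYPH_TO_CHAR.put(digits[i], String.valueOf(i));
        }
        GLYPH_TO_CHAR.put("slash", "/");
    }

    // Maps one glyph name to its character. Unknown names fall through
    // unchanged, which is the assumed failure mode, not a confirmed one.
    public static String toText(String glyphName) {
        String mapped = GLYPH_TO_CHAR.get(glyphName);
        return mapped != null ? mapped : glyphName;
    }

    // Concatenates the mapped characters for a sequence of glyph names.
    public static String render(List<String> glyphNames) {
        StringBuilder sb = new StringBuilder();
        for (String name : glyphNames) {
            sb.append(toText(name));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("nine", "slash", "two", "three",
                "slash", "two", "zero", "zero", "eight");
        System.out.println(render(names)); // prints 9/23/2008
    }
}
```

With the table in place the names render as 9/23/2008; if the entries for
"slash" and the digits were missing, the same call would emit the raw names,
which resembles the output I'm getting.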

 - Godmar
