I have logged a JIRA for the issue.

https://issues.apache.org/jira/browse/PDFBOX-5025

On Mon, Nov 23, 2020 at 11:13 PM Cody Holmes <cholmes5...@gmail.com> wrote:

> Hello,
>
> I have found an issue in the latest version of PDFBox: parsing fails in
> BaseParser when `parseDirObject` parses a number and the text immediately
> following that number starts with an 'e'.
>
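> This happens because the number parser accepts 'e' and 'E' in order to
> support numbers stored in scientific notation, so the leading 'e' of the
> next token is consumed as part of the number.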
>
> One approach that seems to resolve the problem is to check whether the
> last character of the number string that was read is an 'e' or 'E'. If it
> is, removing it from the string and unreading it from the source allows
> parsing to complete as expected.
>
> ```
>
> private COSNumber parseCOSNumber() throws IOException
> {
>     ...
>
>     // If the number ends with a dangling 'e' or 'E', push it back to the source
>     char lastc = buf.charAt(buf.length() - 1);
>     if (lastc == 'e' || lastc == 'E')
>     {
>         buf.deleteCharAt(buf.length() - 1);
>         seqSource.unread(lastc);
>     }
>     return COSNumber.get(buf.toString());
> }
>
> ```
>
> An example of this error can be seen with the PDF.js test file issue3323.pdf.
>
>
> https://github.com/mozilla/pdf.js/commit/26f5b1b2d37c7b74a073dee75d66fcc04fae10e8
>
>
> https://github.com/mozilla/pdf.js/blob/4ba28de2608866dcb10d627d77dc19ff3d017c17/test/pdfs/issue3323.pdf
>
> I can contribute the change if needed, but will need to go through the
> contribution guides and run further validation to confirm this change won't
> break any other workflows.
>
> Thanks,
>
> Cody Holmes
>
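
For anyone following along, here is a minimal standalone sketch of the
approach described in the quoted message. It does not use PDFBox:
NumberTokenSketch and readNumberToken are hypothetical names,
java.io.PushbackReader stands in for the parser's source, and the set of
characters accepted as part of a number is an assumption based on the
description above. The input "612endobj" is made up for illustration and is
not taken from issue3323.pdf.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Standalone sketch only. These names are hypothetical and are not PDFBox API.
class NumberTokenSketch
{
    // Assumed set of characters that may appear in a number, including the
    // exponent markers needed for scientific notation.
    static boolean isNumberChar(char c)
    {
        return Character.isDigit(c) || c == '-' || c == '+' || c == '.'
                || c == 'e' || c == 'E';
    }

    // Reads one numeric token. If the token ends with a dangling 'e' or 'E',
    // that character is removed and pushed back so the next token can use it.
    // The reader needs a pushback buffer of at least 2 characters.
    static String readNumberToken(PushbackReader in) throws IOException
    {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && isNumberChar((char) c))
        {
            buf.append((char) c);
        }
        if (c != -1)
        {
            in.unread(c); // first character that is not part of the number
        }
        if (buf.length() > 0)
        {
            char last = buf.charAt(buf.length() - 1);
            if (last == 'e' || last == 'E')
            {
                buf.deleteCharAt(buf.length() - 1);
                in.unread(last); // give the 'e' back to the following token
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // Pushback size 2: one slot for the terminator, one for the 'e'.
        PushbackReader in = new PushbackReader(new StringReader("612endobj"), 2);
        System.out.println("number: " + readNumberToken(in));

        StringBuilder rest = new StringBuilder();
        int c;
        while ((c = in.read()) != -1)
        {
            rest.append((char) c);
        }
        System.out.println("rest:   " + rest);
    }
}
```

A number in real scientific notation such as "1.5e3" ends in a digit, so it
passes through unchanged. The sketch only handles a single trailing 'e' or
'E'; a token such as "1e-" would still need separate handling.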
