Hello,

I have found an issue in the latest version of PDFBox: parsing fails in
BaseParser when `parseDirObject` parses a number and the text immediately
following it starts with an 'e'.

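This happens because the number parser also accepts 'e' and 'E' in order to
support numbers stored in scientific notation, so the leading 'e' of the
following text is read as part of the number.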

One change that seems to resolve the problem is to check whether the last
character of the number string that was read is an 'e' or 'E'. If it is,
removing it from the string and unreading it from the source allows parsing
to complete as expected.

```

private COSNumber parseCOSNumber() throws IOException
{
    ...

    // A trailing 'e' or 'E' has no exponent digits, so it is not part of
    // the number: remove it and push it back for the next token
    char lastc = buf.charAt(buf.length() - 1);
    if (lastc == 'e' || lastc == 'E')
    {
        buf.deleteCharAt(buf.length() - 1);
        seqSource.unread(lastc);
    }
    return COSNumber.get(buf.toString());
}

```
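
To illustrate the idea outside of PDFBox, here is a minimal standalone
sketch. It is not the BaseParser code: the class name, `readNumber`,
`isNumberChar` and the sample input "0endobj" are hypothetical and only used
for illustration, and a plain `java.io.PushbackReader` stands in for the
parser's source.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class TrailingExponentSketch
{
    // Hypothetical, simplified test for characters that may appear in a
    // number, including 'e'/'E' so that scientific notation is accepted.
    static boolean isNumberChar(char c)
    {
        return Character.isDigit(c) || c == '+' || c == '-' || c == '.'
                || c == 'e' || c == 'E';
    }

    // Reads a number greedily, then gives back a dangling 'e'/'E' that
    // actually belongs to whatever follows the number.
    public static String readNumber(PushbackReader in) throws IOException
    {
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && isNumberChar((char) c))
        {
            buf.append((char) c);
        }
        if (c != -1)
        {
            in.unread(c); // first character that is not part of the number
        }
        if (buf.length() > 0)
        {
            char lastc = buf.charAt(buf.length() - 1);
            if (lastc == 'e' || lastc == 'E')
            {
                buf.deleteCharAt(buf.length() - 1);
                in.unread(lastc); // pushed back last, so it is read first
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException
    {
        // Pushback size > 1 because up to two characters may be unread.
        PushbackReader in = new PushbackReader(new StringReader("0endobj"), 4);
        System.out.println("number: " + readNumber(in));
        System.out.println("next char: " + (char) in.read());
    }
}
```

The sketch only handles a single trailing 'e' or 'E', the same as the change
above. An input that ends in something like "1e-" would need extra handling.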

An example of this error can be seen with the PDF.js test file issue3323.pdf:

https://github.com/mozilla/pdf.js/commit/26f5b1b2d37c7b74a073dee75d66fcc04fae10e8

https://github.com/mozilla/pdf.js/blob/4ba28de2608866dcb10d627d77dc19ff3d017c17/test/pdfs/issue3323.pdf

I can contribute the change if needed, but I will first need to go through
the contribution guidelines and run further validation to confirm that it
does not break any other workflows.

Thanks,

Cody Holmes
