I have logged a JIRA for the issue. https://issues.apache.org/jira/browse/PDFBOX-5025
On Mon, Nov 23, 2020 at 11:13 PM Cody Holmes <cholmes5...@gmail.com> wrote:

> Hello,
>
> I have found an issue in the latest version of PDFBox where parsing fails
> in the BaseParser when `parseDirObject` parses a number and the following
> string starts with an 'e'.
>
> This is due to the attempt to include numbers stored in scientific
> notation.
>
> I have found one way that seems to resolve this problem is by checking if
> the last character in the read number string is an e or E. If it is then
> removing it from the read string and unreading it from the source allows
> parsing to complete as expected.
>
> ```
> private COSNumber parseCOSNumber() throws IOException
> {
>     ...
>
>     // Remove last character if it is not a number
>     char lastc = buf.charAt(buf.length() - 1);
>     if (lastc == 'e' || lastc == 'E')
>     {
>         buf.deleteCharAt(buf.length() - 1);
>         seqSource.unread(lastc);
>     }
>     return COSNumber.get(buf.toString());
> }
> ```
>
> An example of this error can be seen in PDF.js issue3323.
>
> https://github.com/mozilla/pdf.js/commit/26f5b1b2d37c7b74a073dee75d66fcc04fae10e8
>
> https://github.com/mozilla/pdf.js/blob/4ba28de2608866dcb10d627d77dc19ff3d017c17/test/pdfs/issue3323.pdf
>
> I can contribute the change if needed, but will need to go through the
> contribution guides and run further validation to confirm this change won't
> break any other workflows.
>
> Thanks,
>
> Cody Holmes
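
For anyone following the thread, the proposed workaround can be tried outside PDFBox. The sketch below is a simplified stand-in, not the actual `BaseParser` code: `NumberTokenSketch` and `readNumber` are hypothetical names, `java.io.PushbackReader` takes the place of `seqSource`, and the set of accepted characters (digits, sign, '.', 'e', 'E') is an assumption about how the number loop behaves. It also guards against an empty buffer, which the snippet in the quoted message does not.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical stand-in for the number-reading step; not PDFBox code.
class NumberTokenSketch
{
    // Reads a numeric token. The reader needs a pushback buffer of at
    // least 2 chars: one for the terminator, one for a trailing 'e'/'E'.
    static String readNumber(PushbackReader in) throws IOException
    {
        StringBuilder buf = new StringBuilder();
        int ic = in.read();
        while (ic != -1 && isNumberChar((char) ic))
        {
            buf.append((char) ic);
            ic = in.read();
        }
        // Give back the character that ended the token.
        if (ic != -1)
        {
            in.unread(ic);
        }
        // A trailing 'e'/'E' cannot be an exponent; it most likely starts
        // the next keyword, so strip it and push it back as well.
        if (buf.length() > 0)
        {
            char lastc = buf.charAt(buf.length() - 1);
            if (lastc == 'e' || lastc == 'E')
            {
                buf.deleteCharAt(buf.length() - 1);
                in.unread(lastc);
            }
        }
        return buf.toString();
    }

    // Assumed character set for a number token, including exponent markers.
    private static boolean isNumberChar(char c)
    {
        return Character.isDigit(c) || c == '-' || c == '+'
                || c == '.' || c == 'e' || c == 'E';
    }
}
```

With `new PushbackReader(new StringReader("12endobj"), 2)`, `readNumber` returns `12` and the reader continues at `endobj`, while an input such as `1.5e3 ` is still read whole as `1.5e3`. The sketch only handles a single trailing 'e' or 'E'; other malformed tails would need separate validation.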