Thanks, Tilman.
I did notice both PDFBOX-5727 and the pending optimization pull request
that you mentioned. It's interesting (and counterintuitive) to me that
we'd start experiencing severe performance issues after the switch to
CRC32 and the fix for "skipexception." But that's what we're seeing. (We
have not tested on 2.0.30, so who knows, that might have been worse.)
FWIW, we expect to move to PDFBox 3.0.x shortly, as soon as a Tika
release supports it, so a performance fix there would be fine with us.
Yes, in theory, we could "prepare" the test images with .pdfbox.cache
and potentially other artifacts, but we'd prefer not to. The more we
populate upfront, the less functionality we're actually testing on an
ongoing basis, and the more we deviate from what production servers
experience. In this case, we'd likely never have noticed that cache
population could take up to four minutes on our customers' production
deployments; this is critical information to have. There's also the
likelihood that the format will change and the file will need to be
re-generated (e.g., that optimization PR will change the format, won't it?).
I'll look at enabling more logging, and determining the font count and
size on these instances, and report back if there's interest.
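For the font count and size, something like the following could be run on an instance. This is only a sketch: FontStats is a hypothetical helper (not part of PDFBox), the list of extensions is an assumption about what counts as a font file, and C:\Windows\Fonts is just the assumed default directory. PDFBox may scan other locations as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of PDFBox: counts font files under a
// directory (recursively) and sums their sizes.
class FontStats {
    final long count;
    final long totalBytes;

    FontStats(long count, long totalBytes) {
        this.count = count;
        this.totalBytes = totalBytes;
    }

    // Assumption: these extensions approximate what a font scanner would pick up.
    static boolean looksLikeFont(Path p) {
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf")
                || name.endsWith(".ttc") || name.endsWith(".pfb");
    }

    // Returns zero counts if the directory does not exist.
    static FontStats scan(Path dir) {
        if (!Files.isDirectory(dir)) {
            return new FontStats(0, 0);
        }
        try (Stream<Path> files = Files.walk(dir)) {
            List<Path> fonts = files.filter(Files::isRegularFile)
                    .filter(FontStats::looksLikeFont)
                    .collect(Collectors.toList());
            long bytes = 0;
            for (Path font : fonts) {
                bytes += Files.size(font);
            }
            return new FontStats(fonts.size(), bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "C:\\Windows\\Fonts");
        FontStats s = scan(dir);
        System.out.printf("%d font files, %.1f MB in %s%n",
                s.count, s.totalBytes / (1024.0 * 1024.0), dir);
    }
}
```

Running this on both a test agent and a laptop would show whether the slow instances simply have far more (or larger) fonts.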
Adam
On 6/24/2024 8:16 PM, Tilman Hausherr wrote:
Hi,
There was a poorly thought-out change that made it all slower (using an
SHA512 checksum for each font), but that was fixed in 2.0.31 with a much
faster checksum method (this isn't crypto, so CRC32 is enough):
https://issues.apache.org/jira/browse/PDFBOX-5727
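To illustrate the difference between the two checksum types, here is a rough sketch using only the JDK. It is not the PDFBox code: ChecksumCompare is a made-up class, the 8 MB random buffer merely stands in for a font file, and how PDFBox actually feeds font data into its checksum is not shown in this thread.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;
import java.util.zip.CRC32;

// Hypothetical comparison, not PDFBox code: times a CRC32 checksum against
// a SHA-512 digest over the same buffer.
class ChecksumCompare {
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    static byte[] sha512(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-512").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 not available", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: 8 MB of random bytes as a stand-in for a large font file.
        byte[] fakeFont = new byte[8 * 1024 * 1024];
        new Random(42).nextBytes(fakeFont);

        long t0 = System.nanoTime();
        long crc = crc32(fakeFont);
        long t1 = System.nanoTime();
        byte[] sha = sha512(fakeFont);
        long t2 = System.nanoTime();

        System.out.printf("CRC32   %08x in %d ms%n", crc, (t1 - t0) / 1_000_000);
        System.out.printf("SHA-512 %d bytes in %d ms%n", sha.length, (t2 - t1) / 1_000_000);
    }
}
```

A single run like this is not a proper benchmark (no JIT warm-up), so the numbers should be treated as a rough indication only.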
I don't know anything about AWS. Are these image-based containers that
can be "prepared"? If so, it's best to include the appropriate
.pdfbox.cache file in the image.
Yes, it is related to the count and size of fonts. You could have a
look at the .pdfbox.cache file on those instances and compare it to
your own. Adding debug logging could also help.
There is a pending pull request, not yet reviewed or committed, that
would allow skipping the checksum and is also expected to be
faster.
https://github.com/apache/pdfbox/pull/189
However, I don't know if that one can also be ported to 2.0.32 (which
was meant to be released soon anyway).
Tilman
On 24.06.2024 23:48, Adam Rauch wrote:
Greetings,
We use PDFBox alongside Tika to support full-text search indexing and
querying. Our Windows test agents (fairly powerful AWS instances)
began timing out many tests after we upgraded PDFBox from 2.0.29 to
2.0.31. We tracked the problem down to the on-disk font cache
population process, which is taking between two and four MINUTES to
complete on these instances. For test consistency purposes, these
agents are fairly "clean" when they start up; they don't have an
on-disk font cache so it's created the first time a PDF is parsed.
This has never been a problem before.
Our temporary workaround is to force the font cache to be generated
in the background at server startup. We call
FontMapperImpl.getProvider() via reflection; I wish there were a
cleaner way to do this, but it gets the job done. We are risking a
race condition here, however, since the tests could easily start
indexing PDFs before the cache is written.
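A minimal sketch of how the workaround and the race could be handled together: run the warm-up in the background, and have indexing code wait on it before parsing the first PDF. FontCacheWarmup is a hypothetical class, not our actual code. The reflective part assumes that FontMappers.instance() returns the default mapper and that its class has a no-argument getProvider() method, which is my understanding of PDFBox 2.0.x, not something guaranteed by a public API.

```java
import java.lang.reflect.Method;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a warm-up task in the background and lets
// callers block until it has finished.
class FontCacheWarmup {
    private final CompletableFuture<Long> done;

    // Starts the task immediately; the future completes with elapsed milliseconds.
    FontCacheWarmup(Runnable task) {
        this.done = CompletableFuture.supplyAsync(() -> {
            long start = System.nanoTime();
            task.run();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        });
    }

    // Indexing code would call this before parsing its first PDF.
    long awaitMillis(long timeout, TimeUnit unit) {
        try {
            return done.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for warm-up", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("Warm-up did not complete", e);
        }
    }

    // Assumption: class and method names as in PDFBox 2.0.x. Loaded by name so
    // this sketch compiles without PDFBox on the classpath.
    static Runnable pdfboxTask() {
        return () -> {
            try {
                Class<?> mappers = Class.forName("org.apache.pdfbox.pdmodel.font.FontMappers");
                Object mapper = mappers.getMethod("instance").invoke(null);
                Method getProvider = mapper.getClass().getDeclaredMethod("getProvider");
                getProvider.setAccessible(true); // the implementing class is not public
                getProvider.invoke(mapper);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Font cache warm-up failed", e);
            }
        };
    }
}
```

At server startup this would be something like new FontCacheWarmup(FontCacheWarmup.pdfboxTask()), with the indexer calling awaitMillis(10, TimeUnit.MINUTES) before its first parse. The cost is that early indexing requests block for the duration of the cache build instead of racing it.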
We log elapsed time for populating this cache. The last three runs show:
- Ensuring PDFBox on-disk font cache took 182.3 seconds
- Ensuring PDFBox on-disk font cache took 230.6 seconds
- Ensuring PDFBox on-disk font cache took 137.6 seconds
Significant variance, but always multiple minutes. My local (Windows)
laptop takes a fraction of a second to recreate this cache, so our
ability to debug or profile this performance problem is limited. The
problem showed up with 2.0.31, so we don't have timings from previous
PDFBox versions.
I realize there's not a lot of information to go on here, but I'm
curious if anyone else has experienced this with 2.0.31. We're happy
to provide more information from our instances... maybe turning on
additional logging would be helpful? Count and size of fonts on these
instances?
Thanks,
Adam
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org