Hello,

Since PDFBox 3.0.2, we get an OutOfMemoryError when the method
PDPageContentStream.showText(String) is used with a TrueType font
(PDType0Font) to draw text containing at least one space character.

Here is an example of code causing the error:
File fontFile = new File(ttcUrl.getFile());
TrueTypeCollection ttc = new TrueTypeCollection(fontFile);
var regularFont = PDType0Font.load(doc, ttc.getFontByName("Inter-Regular"), true);
// Considering cs is a valid PDPageContentStream object.
cs.beginText();
cs.newLineAtOffset(10, 500);
cs.setFont(regularFont, 12);
// The error is thrown by the showText call on the next line.
cs.showText("This an example of sentence containing spaces");
cs.endText();

Here is the stack trace:
java.lang.OutOfMemoryError: Java heap space
        at java.base/java.lang.StringLatin1.newString(StringLatin1.java:752)
        at java.base/java.lang.String.substring(String.java:2839)
        at java.base/java.lang.String.subSequence(String.java:2872)
        at java.base/java.util.regex.Matcher.getSubSequence(Matcher.java:1819)
        at java.base/java.util.regex.Matcher.group(Matcher.java:691)
        at java.base/java.util.regex.Matcher.group(Matcher.java:645)
        at org.apache.fontbox.ttf.gsub.CompoundCharacterTokenizer.tokenize(CompoundCharacterTokenizer.java:108)
        at org.apache.pdfbox.pdmodel.PDAbstractContentStream.encodeForGsub(PDAbstractContentStream.java:1621)
        at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showTextInternal(PDAbstractContentStream.java:303)
        at org.apache.pdfbox.pdmodel.PDAbstractContentStream.showText(PDAbstractContentStream.java:267)
        at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:37)
[...]

After investigation, this comes from the changes made to resolve
https://issues.apache.org/jira/browse/PDFBOX-5808 in the method
org.apache.fontbox.ttf.gsub.CompoundCharacterTokenizer.tokenize(String),
specifically these lines:
https://github.com/apache/pdfbox/blob/257ae934e676ab8177797f8b265213df3ffbd54c/fontbox/src/main/java/org/apache/fontbox/ttf/gsub/CompoundCharacterTokenizer.java#L113-L118

One possible solution would be to not decrease the value of
lastIndexOfPrevMatch when the CompoundCharacterTokenizer has been
initialized via the constructor CompoundCharacterTokenizer(Pattern pattern).
This would avoid the infinite loop that occurs when the pattern "\s" is used
to instantiate CompoundCharacterTokenizer, as is the case in the method
PDAbstractContentStream.encodeForGsub(). Indeed, under these conditions,
whenever a matched space is followed by a character other than '_', the
check `lastIndexOfPrevMatch < text.length() &&
text.charAt(lastIndexOfPrevMatch) != '_'` holds, so the index is moved back
onto the space that was just matched and the same match is found again on
every iteration. The loop never advances, and the token list grows until the
heap is exhausted.
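
To illustrate the mechanism, here is a minimal sketch that uses only
java.util.regex. It is a simplified model of the loop as described above,
not the PDFBox source: the class name TokenizeLoopSketch, the stepBack flag
and the maxTokens safety cap are hypothetical and exist only so that the
demonstration terminates instead of exhausting the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeLoopSketch {

    // Simplified model of the tokenize loop (assumption, not the real code).
    // stepBack = true mimics decrementing lastIndexOfPrevMatch after a match.
    // maxTokens is a hypothetical cap that turns the endless loop into an exception.
    public static List<String> tokenize(Pattern pattern, String text,
                                        boolean stepBack, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        int lastIndexOfPrevMatch = 0;

        while (matcher.find(lastIndexOfPrevMatch)) {
            if (tokens.size() >= maxTokens) {
                throw new IllegalStateException(
                        "no progress, stuck at index " + matcher.start());
            }
            // Text between the previous match and this one, if any.
            String prevToken = text.substring(lastIndexOfPrevMatch, matcher.start());
            if (!prevToken.isEmpty()) {
                tokens.add(prevToken);
            }
            tokens.add(matcher.group());

            lastIndexOfPrevMatch = matcher.end();
            if (stepBack
                    && lastIndexOfPrevMatch < text.length()
                    && text.charAt(lastIndexOfPrevMatch) != '_') {
                // For a one-character match such as "\s", this moves the index
                // back onto the match itself, so find() returns it again.
                --lastIndexOfPrevMatch;
            }
        }
        // Whatever remains after the last match.
        String tail = text.substring(lastIndexOfPrevMatch);
        if (!tail.isEmpty()) {
            tokens.add(tail);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Pattern whitespace = Pattern.compile("\\s");
        // Without the decrement, the loop advances past each space and ends.
        System.out.println(tokenize(whitespace, "a b", false, 100));
        try {
            // With the decrement, the same space is matched again and again.
            tokenize(whitespace, "a b", true, 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, skipping the decrement for the whitespace pattern is enough to
let the loop finish, which is in line with the suggestion above. Whether the
decrement is still needed for the other constructor is not covered by this
sketch.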

Thanks,

Maxime Wiewiora
