Re-using font objects

2021-08-02 Thread Christopher Schultz

All,

Is it possible to re-use a PDFont object?

We have a situation where we are building many PDFs in a single process 
and encountering OOMEs running out of heap space. We are using a "large" 
(relatively speaking) TrueType font (ARIALUNI.TTF) whose on-disk 
representation is 22 MiB.


We are loading the font into each document like this:

  PDType0Font.load(document, inputStream)

This font is only loaded when text cannot be encoded in one of the 
standard built-in fonts (we are trying to create the smallest PDF 
documents possible for several reasons).


Every OOME stack trace I've seen includes this PDType0Font.load() call, 
so I was thinking that maybe we'd load the font a single time on startup 
and re-use it for every document which needs it, but I don't see any way 
in the API to do such a thing.


Can PDFont objects be re-used, even theoretically? It would be 
really great if we could do something like this:


static PDFont bigFont;
static {
  bigFont = PDType0Font.load(null, inputStream);
}

public void generateDocument() {
  ...

  PDFont localBigFont = document.addFont(bigFont);

  PDPageContentStream content = ...;
  content.setFont(localBigFont);
  ...
}

I'm currently using PDFBox v2.0.8 but upgrading to a later version 
should not represent too much of a problem.


-chris

-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



Examples of word-wrapping

2019-05-17 Thread Christopher Schultz

Hello,

It occurs to me that my code is doing more work than necessary to
print paragraphs, and that maybe manual word-wrapping is not
necessary. Are there any examples that show how to print a paragraph
of text without having to compute the width of text and manually chop
things up into lines? Or is that part of the price of PDF?
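
A minimal sketch of the manual approach described above, written as a
greedy wrapper that does not depend on any PDF library. The class name
GreedyWrap and the widthOf callback are hypothetical; with PDFBox the
callback would presumably compute something like
font.getStringWidth(s) / 1000 * fontSize, as the writeWrappedText code
elsewhere in this archive does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: greedy word wrapping driven by a width callback. */
final class GreedyWrap {
    private GreedyWrap() {}

    /**
     * Splits text on whitespace and packs as many words as fit into each line.
     * widthOf measures a candidate line in the same unit as maxWidth.
     */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        String line = "";
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens when the input is blank
            }
            String candidate = line.isEmpty() ? word : line + " " + word;
            if (!line.isEmpty() && widthOf.applyAsDouble(candidate) > maxWidth) {
                lines.add(line); // current line is full; start a new one
                line = word;
            } else {
                line = candidate; // a single over-long word still gets its own line
            }
        }
        if (!line.isEmpty()) {
            lines.add(line);
        }
        return lines;
    }
}
```

Each returned line would then be written with its own showText call and
a line offset in between. If the width callback can throw a checked
exception (as a font-metrics call may), it would need to be wrapped
before being passed in.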

Thanks,
-chris



Re: Determining why a PDF is large

2019-05-16 Thread Christopher Schultz

All,

A simple tweak to the getFullUnicodeFont method to cache the loaded
font made a huge difference. The resulting file is now only 20% of the
original size when not embedding the same font over and over again.
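
The tweak amounts to loading the font once per document and returning
the same instance afterwards. A minimal sketch, assuming a hypothetical
CachedFont holder (not part of PDFBox); in the real code the loader
would wrap the PDType0Font.load(_doc, in) call from getFullUnicodeFont,
and one holder would be created per document, since the font is loaded
into a specific document.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder that runs an expensive loader on first use and
 * returns the same instance on every later call. Not thread-safe, and it
 * assumes the loader never returns null.
 */
final class CachedFont<F> {
    private final Supplier<F> loader;
    private F font; // stays null until the first call to get()

    CachedFont(Supplier<F> loader) {
        this.loader = loader;
    }

    F get() {
        if (font == null) {
            font = loader.get(); // the only place the loader runs
        }
        return font;
    }
}
```

With this in place, every piece of text that needs the large font shares
one font object, so the document should end up with a single embedded
copy instead of one per call.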

Just so I have things sorted in my own mind: each font used will still
show on each page where it's used, right? In the "smaller" file, I can
still see the font mentioned on more than one page, but it's got the
same "CID" and the same font name ("AAAROV+ArialUnicodeMS" -- no more
"AAA???+ArialUnicodeMS" coming up multiple times with slightly
different names).

Of course, I'm also seeing the Type1 fonts show up repeated on
multiple pages as well -- that's normal, right?

Thanks,
-chris


Re: Determining why a PDF is large

2019-05-16 Thread Christopher Schultz

Tilman,

On 5/16/19 12:17, Tilman Hausherr wrote:
> PDFDebugger.
> 
> Look at the resources. If the same font occurs several times, then
> you did something wrong. It should occur only once in a document.

Okay, it looks like it is indeed showing multiple times. Here's what I
can see in the document:

> Page 1
  > Contents
  > MediaBox
  > Parent
  > Resources (1) [8 0 R]
> Font (12) [15 0 R]
  F1 (6) [19 0 R] /T:Font /S:Type0  (AAAGXI+ArialUnicodeMS)
  F10 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)
  F11 (6) [29 0 R] /T:Font /S:Type0 (AAABJI+ArialUnicodeMS)
(9 more listed: 3 total type 1 fonts, 9 total type 0 fonts including
those above)
The font AAA???I+ArialUnicodeMS shows up for all of the "type 0" entries.

> Page 2
  > [...]
  > Resources
> Font (3)
  F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
  F2 (6) [31 0 R] /T:Font /S:Type0 (AAAYGI+ArialUnicodeMS)
  F3 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)

> Page 3
  > [...]
  > Resources
> Font (2)
  F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
  F2 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)

> Page 4
  > [...]
  > Resources
> Font (2)
  F1 (4) [20 0 R] /T:Font /S:Type1 (Times-Roman)
  F2 (4) [28 0 R] /T:Font /S:Type1 (Times-Italic)

So perhaps I am even using the built-in fonts incorrectly if they are
being mentioned on every page. Or is each page which uses a font
expected to have its own Font entry in the resources?

Does this mean I am "adding" the font too many times somehow?

My code looks like this:

private void writeWrappedText(PDFont font, int fontSize, String text, Color color)
    throws IOException
{
    int paragraphWidth = 500;
    boolean indented = false;

    String strippedText = sanitizeString(text);
    int start = 0;
    int end = 0;
    int wrappedLineCnt = 1;

    if (!isAnsiEncoding(strippedText)) {
        if (logger.isDebugEnabled())
            logger.debug("Text contains non-ansi characters: " + text);

        font = getFullUnicodeFont();
    }

    for (int i : getPossibleWrapPoints(strippedText)) {
        float width = font.getStringWidth(strippedText.substring(start, i)) / 1000 * fontSize;
        if (start < end && width > paragraphWidth) {
            if (wrappedLineCnt == 1)
                setOffsetX(getOffsetXforMargin());
            printSanitizedLine(font, fontSize, strippedText.substring(start, end),
                indented ? _pageIndent : 0, color);
            wrappedLineCnt++;
            start = end;
        }
        end = i;
    }
    if (wrappedLineCnt == 1)
        setOffsetX(getOffsetXforMargin());
    // Last piece of text
    printSanitizedLine(font, fontSize, strippedText.substring(start),
        indented ? _pageIndent : 0, color);
}

The getFullUnicodeFont method is:

private PDFont getFullUnicodeFont() {
    if (null == _doc)
        throw new IllegalStateException("Document has not yet been created; cannot load a new font");

    InputStream in = null;
    try {
        String fullUnicodeFontFile = "/resources/fonts/ARIALUNI.TTF";
        in = getClass().getResourceAsStream(fullUnicodeFontFile);
        if (null == in)
            throw new MissingResourceException("Cannot load font file " + fullUnicodeFontFile,
                this.getClass().getName(), fullUnicodeFontFile);

        PDFont font = PDType0Font.load(_doc, in);

        return font;
    } catch (IOException ioe) {
        throw new RuntimeException("Cannot load font", ioe);
    }
}

Re-reading that code, it's obvious that I should be storing the font
once loaded and re-using it. I'm guessing that
PDType0Font.load(PDDocument,InputStream) doesn't recognize that the
font has already been loaded and just adds it a second (or third,
etc.) time. Can anyone confirm that?

I know that my code isn't the best in terms of only choosing to render
certain glyphs in this "full" font. I am working to improve that, and
I know there is example code for choosing the "best" font for each
character in a string, which I'll be reviewing separately.

Thanks,
-chris

Determining why a PDF is large

2019-05-16 Thread Christopher Schultz

All,

We have a process that generates PDF documents usually using the
default Type-1 built-in fonts, so the documents do not embed the font
information.

We recently added the ability for the documents to include font
information if certain glyphs were not available in the default
font(s) and, as expected, the file sizes end up being bigger when that
happens.

What is the best tool to look at a particular document to see why it
ended up being so large? I'm not sure I can visually tell by looking
at the document which character triggered the inclusion of the font,
and then why that font was used for what I can only assume was a lot
of text. By inspecting the file, I'm sure I can improve my code so
that we have fewer uses of this additional font and therefore keep the
file sizes to a minimum.

Thanks,
-chris



Re: Choosing a font for non-ASCII characters

2019-03-20 Thread Christopher Schultz

Tilman,

On 3/20/19 03:55, Tilman Hausherr wrote:
> On 19.03.2019 at 22:08, Christopher Schultz wrote: Tilman,
> 
> On 3/19/19 16:23, Tilman Hausherr wrote:
>>>> On 19.03.2019 at 19:45, Christopher Schultz wrote: Tilman,
>>>> 
>>>> So I'm starting to look toward making my code better now that
>>>> it's actually working. Right now, my code looks like this:
>>>> 
>>>> if(!isAnsiEncoding(strippedText)) {
>>>>     font = getFullUnicodeFont();
>>>> }
>>>> 
>>>> Where one font simply replaces the other for strings that
>>>> aren't available in the built-in font(s).
>>>> 
>>>> I'd like to support emoji and stuff like that. I can find a
>>>> font (or fonts) for that, but I think the only way I can do
>>>> that with the existing API is something like this:
>>>> 
>>>> PDFont[] fonts = new PDFont[] { builtIn, arialUnicode, emoji };
>>>> 
>>>> for (PDFont font : fonts) {
>>>>     try {
>>>>         page.setFont(font, fontSize);
>>>>         page.showText(text);
>>>>         break; // The whole string was encodable in this font
>>>>     } catch (IllegalArgumentException iae) {
>>>>         // Try the next font
>>>>     }
>>>> }
>>>> 
>>>> That will "work" but it will not work if, for example, I need
>>>> to print text that includes both Chinese characters (from
>>>> arialUnicode font) and also emoji (from the hypothetical
>>>> "emoji" font).
>>>> 
>>>> Is there any way to tell PDFBox to "pick the right font (from
>>>> some list) for each character"?
>>>> 
>>>> 
>>>>> No, that is why I created the EmbeddedMultipleFonts.java
>>>>> example which I mentioned earlier in the thread. That one
>>>>> can switch within strings.
> Right, it basically does the same thing as I have above, but for a 
> bunch of increasingly-widening substrings, and it uses exceptions
> for flow control. Yuck.
> 
> I'd have to look more into what PDFont.encode does, but I'm
> guessing that it wouldn't be too hard to build methods into the
> PDFont class that look something like this:
> 
> /**
>  * Returns true if this PDFont can render the whole string.
>  */
> public boolean canEncode(String s);
> 
> /**
>  * Returns the longest String that can be successfully encoded by this
>  * PDFont, beginning at the beginning of {s}. If the whole String {s}
>  * is encodable, then {s} will be returned. If only a part of {s}
>  * is encodable, then the return value of this method will be such that:
>  *
>  *   s.startsWith(getLongestEncodablePrefix(s)) == true
>  *
>  * If the first character of the string is not encodable in this PDFont,
>  * an empty string (or null?) will be returned.
>  */
> public String getLongestEncodablePrefix(String s);
> 
> 
>> That would just push what you called "Yuck" further downwards, or
>> we would have to maintain code twice, one for checking whether
>> something can be encoded, and one for actually doing it. And this
>> for all the 6, maybe 7 font types.

Code reuse?

>> Instead of going forward with your project with the working code 
>> provided, you're arguing about design issues.

You are operating under the impression that I haven't already modified
my own code to work. I have.

I'm volunteering to help improve your product. You don't have to get
so upset when someone offers help.

-chris



Re: Choosing a font for non-ASCII characters

2019-03-19 Thread Christopher Schultz

Tilman,

On 3/19/19 16:23, Tilman Hausherr wrote:
> On 19.03.2019 at 19:45, Christopher Schultz wrote: Tilman,
> 
> So I'm starting to look toward making my code better now that it's 
> actually working. Right now, my code looks like this:
> 
> if(!isAnsiEncoding(strippedText)) {
>     font = getFullUnicodeFont();
> }
> 
> Where one font simply replaces the other for strings that aren't 
> available in the built-in font(s).
> 
> I'd like to support emoji and stuff like that. I can find a font
> (or fonts) for that, but I think the only way I can do that with
> the existing API is something like this:
> 
> PDFont[] fonts = new PDFont[] { builtIn, arialUnicode, emoji };
> 
> for (PDFont font : fonts) {
>     try {
>         page.setFont(font, fontSize);
>         page.showText(text);
>         break; // The whole string was encodable in this font
>     } catch (IllegalArgumentException iae) {
>         // Try the next font
>     }
> }
> 
> That will "work" but it will not work if, for example, I need to
> print text that includes both Chinese characters (from arialUnicode
> font) and also emoji (from the hypothetical "emoji" font).
> 
> Is there any way to tell PDFBox to "pick the right font (from some 
> list) for each character"?
> 
> 
>> No, that is why I created the EmbeddedMultipleFonts.java example
>> which I mentioned earlier in the thread. That one can switch
>> within strings.

Right, it basically does the same thing as I have above, but for a
bunch of increasingly-widening substrings, and it uses exceptions for
flow control. Yuck.

I'd have to look more into what PDFont.encode does, but I'm guessing
that it wouldn't be too hard to build methods into the PDFont class
that look something like this:

/**
 * Returns true if this PDFont can render the whole string.
 */
public boolean canEncode(String s);

/**
 * Returns the longest String that can be successfully encoded by this
 * PDFont, beginning at the beginning of {s}. If the whole String {s}
 * is encodable, then {s} will be returned. If only a part of {s}
 * is encodable, then the return value of this method will be such that:
 *
 *   s.startsWith(getLongestEncodablePrefix(s)) == true
 *
 *
 * If the first character of the string is not encodable in this PDFont,
 * an empty string (or null?) will be returned.
 */
public String getLongestEncodablePrefix(String s);

WDYT?

If this must be implemented initially by using exceptions for
flow-control, so be it. But theoretically, it can be improved in the
future by performing faster checks... possibly by each type of PDFont
subclass in a different way.
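
A minimal sketch of how a per-character check could drive font selection
without exceptions. Everything here is hypothetical: FontRuns, Run, and
the IntPredicate standing in for a per-font "can this code point be
encoded" test are illustrative names, not PDFBox API. Fonts are tried in
list order, and adjacent characters that resolve to the same font are
grouped into one run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: splits text into runs that share one font. */
final class FontRuns {
    private FontRuns() {}

    /** A stretch of text plus the index of the font chosen for it (-1 = none). */
    static final class Run {
        final int fontIndex;
        final String text;

        Run(int fontIndex, String text) {
            this.fontIndex = fontIndex;
            this.text = text;
        }
    }

    /** Each predicate answers "can font i encode this code point?". */
    static List<Run> split(String text, List<IntPredicate> fonts) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentFont = -1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // handles surrogate pairs such as emoji
            int font = pick(cp, fonts);
            if (font != currentFont && current.length() > 0) {
                runs.add(new Run(currentFont, current.toString()));
                current.setLength(0);
            }
            currentFont = font;
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(new Run(currentFont, current.toString()));
        }
        return runs;
    }

    /** Returns the index of the first font that accepts cp, or -1. */
    private static int pick(int cp, List<IntPredicate> fonts) {
        for (int i = 0; i < fonts.size(); i++) {
            if (fonts.get(i).test(cp)) {
                return i;
            }
        }
        return -1;
    }
}
```

A caller would then set the font and show the text once per run, and
decide separately what to do with runs whose index is -1.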

-chris



Re: Choosing a font for non-ASCII characters

2019-03-19 Thread Christopher Schultz

Tilman,

So I'm starting to look toward making my code better now that it's
actually working. Right now, my code looks like this:

if(!isAnsiEncoding(strippedText)) {
font = getFullUnicodeFont();
}

Where one font simply replaces the other for strings that aren't
available in the built-in font(s).

I'd like to support emoji and stuff like that. I can find a font (or
fonts) for that, but I think the only way I can do that with the
existing API is something like this:

PDFont[] fonts = new PDFont[] { builtIn, arialUnicode, emoji };

for (PDFont font : fonts) {
    try {
        page.setFont(font, fontSize);
        page.showText(text);
        break; // The whole string was encodable in this font
    } catch (IllegalArgumentException iae) {
        // Try the next font
    }
}

That will "work" but it will not work if, for example, I need to print
text that includes both Chinese characters (from arialUnicode font)
and also emoji (from the hypothetical "emoji" font).

Is there any way to tell PDFBox to "pick the right font (from some
list) for each character"?

-chris



Re: Choosing a font for non-ASCII characters

2019-03-04 Thread Christopher Schultz

Tilman,

On 3/3/19 08:48, Tilman Hausherr wrote:
>> I have no idea. The information about PDFBox seems to be mostly
>> in example programs and not web-based documentation. Searching
>> e.g. Google for "how to use FontBox with PDFBox" generally comes
>> up with references into the Javadoc for "uses of FontBox
>> interface".
>> 
>> The Javadoc does not describe what FontBox is and none of the
>> classes or subclasses in those related packages really have any
>> documentation worth reading. Each class "foo" is described as
>> "being a foo" and each "getBar" method is described as "gets the
>> bar for the foo".
>> 
>> So... discoverability of features is pretty much nil here.
>> 
>> I'm quite happy with the responses I get on this mailing list,
>> but it's nearly impossible to discover on my own what is
>> possible, here. I shouldn't have to get you guys to tell me how
>> to use the software... you have better things to do (like
>> continue to write great software).
>> 
>> Is there a good example of using FontBox with PDFBox in order to 
>> subset a font?
> 
> Yes, the EmbeddedFonts.java example.

I don't see any use of FontBox in the EmbeddedFonts.java example. Am I
missing something?

> We are a small team and don't have the time to write tutorials.
> There are many working examples and also many answers on
> stackoverflow.

Understood.

> You don't need fontbox unless for advanced things, e.g. reuse a
> font for several files. For normal use cases, fontbox remains under
> the hood.
> 
> If you think some class documentation is useless, name it, and I'll
> see if it can be improved.

:)

It's less a matter of useless documentation and more a lack of
documentation. I can file some tickets if you think it would
be helpful. I also don't mind writing documentation and/or tutorials
for the project.

> The subset thing is done by PDFBox without you having to bother
> about it. It's "not subsetting" that would require more parameters.
> So you need only this:
> 
> PDType0Font font = PDType0Font.load(document,
>     new File("c:/windows/fonts/arial.ttf"));
> stream.setFont(font, 12);
> stream.showText("...");

Okay, that's exactly what we are doing (well... we are loading the
font via the ClassLoader, but ...). And it's working. I was just a
little worried about the ballooning file size. I realize there is
little to be done about that at this stage.

At this point, I am basically doing this:

[ When adding text to the document ]
- If the text contains anything outside of the ANSI encoding
  - then replace the usual (default) font with ARIALUNI.TTF
It operates on a per-text-string basis, so it should only change the
font for a single piece of text that requires it. I'm starting to
think that I should not bother scanning the text and instead use the
IllegalArgumentException as flow-control -- which I still don't like.
But it means that my code will not spend a ton of time repeating
checks that PDFBox will end up doing, anyway.

I'm a little worried about what I will do the next time I have an
issue like this -- where the ARIALUNI.TTF font doesn't include some
character that I need... since there's no way to probe a font for
support for a code point, I can't map code-points to fonts in a
scalable way. It will just be trial-and-error which is no fun.

It also means that I need to have some kind of set of fonts that we
just round-robin through, hoping we get a hit and we can continue...
otherwise we just have to fail (like we do now).

-chris



Re: Choosing a font for non-ASCII characters

2019-03-03 Thread Christopher Schultz

Tilman,

On 3/2/19 10:00, Tilman Hausherr wrote:
> On 02.03.2019 at 15:54, Christopher Schultz wrote:
>> Is there a good way to probe text to determine whether or not an
>> alternate font will be necessary and only load/bundle it then?
> 
> From the new EmbeddedMultipleFonts.java example (in the source
> code download):
> 
> 
> boolean isWinAnsiEncoding(int unicode) {
>     String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>     if (".notdef".equals(name)) {
>         return false;
>     }
>     return WinAnsiEncoding.INSTANCE.contains(name);
> }
> 
> 
> When that one returns true, you can use the built-in fonts.

Okay, I see that. Is there any reason not to do this?

boolean isWinAnsiEncoding(int unicode)
{
    return WinAnsiEncoding.INSTANCE.contains(unicode);
}

?

Is there nothing like PDFont.isSupportedCodePoint(unicode) available?
I didn't see anything. It looks more like the standard way to check is
to:

try {
  page.showText(text);
} catch (IllegalArgumentException iae) {
  page.setFont(alternateFont, fontSize);
  page.showText(text);
}

If that's SOP, then maybe there is no real reason to bother checking
whether the String will work in the first place... just try it and try
again if the operation fails?

Catching IllegalArgumentException seems ugly, though. Maybe PDFBox
could subclass IllegalArgumentException with something more narrow
like IllegalCodePointException and throw that instead? It would be
backward-compatible and also one could determine the root cause
without parsing the exception message to see what the problem was.
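
A minimal sketch of what such a subclass could look like. This is a
proposal only: IllegalCodePointException does not exist in PDFBox, and
the constructor arguments and message format are assumptions made for
illustration.

```java
/**
 * Hypothetical exception for "this font cannot encode this code point".
 * It extends IllegalArgumentException, so existing catch blocks keep working.
 */
final class IllegalCodePointException extends IllegalArgumentException {
    private static final long serialVersionUID = 1L;

    private final int codePoint;

    IllegalCodePointException(int codePoint, String fontName) {
        super(String.format("U+%04X is not available in font %s", codePoint, fontName));
        this.codePoint = codePoint;
    }

    /** The offending code point, so callers need not parse the message. */
    int getCodePoint() {
        return codePoint;
    }
}
```

Callers that only know about IllegalArgumentException would still catch
it, while newer callers could catch the narrower type and read the code
point directly.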

I'm happy to provide a patch.

-chris



Re: Choosing a font for non-ASCII characters

2019-03-03 Thread Christopher Schultz

John,

On 3/2/19 10:16, John Logan wrote:
> Christopher, is the font that you don’t want to embed a Type 1
> font, or a TrueType font?

I'm using PDType0Font.load to load the font, but the file is TT:

ARIALUNI.TTF: TrueType Font data, digitally signed, 20 tables, 1st
"DSIG", name offset 0x161c2f0

> If the latter, could you use Fontbox to subset the font and keep
> the file size small?

I have no idea. The information about PDFBox seems to be mostly in
example programs and not web-based documentation. Searching e.g.
Google for "how to use FontBox with PDFBox" generally comes up with
references into the Javadoc for "uses of FontBox interface".

The Javadoc does not describe what FontBox is and none of the classes
or subclasses in those related packages really have any documentation
worth reading. Each class "foo" is described as "being a foo" and each
"getBar" method is described as "gets the bar for the foo".

So... discoverability of features is pretty much nil here.

I'm quite happy with the responses I get on this mailing list, but
it's nearly impossible to discover on my own what is possible, here.
I shouldn't have to get you guys to tell me how to use the software...
you have better things to do (like continue to write great software).

Is there a good example of using FontBox with PDFBox in order to
subset a font?

- -chris

>> On Mar 2, 2019, at 7:00 AM, Tilman Hausherr
>>  wrote:
>> 
>> Am 02.03.2019 um 15:54 schrieb Christopher Schultz:
>>> Is there a good way to probe text to determine whether or not
>>> an alternate font will be necessary and only load/bundle it
>>> then?
>> 
>> From the new EmbeddedMultipleFonts.java example (in the source
>> code download):
>> 
>> 
>> boolean isWinAnsiEncoding(int unicode)
>> {
>>     String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
>>     if (".notdef".equals(name))
>>     {
>>         return false;
>>     }
>>     return WinAnsiEncoding.INSTANCE.contains(name);
>> }
>> 
>> 
>> When that one returns true, you can use the built-in fonts.
>> 
>> Tilman



Re: Choosing a font for non-ASCII characters

2019-03-02 Thread Christopher Schultz

Andreas,

On 1/31/19 01:27, Andreas Lehmkuehler wrote:
> the standard PDF fonts (PDType1Font.HELVETICA et al.) don't
> support anything other than (limited) Latin-1. You have to use
> something else.
> 
> Have a look at the HelloWorldTTF example [1]. It shows how to embed
> a true type font. You have to choose a suitable font from your OS
> or something like the noto fonts from google.
> 
> W.r.t. font embedding. It's always a good idea to embed all
> resources which are needed to render a pdf. PDFBox reduces the
> amount of space as it limits the embedded font to the used
> characters.

Thanks for the pointers. I'm finally getting around to doing something
about this. I used "Arial Unicode" as referenced in a quickie online
tutorial[1] and what I'm finding is that:

1. The Chinese characters render correctly (yay!)
2. My English-only file has gone from ~1k to ~18k

This test file was the simplest I could muster so it's really an
unfair comparison at this point.

But it's clear that the file will get bigger (of course) by adding the
font.

I'd like to avoid bundling the font unless it's necessary. For several
months, we've been able to get away with the standard PDF default
fonts (which, presumably, the PDF spec requires all clients to provide
which is why the files can be so small). Is there a good way to probe
text to determine whether or not an alternate font will be necessary
and only load/bundle it then?

Thanks,
- -chris

[1] http://www.kscodes.com/java/write-chinese-pdf-using-apache-pdfbox/

> Am 31.01.19 um 02:56 schrieb Christopher Schultz: Hello,
> 
> We are using PDFBox to generate PDFs in a very simple way and only 
> including fonts available from the PDType1Font class (e.g. 
> PDType1Font.HELVETICA). The PDFs we are generating are really only 
> including a few title/subtitles, text, and bulleted/numbered
> lists.
> 
> Everything is fine when we use what is probably in the standard
> Latin alphabet, and we've had some troubles with special characters
> that don't fit in there such as ≥ and ≤. We've dealt with that by
> simply replacing "≤" with "<=" and so on, but we're starting to use
> languages that don't use Latin script and so we can no longer
> replace our way out of the problem.
> 
> For example, I need to be able to put Chinese characters into a PDF
> we generate. So let's take the text "中國" which is just the word
> "China" in Traditional Chinese script.
> 
> First, how can I find out that the character isn't going to fit
> into the font that I'm currently using? Should I do it for every
> character we try to put into the page, or should we just catch
> exceptions when we try to write the text to the page and then scan
> at that point? I'm trying to avoid writing hideously inefficient
> code to handle these situations.
> 
> Second, once I know that I need to choose another font... how do I 
> know which font to choose? Should I keep a mapping of Unicode code 
> point ranges and the best fonts to use for them?
> 
> Finally, what fonts are actually available to PDFBox? How do I add
> new ones? I have a lot of control over the environment and I get to
> see failing conversions and intervene, so some trial and error is
> okay for each new situation.
> 
> The recipients of our PDFs are file-size sensitive, so I'd only
> want to include (bundle) a font in a PDF if it was absolutely
> necessary to include the font itself. If we can get away with
> including a *reference* to the font in the PDF and telling these
> recipients "sorry, if you want to read the Chinese PDFs we send,
> you'd better make sure you have font X installed" then that's okay
> with me, too.
> 
> What suggestions do people have for doing all of the above?
> 
> Thanks, -chris

Choosing a font for non-ASCII characters

2019-01-30 Thread Christopher Schultz

Hello,

We are using PDFBox to generate PDFs in a very simple way and only
including fonts available from the PDType1Font class (e.g.
PDType1Font.HELVETICA). The PDFs we are generating are really only
including a few title/subtitles, text, and bulleted/numbered lists.

Everything is fine when we use what is probably in the standard Latin
alphabet, and we've had some troubles with special characters that
don't fit in there such as ≥ and ≤. We've dealt with that by simply
replacing "≤" with "<=" and so on, but we're starting to use languages
that don't use Latin script and so we can no longer replace our way
out of the problem.

For example, I need to be able to put Chinese characters into a PDF we
generate. So let's take the text "中國" which is just the word "China"
in Traditional Chinese script.

First, how can I find out that the character isn't going to fit into
the font that I'm currently using? Should I do it for every character
we try to put into the page, or should we just catch exceptions when
we try to write the text to the page and then scan at that point? I'm
trying to avoid writing hideously inefficient code to handle these
situations.
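
For illustration, a rough way to answer this first question is to scan the string once and check it against a single-byte encoding, rather than waiting for an exception. The sketch below uses only the JDK and assumes that the windows-1252 charset is a close enough stand-in for the encoding used by the standard fonts; that is an approximation, not the exact PDF encoding, and the class and method names are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiProbe {
    // Assumption: windows-1252 approximates the encoding of the standard PDF fonts.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the whole text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        // Encoders are not thread-safe, so create a fresh one per call.
        CharsetEncoder encoder = CP1252.newEncoder();
        return encoder.canEncode(text);
    }

    /** Returns the char index of the first code point that does not fit, or -1. */
    static int firstUnsupported(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (!encoder.canEncode(text.subSequence(i, i + len))) {
                return i;
            }
            i += len;
        }
        return -1;
    }
}
```

A check like this costs one pass over the text, so it could run on every string before choosing between a built-in font and a larger embedded one.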

Second, once I know that I need to choose another font... how do I
know which font to choose? Should I keep a mapping of Unicode code
point ranges and the best fonts to use for them?
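
To make the second question concrete, here is a minimal sketch of such a mapping, keyed by Unicode script rather than by raw code point ranges. It relies on the JDK's Character.UnicodeScript; the font file names are hypothetical placeholders, and picking the font from the first non-Latin code point is a simplifying assumption.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;

final class FontChooser {
    // Hypothetical font files; substitute whatever fonts are actually available.
    private static final Map<UnicodeScript, String> FONT_BY_SCRIPT =
            new EnumMap<>(UnicodeScript.class);
    static {
        FONT_BY_SCRIPT.put(UnicodeScript.HAN, "fonts/cjk.ttf");
        FONT_BY_SCRIPT.put(UnicodeScript.CYRILLIC, "fonts/cyrillic.ttf");
        FONT_BY_SCRIPT.put(UnicodeScript.ARABIC, "fonts/arabic.ttf");
    }

    static final String DEFAULT_FONT = "Helvetica";

    /** Returns the font mapped to the first code point whose script has an entry. */
    static String fontFor(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String font = FONT_BY_SCRIPT.get(UnicodeScript.of(cp));
            if (font != null) {
                return font;
            }
            i += Character.charCount(cp);
        }
        return DEFAULT_FONT; // nothing special found, keep the built-in font
    }
}
```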

Finally, what fonts are actually available to PDFBox? How do I add new
ones? I have a lot of control over the environment and I get to see
failing conversions and intervene, so some trial and error is okay for
each new situation.

The recipients of our PDFs are file-size sensitive, so I'd only want
to include (bundle) a font in a PDF if it was absolutely necessary to
include the font itself. If we can get away with including a
*reference* to the font in the PDF and telling these recipients
"sorry, if you want to read the Chinese PDFs we send, you'd better
make sure you have font X installed" then that's okay with me, too.

What suggestions do people have for doing all of the above?

Thanks,
- -chris



Re: PDFBox JPEG2000 and Tomcat - Revisited

2018-08-01 Thread Christopher Schultz
Joel,

On 8/1/18 3:54 PM, Joel Hirsh wrote:
> However, your comment that "Tomcat doesn't unload libraries" is not
> entirely true. There is an explanation of such symptoms at
> https://haraldk.github.io/TwelveMonkeys/, under the section "Deploying the
> plugins in a web app".

Thanks for posting that. Lemmie 'splain.

In a servlet container, web applications (aka "contexts") each have
their own ClassLoader and if the application is undeployed or
re-deployed, then that ClassLoader will be freed for GC. If a class
loaded by that webapp's ClassLoader is still in use, you have a memory
leak situation known as a "pinned ClassLoader".

It happens all the time when you use a system-global cache such as
ImageIO does. This can happen with JDBC DriverManager and a handful of
other common APIs and it can be a serious problem for servers where
there are lots of redeployments. You can also start to see errors like
"ClassCastException: Cannot cast foo.bar.Class to foo.bar.Class" because
instanceof-ness is defined not just by Class but also by ClassLoader.

The IIOProviderContextListener mentioned in that documentation is one
strategy to handle the unfortunate way ImageIO caches classes. Another
strategy is to put the library into a place where it is accessible to
all web applications instead of just one.

> Just putting jars in a shared folder did not help.

Here is where asking on the Tomcat mailing list *will* be a good idea.
Make sure to give your exact Tomcat version and explain where you put
the library.

Can you post the *entire* stack trace from the failure? It looks like
it's incomplete -- it doesn't seem to have a "true root cause". It just
says "Could not initialize class" because "Could not initialize class".
Somewhere there should be a real root cause like NullPointerException or
something that can actually be debugged. Def post that to the Tomcat
mailing list as well if you post.

-chris





Re: PDFBox JPEG2000 and Tomcat - Revisited

2018-08-01 Thread Christopher Schultz
Joel,

On 8/1/18 1:38 PM, Tilman Hausherr wrote:
> Am 01.08.2018 um 17:42 schrieb Joel Hirsh:
>> And what appears to be the same error is back.  Running one JPEG2000
>> image
>> is fine, but at some point I get the error
>>
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.pdfbox.jbig2.JBIG2ImageReader
>>    at
> 
> 
> JBIG2ImageReader is for JBIG2 images, not for JPEG2000 images.
> 
> I suggest you ask the same on the tomcat mailing list, maybe they can
> help... sadly I don't know more than last time.

... and before you do, just know that Tomcat doesn't unload libraries...

-chris





Re: How can a java class load a static pdf file in WebLogic 12c?

2018-04-19 Thread Christopher Schultz

Shawn,

On 4/19/18 1:05 PM, shawn.oplin...@gmail.com wrote:
> 
> 
> On 2018/04/19 14:41:18, Christopher Schultz
> <ch...@christopherschultz.net> wrote: Fabio,
> 
> On 4/19/18 10:26 AM, Fabio Salvi wrote:
>>>> Hallo Shawn
>>>> 
>>>> I use something like this:
>>>> 
>>>> InputStream resourceAsStream =
>>>> getClass().getClassLoader().getResourceAsStream("/META-INF/pdfforms/"
>>>> + aPDFFormularName);
>>>> 
>>>> PDDocument pdfDocument =
>>>> PDDocument.load(resourceAsStream);
>>>> 
>>>> This inside an EJB but I believe it will work for a WAR as
>>>> well
> 
> Yes, it will work inside of anything where the ClassLoader can get
> to the file. Sometimes it makes more sense to use the servlet's 
> getResource() method because it may have a wider selection of
> storage locations to consult.
> 
> A few things:
> 
> 1. Make sure to close the InputStream in a finally block. Resource 
> leaks are serious business on servers like this.
> 
> 2. Watch out for your memory settings. You may end up loading a
> *huge* PDF into memory without realizing it.
> 
> 3. Remember that ClassLoader.getResourceAsStream can return null.
> 
> -chris
> 
>>>> 2018-04-19 15:13 GMT+02:00 shawn.oplin...@gmail.com < 
>>>> shawn.oplin...@gmail.com>:
>>>> 
>>>>> 
>>>>> I need to load a static PDF document, from a java class,
>>>>> running in my J2EE web app on WebLogic 12c; however
>>>>> although my code works in Tomcat, when trying to run it in
>>>>> WebLogic 12c (WebLogic Server Version: 12.2.1.2.0), I get a
>>>>> server error that the PDF file cannot be found (
>>>>> java.io.FileNotFoundException).
>>>>> 
>>>>> I am using Apache's PDF library, PDFBox version 2.0.8 to
>>>>> load a fillable PDF file that I created, and then populate
>>>>> that fillable PDF with data.  My code works fine in Tomcat,
>>>>> but fails to find the pdf file when deployed to WebLogic
>>>>> 12c .
>>>>> 
>>>>> -This appears to be because when an EAR file is deployed
>>>>> to WebLogic 12c, the contents in the WAR file (all of the 
>>>>> application code/files, including the fillable PDF file),
>>>>> remain archived up in a jar file that WebLogic creates,
>>>>> instead of exploded.
>>>>> 
>>>>> My application utilizes the standard Maven application
>>>>> structure, so as is standard with all static files, I have
>>>>> put my PDF file in the directory for static resources:
>>>>> src/main/resources/
>>>>> 
>>>>> In my pom.xml file, I have the following, which builds any
>>>>> pdf files in the /src/main/resources/ folder, into the
>>>>> class path root folder of the WAR file.  
>>>>> ${basedir}/src/main/resources/
>>>>>  **/*.xml
>>>>> **/*.properties 
>>>>> **/*.pdf  
>>>>> 
>>>>> When I build the WAR and EAR file, the pdf file does indeed
>>>>> get copied into the root folder of the application's class
>>>>> files.
>>>>> 
>>>>> The following 3 lines of code, work to load the PDF, when
>>>>> my application's EAR file is deployed in Tomcat, but do not
>>>>> in WebLogic 12c (WebLogic Server Version: 12.2.1.2.0).
>>>>> 
>>>>> // this classLoader works for Tomcat, but not in WebLogic 12c
>>>>> ClassLoader classLoader = getClass().getClassLoader();
>>>>> File file = new File(classLoader.getResource("myPdfFile.pdf").getFile());
>>>>> PDDocument document = PDDocument.load(file);
>>>>> 
>>>>> WebLogic 12c produces the following error:
>>>>> 
>>>>>    
>>>>>   <[ACTIVE] ExecuteThread:
>>>>> '6' for queue: 'weblogic.kernel.Default (self-tuning)'>
>>>>> <> <>
>>>>> <3902331f-a214-42fe-a6a1-35b3531e4b56-00a9> 
>>>>> <1523985710798> <[severity-value: 8] [rid: 0]
>>>>> [partition-id: 0] [partition-name: DOMAIN] >  
>>>>> <[ServletContext@1661988196[app:mmhsrp-ear-4.9.0.1-3 
>>>>> module:/mmhsrp path:null spec-version
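
The three points quoted earlier in this message (close the stream, watch memory, expect null) can be sketched with plain JDK calls. This is only an illustration under assumptions: the class and method names are made up, and it reads the resource as a stream instead of going through a File, which is the approach that keeps working when the resource stays packed inside an archive.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

final class ResourceLoader {
    /** Reads a classpath resource fully into memory; throws if it does not exist. */
    static byte[] readResource(ClassLoader loader, String name) throws IOException {
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = loader.getResourceAsStream(name)) {
            if (in == null) {
                // getResourceAsStream reports "not found" by returning null.
                throw new FileNotFoundException("Resource not found on classpath: " + name);
            }
            // Java 9+; the whole resource ends up on the heap, so mind large files.
            return in.readAllBytes();
        }
    }
}
```

The resulting bytes (or the stream itself) could then be handed to whatever loads the PDF, without ever needing a file system path.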

Re: OutOfMemoryError in PDExtendedGraphicsState#getLineDashPattern

2018-03-20 Thread Christopher Schultz

Andreas,

On 3/20/18 5:35 PM, Andreas Hubold wrote:
> I'm getting an OutOfMemoryError from PDFBox when parsing a certain
> PDF using the Apache Tika App v 1.17 - which uses PDFBox 2.0.8
> internally. This is reproducible even with 8GB heap.
> 
> The OutOfMemoryError happens in
> org.apache.pdfbox.pdmodel.graphics.state.PDExtendedGraphicsState#getLineDashPattern,
> which contains this piece of suspicious code:
> 
> COSArray dp = (COSArray) dict.getDictionaryObject( COSName.D );
> if( dp != null )
> {
>     COSArray array = new COSArray();
>     dp.addAll(dp);
> 
> The last line seems to be wrong?

That certainly looks wrong to me.

> It appends all elements from 'dp' to 'dp' again, effectively 
> duplicating the elements in the list. Maybe it should be 
> 'array.addAll(dp)' or something like that?
> 
> Can you confirm this being a bug? Should I open a JIRA ticket for
> this problem?
> 
> Do you know a workaround to avoid the crash, e.g. an option to skip
> some parts of the file for text extraction?
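
To show why a self-append inside a getter would exhaust the heap, here is an analogy using java.util.ArrayList instead of COSArray. It is not PDFBox code; the class and method names are invented, and the "fixed" variant only reflects the guess quoted above about what was intended.

```java
import java.util.ArrayList;
import java.util.List;

final class DashPatternHolder {
    private final List<Integer> dp = new ArrayList<>(List.of(3, 2));

    /** Mirrors the reported pattern: the stored list is appended to itself. */
    List<Integer> getBuggy() {
        List<Integer> array = new ArrayList<>();
        dp.addAll(dp); // doubles dp on every call; array stays empty
        return array;
    }

    /** Presumably intended: copy the stored elements into the new list. */
    List<Integer> getFixed() {
        List<Integer> array = new ArrayList<>();
        array.addAll(dp);
        return array;
    }

    int storedSize() {
        return dp.size();
    }
}
```

Because the stored list doubles on each call, a few dozen calls on the same object are enough to run out of memory regardless of heap size.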

- -chris



Re: Strategy for dealing with non-latin characters in base font

2018-03-12 Thread Christopher Schultz

Tilman,

On 3/10/18 6:26 AM, Tilman Hausherr wrote:
> Am 09.03.2018 um 21:57 schrieb Tilman Hausherr:
>> You can subset fonts by using PDType0Font.load(), this limits
>> memory usage.
> 
> I meant "file size".
> 
> Also important: try to use each font only once per file. You can
> reuse fonts and images within a file. See also
> 
> https://stackoverflow.com/questions/48377121/pdfbox-generates-pdf-a-file-of-very-large-size

Okay, we don't have any calls to any font's load() method... we are
simply using PDType1Font.CONSTANTS for our various fonts (HELVETICA_*
and TIMES_*). Will that give roughly the same behavior? We don't seem
to be having any problems with file size (yet).

I'm not even really sure what our options are for fonts; our usage of
PDFBox is quite simplistic. Let's say I want to be able to render a
character such as ® or ≤ in a PDF. How do I go about discovering which
(available) font contains the proper mapping, and then ensure that
font's glyphs are included in the PDF? Is it better to pick a font for
the whole document that includes that glyph, or is it better to change
the font for a single character, just to get the right display?

- -chris



Strategy for dealing with non-latin characters in base font

2018-03-09 Thread Christopher Schultz
All,

Like many other folks, we often run across the "[character] is not
available in this font [font]" when we try to use some character such as
≥ or ≤ or ® or whatever.

(We are trying to keep the generated PDF files as small as possible, so
we'd prefer to use the simplest fonts possible.)

For most of these characters, we have simply created a mapping in our
PDF-generation wrapper code that does roughly this:

  string = string.replace("≤", "<=");
  string = string.replace("®", "(R)");
  etc.

(It's smarter and more efficient than that, but you get the idea.)
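
For illustration, a single-pass version of that mapping might look like the sketch below. The class name, method name and table contents are made up for this example; it is not our actual wrapper code.

```java
import java.util.HashMap;
import java.util.Map;

final class SymbolFallback {
    // Code point -> ASCII fallback; extend the table as new characters turn up.
    private static final Map<Integer, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put(0x2264, "<=");  // less-than-or-equal sign
        REPLACEMENTS.put(0x2265, ">=");  // greater-than-or-equal sign
        REPLACEMENTS.put(0x00AE, "(R)"); // registered sign
    }

    /** Walks the input once and swaps in the fallback for each mapped code point. */
    static String apply(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String replacement = REPLACEMENTS.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

One pass over the string avoids creating an intermediate String per replacement rule, which matters once the table grows.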

We are starting to find more and more characters that aren't quite as
decomposable as those shown above, such as ç (that's the c-with-cedilla you
see in the word garçon). We have resorted to replacing them with just a
plain-old 'c' for the time being.
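
As a stopgap, that kind of base-letter replacement does not need a hand-written table. A minimal sketch using the JDK's java.text.Normalizer: decompose each character into base letter plus combining marks (NFD), then drop the marks. The class and method names are hypothetical, and letters that have no decomposition are left untouched.

```java
import java.text.Normalizer;

final class AccentStripper {
    /** Decomposes accented letters and removes the combining marks. */
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches combining marks such as the cedilla split off by NFD.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```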

But I'm wondering if there is a better way to handle these. I'm
perfectly happy to replace ® with (R) for the most part, but when we
have these characters with no direct replacements (and our product
includes support for emoji, etc.), is there a way to add a single glyph
for a single character to a PDF? Something like:

document.addCharacterGlyph('ç', image)

?

Or is there a good way to determine what font contains a particular
glyph such as ç?

One of our goals is to keep the resulting size of these PDFs to a
minimum, so we'd like some strategy that doesn't require including MBs
of font information just on the off-chance that a string we need to put
in there has such a character.

Thanks in advance,
-chris





Re: Using Pdfbox in android app.

2018-03-02 Thread Christopher Schultz
Michał,

On 3/2/18 9:05 AM, Michał Walawender wrote:
> I am an Android developer, and I would like to ask you a question about PDFBox usage in
> an app that I am currently developing. I have checked the license, but I am still
> not sure whether I can use PDFBox in my app with ads in it.

AL2 is a fairly liberal license. I don't see why you couldn't use PDFBox
in an app that contains ads.

> It is just an app which will be placed on Google Play, and I want to monetize
> it with ads inside. Can you explain whether I can use PDFBox under the
> current license, and whether I have to include any disclaimers or something?

No disclaimers are necessary, since your app is not a "derivative work".
Attribution is not strictly necessary, but would be greatly appreciated
by the project.

IANAL, YMMV, etc.

I'm curious... why would an Android app need to produce PDF files?

-chris





Re: [MISC] Cannot use the web interface from " https://lists.apache.org/list.html?users@pdfbox.apache.org " , login does not work.

2018-02-02 Thread Christopher Schultz

All,

I think you have to be in the ASF addressbook (e.g. a committer) in
order to post from lists.apache.org. Not absolutely sure, though.

- -chris

On 2/2/18 3:24 PM, Tilman Hausherr wrote:
> We can't help you there; ask Apache infra. To contact them,
> study this:
> 
> https://apache.org/dev/infra-contact
> 
> Tilman
> 
> Am 02.02.2018 um 09:09 schrieb Serban Alexe:
>> I cannot use the login interface from the web site to reply
>> directly to email threads.
>> 
>> I tried the "Sign in with Google" option, but the next screen
>> hangs on forever (see attached screenshot) . Same behaviour in
>> Firefox, MS Internet Explorer, and MS Edge.
>> 
>> Any ideas ?
>> 
>> Thanks.
>> 



Re: tips on pdf font size

2017-08-23 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

To whom it may concern,

On 8/23/17 7:01 AM, chitgoks wrote:
> hi.
> 
> if anyone has ideas regarding font size in pdf and the font size in
> html, would like to ask some inputs
> 
> i created a free text annotation using coordinates taken from
> javascript code and with html font size and a fixed font helvetica
> but the result is not the same when generated in pdf.
> 
> using the html font size value to the annotation created results in
> the text size being too big.
> 
> so it is not in pixel when it comes to pdf but point. and there are
> various things to consider when converting the equivalent point
> font size from the pixel font size.
> 
> what is the most common factor to convert the html font size to pdf
> font size in point?

What is the "font size" in the HTML document? What units are being
used there?

http://www.endmemo.com/sconvert/pixelpoint.php
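
If the HTML size turns out to be in CSS pixels, the conversion is a fixed ratio: CSS defines 96 px per inch and a point is 1/72 inch, so pt = px * 0.75. A minimal sketch under that assumption (the class and method names are made up, and it does not account for page zoom or any scaling applied when the annotation coordinates were captured):

```java
final class FontUnits {
    static final double CSS_PX_PER_INCH = 96.0;
    static final double PT_PER_INCH = 72.0;

    /** Converts a size in CSS pixels to points. */
    static double pxToPt(double px) {
        return px * PT_PER_INCH / CSS_PX_PER_INCH;
    }

    /** Converts a size in points to CSS pixels. */
    static double ptToPx(double pt) {
        return pt * CSS_PX_PER_INCH / PT_PER_INCH;
    }
}
```

With these numbers a 16 px HTML font corresponds to 12 pt, which would explain text looking too big when the pixel value is used directly as a point size.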

- -chris



Re: Limit PDF size

2017-08-02 Thread Christopher Schultz
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Tilman,

On 8/2/17 2:28 AM, Tilman Hausherr wrote:
> Am 02.08.2017 um 01:17 schrieb Christopher Schultz: Tilman,
> 
> On 8/1/17 4:42 PM, Tilman Hausherr wrote:
>>>> Am 01.08.2017 um 22:09 schrieb Christopher Schultz: Tilman,
>>>> 
>>>> On 8/1/17 3:22 PM, Tilman Hausherr wrote:
>>>>>>> The only thing that comes close to what you want is to
>>>>>>> create your PDDocument with
>>>>>>> MemoryUsageSetting.setupMixed(...) as parameter.
>>>> So that we can buffer to disk if the in-memory representation
>>>> gets too big? That sounds like a good approach, and probably
>>>> the most useful to me.
>>>> 
>>>> It also appears that I can set a maximum in-memory limit
>>>> like this:
>>>> 
>>>> MemoryUsageSetting mus =
>>>> MemoryUsageSetting.setupMainMemoryOnly(1 * 1024 * 1024);
>>>> PDDocument doc = new PDDocument(mus);
>>>> 
>>>>> Yes. Although this would mean you'd get an exception if you
>>>>> use more. That's why I recommend the mixed one. You could
>>>>> use the memory limit for stress tests, i.e. create the
>>>>> "worst" possible file and see what you need.
> I think I'm okay with an exception in these cases. As I said, our
> PDFs only end up being a few kiB in size, so I've put a 1MiB cap on
> the memory-only memory usage strategy for the time being.
> 
> I'm curious about what's being constrained, here... does PDFBox 
> estimate its current memory-usage of various PD* objects in memory
> and push to disk when that's exceeded, or does it just limit the
> amount of memory that gets used when serializing out to a stream?
> 
>> There is no estimate... it writes in the dedicated space and if
>> it is full, it's either exception (if memory only) or writing to
>> disk cache.

I get that, but I want to understand exactly what things are "counted".

>> Yes... it's mostly images, fonts and page content streams.

So, if I write an image to the PDDocument, that "counts" towards the
memory/disk limits? What about plain text? Or the
PDPage/PDPageContentStreams? If I write 1000 pages of plain-text to
the PDDocument object, will that "fill up" the limited-memory I have
configured? Or does that memory limit only count when e.g. serializing
to a "compiled" PDF file (or whatever the right terminology is)?

>> [Using built-in fonts] is even better, because it doesn't use any
>>  additional space (and is faster too). Your application is a
>> very simple one :-)
Yes, we are just taking some raw information and exporting it as a
PDF. We wanted something simple AND we wanted to have file sizes as
small as possible.

I have a related question about fonts, but I'll ask that in a separate
thread.

>> You really should worry about other things... choose one or
>> many: climate change, russian hackers, terrorism, rising interest
>> rates, traffic jams, heavy rain flooding your basement, people
>> who don't wash their hands, whatever :-)

Who says I'm not a Russian hacker/taxi driver/hedge fund manager who
pours truckloads of water on people's houses without warning?

- -chris



Re: Limit PDF size

2017-08-01 Thread Christopher Schultz

Tilman,

On 8/1/17 4:42 PM, Tilman Hausherr wrote:
> On 01.08.2017 at 22:09, Christopher Schultz wrote:
> Tilman,
> 
> On 8/1/17 3:22 PM, Tilman Hausherr wrote:
>>>> The only thing that comes close to what you want is to create
>>>> your PDDocument with MemoryUsageSetting.setupMixed(...) as
>>>> parameter.
> So that we can buffer to disk if the in-memory representation gets
> too big? That sounds like a good approach, and probably the most
> useful to me.
> 
> It also appears that I can set a maximum in-memory limit like
> this:
> 
> MemoryUsageSetting mus =
>     MemoryUsageSetting.setupMainMemoryOnly(1 * 1024 * 1024);
> PDDocument doc = new PDDocument(mus);
> 
>> Yes. Although this would mean you'd get an exception if you use
>> more. That's why I recommend the mixed one. You could use the
>> memory limit for stress tests, i.e. create the "worst" possible
>> file and see what you need.

I think I'm okay with an exception in these cases. As I said, our PDFs
only end up being a few kiB in size, so I've put a 1MiB cap on the
memory-only memory usage strategy for the time being.

I'm curious about what's being constrained, here... does PDFBox
estimate its current memory-usage of various PD* objects in memory and
push to disk when that's exceeded, or does it just limit the amount of
memory that gets used when serializing out to a stream?

>> Note that only streams are cached. Ordinary java structures (e.g.
>> maps, numbers, strings) are not.

Can you tell me a little more about that? When you say "streams are
cached", what does that mean exactly?

Or have I essentially already asked that question above?

> ... and then this should enforce a 1MiB size limit, no? I think
> that's all I want... there shouldn't be any reason for me to have
> to touch the disk: my files are really quite small. I just don't
> want something to go wrong with my client code and inadvertently go
> into an infinite loop adding "Hello World" to the document over and
> over until I have 50k pages in the PDF and an OOME on my hands.
> 
>>>> What you should do is take care not to have anything duplicated.
>>>> So if you have a company logo on every page, create your
>>>> object only once. Same for fonts.
> We have something like:
> 
> private PDFont _theFont;
> 
> ...
> contentStream.setFont(_theFont, fontSize); // setFont takes the font and a size
> contentStream.newLineAtOffset(x, y);
> contentStream.showText("Hello, world");
> ...
> 
> 
> Many many times. The Font object reference stays the same, so I'm 
> guessing that's okay and the font is used once and referenced many 
> times, right?
> 
>> Yes!
> 
>> To create small PDF files, use PDType0Font.load() instead of 
>> PDTrueTypeFont.load(), this will subset the fonts after saving.

We are using PDType1Font.FONTNAME for everything, so we aren't calling
.load for anything at all.

>>>> And try to have only one content stream per page. (We
>>>> recently had a guy who had a huge number of content streams
>>>> and wondered why his PDF was so big).
> Check: we have only one PDPageContentStream per page.
> 
> We have a single logo on the first page and nothing repeated.
> 
> Our PDFs are almost 100% plain-text with lots of whitespace (which 
> doesn't count, I know). When base64 encoded, they are typically
> only a few kb in size.
> 
> I'm mostly operating from a position of borderline unhealthy
> paranoia, but I'd rather have a bit of code added to ensure that I
> don't have to get paged in the middle of the night to restart a
> service that has suffered an OOME.
> 
>> This all sounds harmless. All the memory problems I can think of
>> were related to rendering, not PDF creation.

Sounds good.

>> We've had at least one speed complaint, but that one is solved in
>> the current version.

I'll make sure we are up-to-date.

Thanks,
-chris



Re: Limit PDF size

2017-08-01 Thread Christopher Schultz

Tilman,

On 8/1/17 3:22 PM, Tilman Hausherr wrote:
> The only thing that comes close to what you want is to create your 
> PDDocument with MemoryUsageSetting.setupMixed(...) as parameter.

So that we can buffer to disk if the in-memory representation gets too
big? That sounds like a good approach, and probably the most useful to
me.

It also appears that I can set a maximum in-memory limit like this:

MemoryUsageSetting mus =
    MemoryUsageSetting.setupMainMemoryOnly(1 * 1024 * 1024);
PDDocument doc = new PDDocument(mus);

... and then this should enforce a 1MiB size limit, no? I think that's
all I want... there shouldn't be any reason for me to have to touch
the disk: my files are really quite small. I just don't want something
to go wrong with my client code and inadvertently go into an infinite
loop adding "Hello World" to the document over and over until I have
50k pages in the PDF and an OOME on my hands.

> What you should do is take care not to have anything duplicated. So
> if you have a company logo on every page, create your object only
> once. Same for fonts.

We have something like:

private PDFont _theFont;

...
contentStream.setFont(_theFont, fontSize); // setFont takes the font and a size
contentStream.newLineAtOffset(x, y);
contentStream.showText("Hello, world");
...


Many many times. The Font object reference stays the same, so I'm
guessing that's okay and the font is used once and referenced many
times, right?


> And try to have only one content stream per page. (We recently had
> a guy who had a huge number of content streams and wondered why his
> PDF was so big).
Check: we have only one PDPageContentStream per page.

We have a single logo on the first page and nothing repeated.

Our PDFs are almost 100% plain-text with lots of whitespace (which
doesn't count, I know). When base64 encoded, they are typically only a
few kb in size.

I'm mostly operating from a position of borderline unhealthy paranoia,
but I'd rather have a bit of code added to ensure that I don't have to
get paged in the middle of the night to restart a service that has
suffered an OOME.

Thanks for the pointers.

-chris

> On 01.08.2017 at 20:04, Christopher Schultz wrote:
> All,
> 
> We use PDFBox on a server that must handle many transactions with 
> (somewhat) limited memory. I'd like to limit the amount of memory
> used to generate our PDFs, which we then serialize to byte-array, 
> base64-encode, etc. for ultimate delivery to some endpoint.
> 
> I can obviously limit the number of bytes produced by using a 
> size-limited OutputStream passed-into
> PDDocument.save(OutputStream), but I'm wondering if PDFBox has any
> facilities within it to limit the size of the object-tree in memory
> (or estimate its size, and we can stop operations when it reaches a
> certain size) so that we don't end up with a multi-GB object-tree
> that then fails to serialize to byte[] because it is too big.
> 
> We are building our PDF documents from scratch, starting with the
> page definitions, fonts, etc. then adding titles, paragraphs of
> text, etc. It's all fairly straightforward, and we have full
> control over the whole process up to and including the call to 
> PDDocument.save(OutputStream).
> 
> We are manually constructing our pages as well, so I suppose we
> could simply limit the number of pages, but I'm more concerned
> about the size of the memory used and not the number of pages.
> 
> Is there anything in PDFBox that can help us with this? We can
> always count e.g. the number of bytes/characters we have written to
> the PDF, but that seems less important than what is going on inside
> of the PDF structure itself.
> 
> -chris

Limit PDF size

2017-08-01 Thread Christopher Schultz

All,

We use PDFBox on a server that must handle many transactions with
(somewhat) limited memory. I'd like to limit the amount of memory used
to generate our PDFs, which we then serialize to byte-array,
base64-encode, etc. for ultimate delivery to some endpoint.

I can obviously limit the number of bytes produced by using a
size-limited OutputStream passed-into PDDocument.save(OutputStream),
but I'm wondering if PDFBox has any facilities within it to limit the
size of the object-tree in memory (or estimate its size, and we can
stop operations when it reaches a certain size) so that we don't end
up with a multi-GB object-tree that then fails to serialize to byte[]
because it is too big.
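
For anyone reading the archive, here is a rough sketch of the size-limited
OutputStream idea mentioned above. The class names SizeLimitedOutputStream
and SizeLimitDemo are made up for illustration and are not part of PDFBox;
the sketch uses only java.io. In real use the wrapper would be handed to
PDDocument.save(OutputStream), and it only caps the serialized output, not
the in-memory object tree.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper: refuses further writes once a byte budget is used up. */
class SizeLimitedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long written;

    SizeLimitedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException {
        ensureRoom(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        ensureRoom(len);
        out.write(buf, off, len);
        written += len;
    }

    long bytesWritten() {
        return written;
    }

    // Fails before the write happens, so the target never exceeds the limit.
    private void ensureRoom(long extra) throws IOException {
        if (written + extra > maxBytes) {
            throw new IOException("Output exceeds limit of " + maxBytes + " bytes");
        }
    }
}

class SizeLimitDemo {
    /** Returns true if the payload can be written without exceeding the limit. */
    static boolean fits(byte[] payload, long limit) {
        try (OutputStream out =
                 new SizeLimitedOutputStream(new ByteArrayOutputStream(), limit)) {
            out.write(payload);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fits(new byte[512], 1024));
        System.out.println(fits(new byte[2048], 1024));
    }
}
```

The check runs before each write so that a runaway document fails fast with
an IOException instead of growing the byte array first.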

We are building our PDF documents from scratch, starting with the page
definitions, fonts, etc. then adding titles, paragraphs of text, etc.
It's all fairly straightforward, and we have full control over the
whole process up to and including the call to
PDDocument.save(OutputStream).

We are manually constructing our pages as well, so I suppose we could
simply limit the number of pages, but I'm more concerned about the
size of the memory used and not the number of pages.

Is there anything in PDFBox that can help us with this? We can always
count e.g. the number of bytes/characters we have written to the PDF,
but that seems less important than what is going on inside of the PDF
structure itself.

-chris



Re: catch(IOException | COSVisitorException e)

2017-07-15 Thread Christopher Schultz
Steve,

On 7/14/17 6:31 AM, Steve Carr wrote:
> Thanks for the info.
> The user uses Java 6 interpreter.
> Do you know if the 2.0.* versions of PDFBox will work on Java 6
> platforms?
> If not, how can I find code which will run on Java 6?

It's worth pointing out that multi-catch is just syntactic sugar: the
compiled code behaves the same as it would have if the catch blocks had
been separate (except that you'd see different source lines in any stack
traces).

So just because PDFBox 2 requires Java 7 to compile, it doesn't
necessarily require Java 7 to run: that depends on the class-file
version the jars were built to target.
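
To make the point concrete, here is a small sketch of the same error
handling written both ways. The class CatchStyles and its risky() method
are invented for illustration and have nothing to do with PDFBox; the
second form uses only syntax that a Java 6 compiler accepts.

```java
import java.io.IOException;

class CatchStyles {
    /** Hypothetical operation that can fail in two different ways. */
    static void risky(int mode) throws IOException {
        if (mode == 1) {
            throw new IOException("io failure");
        }
        if (mode == 2) {
            throw new IllegalStateException("bad state");
        }
    }

    /** Multi-catch: needs Java 7 or later source level. */
    static String multiCatch(int mode) {
        try {
            risky(mode);
            return "ok";
        } catch (IOException | IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    /** Same handling with separate catch blocks: valid Java 6 source. */
    static String separateCatch(int mode) {
        try {
            risky(mode);
            return "ok";
        } catch (IOException e) {
            return "failed: " + e.getMessage();
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 2; mode++) {
            System.out.println(multiCatch(mode) + " / " + separateCatch(mode));
        }
    }
}
```

Splitting a multi-catch into separate blocks like this is a mechanical
change when some code has to be compiled with an older compiler.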

-chris
> On 2017-06-26 19:45 (+0100), Tilman Hausherr wrote:
>> Use the source code examples from the source code download, not from
>> some third party websites.
>>
>> Btw COSVisitorException no longer exists in the 2.0.* versions.
>>
>> Tilman
>>
>> On 26.06.2017 at 11:41, Steve Carr wrote:
>>> import java.io.IOException;
>>> import org.apache.pdfbox.exceptions.COSVisitorException;
>>> import org.apache.pdfbox.pdmodel.PDDocument;
>>> import org.apache.pdfbox.pdmodel.PDPage;
>>>
>>> /**
>>>  *
>>>  * @author Azeem
>>>  * @Email az...@radixcode.com
>>>  */
>>>
>>> When I compile the following code in netbeans I get
>>> Uncompilable source code - package org.apache.pdfbox.exceptions does not
>>> exist in relation to catch(IOException | COSVisitorException e)
>>>
>>> I downloaded pdfbox-1.6.0-src.zip
>>> help
>>> steve
>>>
>>> public class Main {
>>>
>>>     public static void main(String[] args) {
>>>         System.out.println("Create Simple PDF file with blank Page");
>>>         String fileName = "EmptyPdf.pdf"; // name of our file
>>>         try {
>>>             PDDocument doc = new PDDocument(); // creating instance of pdfDoc
>>>             doc.addPage(new PDPage()); // adding page in pdf doc file
>>>             doc.save(fileName); // saving as pdf file with name perm
>>>             doc.close(); // cleaning memory
>>>             System.out.println("your file created in : " +
>>>                     System.getProperty("user.dir"));
>>>         }
>>>         catch(IOException | COSVisitorException e) {
>>>             System.out.println(e.getMessage());
>>>         }
>>>     }
>>> }


