
Re: Issue with PDFBox 3.0.0 - Unable to Extract and Add Pages

2024-04-07 Thread Andreas Lehmkühler


The issue was fixed and is part of the current 3.0.2 version of PDFBox.

Andreas

On 27.02.24 at 10:11, Tilman Hausherr wrote:

Hi,

It's like Fabian said.

Btw neither the code here nor the different(!) code in 
https://stackoverflow.com/questions/78065676/ would enable anybody to 
reproduce such a bug because it's incomplete.


Until we get this fixed, please stay with 2.0.* (2.0.30 is the current 
version), and also update your jdk, 1.8.0_91 is from 2016. The current 
version is 1.8.0_402.

You can also try a snapshot here from time to time:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/
Tilman

On 27.02.2024 08:55, Amber Prakash Verma wrote:

Dear PDFBox Team,

I hope this email finds you well. I am writing to report an issue I 
encountered while using PDFBox version 3.0.0. It appears that there is 
a problem when attempting to extract pages from one PDF and add them 
to another PDF.
With the same code and PDFBox version 2.0.29, it works perfectly and
the output PDF contains no blank pages.






-
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org

Re: Text extraction from a certain PDF does not seem to terminate

2024-04-06 Thread Andreas Lehmkühler


Hi,

On 03.04.24 at 15:53, Brangs, Erik wrote:

Hi,

when attempting text extraction from the PDF at https://d-nb.info/1324982411/34 ,
using either PDFBox 3.0.0 or PDFBox 4.0.0-SNAPSHOT, the extraction uses about
1.8 GB of heap memory and does not seem to terminate. I cancelled the extraction
attempt after roughly 20 minutes. Is this another bad PDF or is there a bug in
PDFBox?


Thanks for the report. As Tilman already pointed out, the described
behavior is a performance regression and was fixed recently, see [1] for
details.


Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5799




--
Erik Brangs
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
60322 Frankfurt am Main
Telefon: +49 69 1525-1792
Telefax: +49 69 1525-1799
mailto:e.bra...@dnb.de
https://www.dnb.de




Re: Lost xref table on two PDF merge

2024-04-04 Thread Andreas Lehmkühler


Hi,

which version of PDFBox are you using?

Did you save the merged PDF before trying to fix the signature? The
resulting PDF should have a valid xref table.


Andreas

On 04.04.24 at 15:09, František Šimon wrote:

Hello,

  


I have run into a problem while trying to fix an invalid certificate in
a PDF that I am processing.

I have two PDFs which I am merging together. The first one is a one-page
template where I fill in some data, and the second one is a multi-page PDF
which contains a corrupted signature.

When I merge these PDFs first and then search for the corrupted signature,
I get nothing from document.getDocument().getObjectsByType(COSName.SIG),
since the xref table of the second PDF is replaced with the one from the
first.

I can work around it in my code, but is there a way to keep the xref tables
from both PDFs after the merge?

  


With best regards,

Frantisek Simon





Re: Type 0 font - Text extraction X PDF Debugger

2024-03-25 Thread Andreas Lehmkühler

On 25.03.24 at 10:07, Tilman Hausherr wrote:

On 25.03.2024 07:48, Andreas Lehmkühler wrote:

Thanks for the URLs. All of them are working with my change.

See https://issues.apache.org/jira/browse/PDFBOX-5790 for further
details.

@Tilman Please run your tests if possible

No regressions 

Cool, thanks for the retest

Tilman

Andreas

On 24.03.24 at 16:39, Tilman Hausherr wrote:

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D

The extension p1 / p3 means I split these files and used only one
page for my own tests.

Tilman

On 24.03.2024 16:19, Andreas Lehmkühler wrote:

On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And
some of the other tools you tried also mention this)

Just adding true results in text extraction of several files no
longer being correct: 433525-p1.pdf,
O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf, PDFBOX-5540.pdf,
R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf

I've found a solution which works with the provided PDF and with
PDFBOX-5540.pdf.

@Tilman I guess the other files are from our test corpus? If so,
where exactly can I find them?

Andreas

Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()"
brings no regressions but your text is not extracted properly.

Maybe it is possible to include yet another rule for your file, but
there's likely more to do and there is the risk that other files no
longer extract properly.

Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
It seems that PDFBOX-5540 resolves a special case based on some
dictionary properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain 1 or more blocks of beginbfchar/endbfchar.

The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100> 256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
kind of Identity mapping too, except for the 0x00...), isn't it?

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.

If I make this "dumb" modification (adding "true" to conditions) just for
a rapid test

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I've got "BCD" string like all the others

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações: BCD

Maybe the text extraction tool should be using the
begincidrange/endcidrange information...

What do you think?

PS.: I've read some parts of ISO 32000-2:2020 but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:

Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!

On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr wrote:

Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman

On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

Thank you very much for your attention!

You can find the file "p4_alt.pdf" in this folder:
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and
others.

I'm sorry, I sent the PDF file as an attachment in my first message,
but I didn't know that it wouldn't work.

On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <

Re: Type 0 font - Text extraction X PDF Debugger

2024-03-25 Thread Andreas Lehmkühler


Thanks for the URLs. All of them are working with my change.

See https://issues.apache.org/jira/browse/PDFBOX-5790 for further details.

@Tilman Please run your tests if possible

Andreas

On 24.03.24 at 16:39, Tilman Hausherr wrote:

Here they are, remove the XXX

https://corpora.tika.apache.org/XXXbase/docs/govdocs1/433/433525.pdf
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/O2/O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP
https://corpora.tika.apache.org/XXXbase/docs/commoncrawl3/R4/R4EXG25W532JHDQLJAM4HF6O532TLR7D

The extension p1 / p3 means I split these files and used only one page 
for my own tests.


Tilman


On 24.03.2024 16:19, Andreas Lehmkühler wrote:



On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And some
of the other tools you tried also mention this)

Just adding true results in text extraction of several files no
longer being correct: 433525-p1.pdf,
O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf, PDFBOX-5540.pdf,
R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf

I've found a solution which works with the provided PDF and with
PDFBOX-5540.pdf.

@Tilman I guess the other files are from our test corpus? If so, where
exactly can I find them?


Andreas



Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" 
brings no regressions but your text is not extracted properly.


Maybe it is possible to include yet another rule for your file, but 
there's likely more to do and there is the risk that other files no 
longer extract properly.


Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
It seems that PDFBOX-5540 resolves a special case based on some
dictionary properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain 1 or more blocks of beginbfchar/endbfchar.

The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100>  256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
kind of Identity mapping too, except for the 0x00...), isn't it?

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.

If I make this "dumb" modification (adding "true" to conditions) just for
a rapid test

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I've got "BCD" string like all the others

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD

Maybe the text extraction tool should be using the begincidrange/endcidrange
information...

What do you think?

PS.: I've read some parts of ISO 32000-2:2020 but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:


Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!


On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr wrote:

Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

  Thank you very much for your attention!

  You can find the file "p4_alt.pdf" in this folder:
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and
others.

  I'm sorry, I sent the PDF file as an attachment in my first message,
but I didn't know that it wouldn't work.

On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:


Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

  I'm

[ANNOUNCE] Apache PDFBox 2.0.31 released

2024-03-24 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.31. The release is available for download at:

https://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.31

Introduction


The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is an incremental bugfix release based on the earlier 2.0.30
release. It contains a couple of fixes and small improvements.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-2725] - [PATCH] Split pdf lose accessibility tags
[PDFBOX-5375] - Allow creating of PDFXObjectImage without accessing to the image stream
[PDFBOX-5713] - PfbParser fails to parse PFB font with multiple binary records.
[PDFBOX-5715] - Lines vanish when printing on MacOS
[PDFBOX-5718] - java.lang.IllegalArgumentException: Provided dictionary is not of type 'COSName{OCG}'
[PDFBOX-5721] - The embedded font DroidSansFallbackFull reports an error when parsing, and finally uses lastResortFont, resulting in garbled fonts.
[PDFBOX-5723] - COSName caches already cached hashCode
[PDFBOX-5727] - Font operation takes a long time with 3.0.1
[PDFBOX-5728] - NullPointerException in TTFSubsetter.buildPostTable()
[PDFBOX-5732] - Problem converting PDF to image (java.awt.color.CMMException: Can not access specified profile)
[PDFBOX-5735] - Set the default value for PDNonTerminalField
[PDFBOX-5737] - java.lang.ArrayIndexOutOfBoundsException Bug Report
[PDFBOX-5738] - Wrong colors in PDF since PDFBOX-5488
[PDFBOX-5740] - Java 7 support on 2.0
[PDFBOX-5751] - Convert to image exception
[PDFBOX-5754] - PDF conversion in this format is very slow. Is there any room for optimization?
[PDFBOX-5763] - IllegalArgumentException: -Infinity is not a finite number
[PDFBOX-5772] - Inconsistent signature page handling when signing in existing signature fields
[PDFBOX-5773] - Add leading "0" for octal values in MacOSRomanEncoding
[PDFBOX-5776] - DataFormatException: invalid distance too far back
[PDFBOX-5778] - Grayscale JPEG rendered multicolor
[PDFBOX-5781] - OutOfMemoryError in FileSystemFontsProvider.scanFonts
[PDFBOX-5782] - NPE in PageDrawer.getPaint()
[PDFBOX-5785] - Issue with embedded Font and descendant Font
[PDFBOX-5787] - LCMS error 13: Mismatched alpha channels

New Feature

[PDFBOX-5768] - Enable Native Markdown Extraction in Apache PDFBox

Improvement

[PDFBOX-5762] - When splitting, keep page destinations that are part of target document(s)
[PDFBOX-5783] - Replace Exception with some repair attempt

Task

[PDFBOX-5739] - Add test for PDFBOX-3347
[PDFBOX-5741] - Add test for PDFBOX-4106

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by a SHA512 checksum and a PGP signature
that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://www.apache.org/dist/pdfbox/KEYS.
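
As an illustration of the checksum step mentioned above, here is a minimal
sketch in plain Java that computes the SHA-512 digest of a downloaded file and
compares it with an expected hex string. The class name Sha512Check and its
methods are hypothetical, the expected value is assumed to be passed as a bare
hex string, and the file is read into memory in one go for brevity. A checksum
only detects a corrupted download; checking the PGP signature against the KEYS
file is a separate step that this sketch does not cover.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha512Check {

    /** Returns the SHA-512 digest of the given bytes as lowercase hex. */
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected on a standard JDK, which ships a SHA-512 provider.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the computed digest with an expected hex string, ignoring case and surrounding whitespace. */
    static boolean matches(byte[] data, String expectedHex) {
        return sha512Hex(data).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha512Check <archive> <expected-sha512-hex>");
            return;
        }
        byte[] archive = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(matches(archive, args[1]) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

The first argument is the path of the downloaded archive and the second is
the hex digest copied from the published checksum file.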

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit https://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit https://www.apache.org/


Re: Type 0 font - Text extraction X PDF Debugger

2024-03-24 Thread Andreas Lehmkühler





On 15.03.24 at 05:35, Tilman Hausherr wrote:
You are correct that it's the "fb" parts that are missing. (And some of
the other tools you tried also mention this)

Just adding true results in text extraction of several files no longer
being correct: 433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf,
PDFBOX-5540.pdf, R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf

I've found a solution which works with the provided PDF and with
PDFBOX-5540.pdf.

@Tilman I guess the other files are from our test corpus? If so, where
exactly can I find them?


Andreas



Adding  "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings 
no regressions but your text is not extracted properly.


Maybe it is possible to include yet another rule for your file, but 
there's likely more to do and there is the risk that other files no 
longer extract properly.


Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:
It seems that PDFBOX-5540 resolves a special case based on some
dictionary properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain 1 or more blocks of beginbfchar/endbfchar.

The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100>  256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems a
kind of Identity mapping too, except for the 0x00...), isn't it?

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.
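
To illustrate why the first range line quoted above behaves like an identity
mapping, here is a minimal sketch in plain Java. It is not PDFBox code: the
class CidRange and its methods are hypothetical names used only for this
illustration, and it handles a single complete range line of the form
"<start> <end> cid". The second range line in the message is incomplete in
the archive and is deliberately left out.

```java
final class CidRange {
    final int start;    // first character code of the range
    final int end;      // last character code of the range (inclusive)
    final int firstCid; // CID assigned to the first character code

    CidRange(int start, int end, int firstCid) {
        this.start = start;
        this.end = end;
        this.firstCid = firstCid;
    }

    /** Parses one range line such as "<0001> <00FF> 1". */
    static CidRange parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected '<start> <end> cid': " + line);
        }
        return new CidRange(hex(parts[0]), hex(parts[1]), Integer.parseInt(parts[2]));
    }

    /** Converts a token like "<00FF>" to its integer value. */
    private static int hex(String token) {
        return Integer.parseInt(token.substring(1, token.length() - 1), 16);
    }

    /** Returns the CID for a character code, or -1 if the code is outside the range. */
    int cidFor(int code) {
        return (code >= start && code <= end) ? firstCid + (code - start) : -1;
    }

    public static void main(String[] args) {
        CidRange range = parse("<0001> <00FF> 1");
        System.out.println(range.cidFor(0x0042)); // 66, the same value as the code
        System.out.println(range.cidFor(0x0000)); // -1, code 0x00 is not covered
    }
}
```

Because the range starts at code 0x0001 with CID 1, every covered code maps
to a CID of the same value, which matches the "kind of Identity mapping,
except for the 0x00" reading in the message.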

If I make this "dumb" modification (adding "true" to conditions) just for
a rapid test

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I've got "BCD" string like all the others

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD

Maybe the text extraction tool should be using the begincidrange/endcidrange
information...

What do you think?

PS.: I've read some parts of ISO 32000-2:2020 but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:


Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!


On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr wrote:


Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

  Thank you very much for your attention!

  You can find the file "p4_alt.pdf" in this folder:
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and others.

  I'm sorry, I sent the PDF file as an attachment in my first message,
but I didn't know that it wouldn't work.

On Thu, 14 Mar 2024 at 07:16, Tilman Hausherr <thaush...@t-online.de> wrote:


Hi,

please upload your file to a sharehoster.

Tilman

On 13.03.2024 20:03, Luiz Marcelo Modesto wrote:

Hi everyone,

  I'm not sure if this is the same as FAQ "How come I am getting
gibberish(G38G43G36G51G5) when extracting text?"...

  I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment
(build 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).

  I'm trying to understand how this PDF chunk (from the attached
p4_fix.pdf)

    BT
    /G1F7 6.0 Tf
    94.871 773.806 Td
    <004200430044> Tj
    ET

  becomes "BCD" in PDFBox Debugger (the same in qpdfview, Adobe
Reader, Chrome, ...) and becomes "abc" in the PDFBox text extraction
tool.

  Using Poppler's pdftotext (version 22.02.0) gives me "BCD" too.

  The renderers that allow me to copy the text give me "BCD" as well.

  It seems that the PDFBox extraction tool follows item "9.10.2
Mapping character codes to Unicode values" (ISO 32000-2:2020) but all
the others choose a different way.
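
To make the "BCD" reading concrete, here is a minimal sketch in plain Java
(no PDFBox API is involved; HexStringCodes and its methods are hypothetical
names). It assumes the font uses two-byte character codes and that each code
is read directly as a Unicode code point, which is the identity
interpretation the other viewers appear to apply. It does not reproduce the
fallback path that led to "abc".

```java
import java.util.ArrayList;
import java.util.List;

final class HexStringCodes {

    /** Splits the body of a PDF hex string (without angle brackets) into two-byte codes. */
    static List<Integer> twoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a multiple of 4 hex digits: " + hex);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    /** Identity assumption: treats each character code as a Unicode code point. */
    static String identityText(List<Integer> codes) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            text.appendCodePoint(code);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = twoByteCodes("004200430044");
        System.out.println(codes);               // [66, 67, 68]
        System.out.println(identityText(codes)); // BCD
    }
}
```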

   Could

[ANNOUNCE] Apache PDFBox 3.0.2 released

2024-03-14 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 3.0.2. The release is available for download at:

https://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 3.0.2

Introduction


The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is an incremental bugfix release based on the earlier 3.0.1
release. It contains a couple of fixes and small improvements.

A migration guide is available at
https://pdfbox.apache.org/3.0/migration.html. It is still a work in
progress and we are happy to include any valuable feedback from our
community.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-2725] - [PATCH] Split pdf lose accessibility tags
[PDFBOX-5375] - Allow creating of PDFXObjectImage without accessing to the image stream
[PDFBOX-5704] - char not rendered
[PDFBOX-5714] - PDFBox 3.0 regression: duplicate references in dictionary values
[PDFBOX-5715] - Lines vanish when printing on MacOS
[PDFBOX-5717] - NullPointerException calling saveIncrementalForExternalSigning
[PDFBOX-5721] - The embedded font DroidSansFallbackFull reports an error when parsing, and finally uses lastResortFont, resulting in garbled fonts.
[PDFBOX-5722] - Wrong scope for maven dependencies
[PDFBOX-5723] - COSName caches already cached hashCode
[PDFBOX-5724] - CharStringCommand.equals() does not conform to the contract of Object.equals
[PDFBOX-5727] - Font operation takes a long time with 3.0.1
[PDFBOX-5728] - NullPointerException in TTFSubsetter.buildPostTable()
[PDFBOX-5730] - The expected SubstFormat for ExtensionSubstFormat1 subtable is 108 but should be 1
[PDFBOX-5732] - Problem converting PDF to image (java.awt.color.CMMException: Can not access specified profile)
[PDFBOX-5733] - lookupType is to be replaced by extensionLookupType in type 7 lookup table
[PDFBOX-5735] - Set the default value for PDNonTerminalField
[PDFBOX-5737] - java.lang.ArrayIndexOutOfBoundsException Bug Report
[PDFBOX-5738] - Wrong colors in PDF since PDFBOX-5488
[PDFBOX-5742] - Split result PDFs broken
[PDFBOX-5744] - EOFException while readMultipleSubstitutionSubtable()
[PDFBOX-5745] - EOFException while readSingleLookupSubTable()
[PDFBOX-5748] - Cannot get overlayPDF working on command line interface
[PDFBOX-5751] - Convert to image exception
[PDFBOX-5752] - Font errors after copying a page to another document
[PDFBOX-5754] - PDF conversion in this format is very slow. Is there any room for optimization?
[PDFBOX-5757] - streamCacheCreateFunction not passed to PDFParser
[PDFBOX-5758] - ExceptionInInitializerError when unmapping is not supported
[PDFBOX-5760] - NPE in FIlter.decode() when called with empty list
[PDFBOX-5763] - IllegalArgumentException: -Infinity is not a finite number
[PDFBOX-5764] - Wrong chunksize when using a ByteBuffer to initialize a RandomAccessReadBuffer
[PDFBOX-5772] - Inconsistent signature page handling when signing in existing signature fields
[PDFBOX-5773] - Add leading "0" for octal values in MacOSRomanEncoding
[PDFBOX-5775] - importPage destroys annotations
[PDFBOX-5776] - DataFormatException: invalid distance too far back
[PDFBOX-5778] - Grayscale JPEG rendered multicolor
[PDFBOX-5781] - OutOfMemoryError in FileSystemFontsProvider.scanFonts
[PDFBOX-5782] - NPE in PageDrawer.getPaint()

New Feature

[PDFBOX-5768] - Enable Native Markdown Extraction in Apache PDFBox

Improvement

[PDFBOX-5729] - GsubWorkerForDevanagari and GsubWorkerForGujarati created
[PDFBOX-5762] - When splitting, keep page destinations that are part of target document(s)
[PDFBOX-5783] - Replace Exception with some repair attempt

Task

[PDFBOX-5739] - Add test for PDFBOX-3347
[PDFBOX-5741] - Add test for PDFBOX-4106

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA512 checksums and a PGP signature
that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://www.apache.org/dist/pdfbox/KEYS.

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit https://pdfbox.apache.org/

About The Apache Software Foundation

Re: Help with NullPointerException org.apache.io.IOUtils.LOG

2024-03-12 Thread Andreas Lehmkühler


Hi Matthew,

this is a known issue with 3.0.1, see [1] for further details.

The upcoming version 3.0.2 includes a fix. Unless something unforeseen
happens, the new version will be available in about 2 days from now.


Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5758


On 12.03.24 at 17:40, Matthew Hardy wrote:

Hello,

We've recently upgraded to pdfbox 3.0.1. When attempting to instantiate an 
empty PDDocument, we receive the following error.

Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.commons.logging.Log.error(Object, java.lang.Throwable)" because "org.apache.pdfbox.io.IOUtils.LOG" is null
 at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.unmapper(IOUtils.java:278)
 at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
 at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.<clinit>(IOUtils.java:64)

This is a Jakarta EE 10 EJB maven project, running on Java 17 in Wildfly 
30.0.1.Final. commons-logging 1.2 has been added as a dependency.
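
The trace above is consistent with a static initialization order problem: the
frame at IOUtils.java:64 appears to be the class's static initializer, which
calls into unmapper (IOUtils.java:278) at a point where the static LOG field
is still null. That reading is an assumption drawn from the stack trace alone,
not from the PDFBox source; PDFBOX-5758 has the authoritative details. The
plain-Java sketch below reproduces the general mechanism with hypothetical
names (InitOrderDemo, createUnmapper) and a StringBuilder standing in for the
logger.

```java
final class InitOrderDemo {

    // No initializer on purpose, so the value recorded during class
    // initialization is not overwritten afterwards.
    static boolean logWasNullDuringInit;

    // Static initializers run in textual order, so this one runs first
    // and calls createUnmapper() before LOG has been assigned.
    static final String UNMAPPER = createUnmapper();

    // Assigned only after UNMAPPER; it is still null while
    // createUnmapper() is executing.
    static final StringBuilder LOG = new StringBuilder();

    private static String createUnmapper() {
        // Calling a method on LOG here would throw a NullPointerException,
        // so the sketch only records the null state.
        logWasNullDuringInit = (LOG == null);
        return "unmapper";
    }

    public static void main(String[] args) {
        System.out.println(logWasNullDuringInit); // true
        System.out.println(LOG != null);          // true, assigned by now
    }
}
```

Declaring the logger field before any static initializer that may use it
avoids this kind of failure.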

Any help would be greatly appreciated!

Matt Hardy
Software Developer
Perform Air International
463 South Hamilton Court
Gilbert, Arizona 85233
Phone: (480) 610-3500
Fax: (480) 610-3501
matt.ha...@performair.com
www.PerformAir.com





Re: Importing landscape format and portrait format oriented pages into the same PDF causes PDF corruption

2024-03-02 Thread Andreas Lehmkühler


Hi,

I guess I've fixed https://issues.apache.org/jira/browse/PDFBOX-5752 and 
the fix works for PDFBOX-5775 as well.


@Fabian please give the newest SNAPSHOT build of 3.0.2 a try

Andreas

On 23.02.24 at 11:43, Tilman Hausherr wrote:

On 21.02.2024 16:07, Fabian Zünd SI-Solutions Gmbh wrote:
Hello, I managed to try it all out with the most current build
pdfbox-app-3.0.2-20240221.085334-88.jar

The issue persists.

Maybe I'm doing the copying of the page completely wrong?


Hi,

You did nothing wrong. Sadly, this is the problem that I mentioned in my 
last mail. I've created https://issues.apache.org/jira/browse/PDFBOX-5775


Tilman




Re: RE: Re: [External Sender] Re: PDFBox 3.0.1 compile dependency on junit-jupiter

2024-01-10 Thread Andreas Lehmkühler


Hi,

the additional compile dependency shouldn't have any influence on your
test cases as long as you don't change anything.

I'm wondering whether you followed the advice and excluded the junit
dependency?


Andreas

On 05.01.24 at 12:16, Christian Wiech via users wrote:

I just discovered that after a renovate bot update three weeks ago from 
pdfbox-3.0.0 to pdfbox-3.0.1 our builds are still green but no tests are 
executed at all. This means we were blind for about 3 weeks because of an 
automerged bugfix release.

We are not using TestNG but JUnit as provided by Spring Boot 3.x. The
tests do not fail; they are simply skipped and reported as passed. This leaves
us with a false sense of safety.
Gili's workaround for TestNG works in our case too. But to my mind this is a
major incident and should be fixed asap.
Cheers, Christian
On 2023/12/04 17:55:58 Gili Tzabari wrote:

For anyone else using TestNG for unit tests, you'll need to explicitly
exclude JUnit until this is fixed; otherwise, Surefire will refuse to
use TestNG.

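<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.1</version>
    <exclusions>
        <exclusion>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter</artifactId>
        </exclusion>
    </exclusions>
</dependency>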

Gili

On 2023-12-03 20:47, Dan Rabe wrote:

Great, thank you! We’ll look forward to seeing this in the next release!

--Dan

From: Andreas Lehmkühler
Date: Sunday, December 3, 2023 at 1:58 PM
To:users@pdfbox.apache.org
Subject: [External Sender] Re: PDFBox 3.0.1 compile dependency on junit-jupiter
solved, see [1] for further details.

Andreas

[1]https://urldefense.com/v3/__https://issues.apache.org/jira/browse/PDFBOX-5722__;!!Iz9xO38YGHZK!86ddyxmB45umUPT5RruBNFFOHrj4DuhHNvfFoJ0V1eQuJhQo9dtUS41wP9sKfM2mKCyhfjyTwkVcb52L0AYxMorg$

On 02.12.23 at 09:05, Andreas Lehmkühler wrote:

Hi,

On 01.12.23 at 17:14, Dan Rabe wrote:

It looks like a compile dependency on junit-jupiter snuck into the
3.0.1 release.

If I look at the maven page for 3.0.0 at
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.0,
junit-jupiter is listed as a test dependency.
If I look at the maven page for 3.0.1 at
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.1,
junit-jupiter is listed as a compile dependency.

As a result, the war file that I build would contain the junit
libraries. I’m assuming it’s a mistake of some sort that it got
reclassified as “compile” rather than “test”?

Your assumption is correct, it's a mistake. It was introduced with
PDFBOX-5699 which rearranged some parts of the maven build. My bad :-(

I'm going to fix that and double-check all the other components.

Thanks for the report

Andreas


Re: Text extraction from a certain PDF uses up multiple GB of memory

2023-12-14 Thread Andreas Lehmkühler


Looks like https://issues.apache.org/jira/browse/PDFBOX-5479

On 13.12.23 at 14:50, Tilman Hausherr wrote:

On 13.12.2023 11:23, Brangs, Erik wrote:

Hi,

we ran into problems when doing text extraction from the PDF 
at https://d-nb.info/1312454512/34 . We were using PDFBox 3.0.0 to extract the 
text and the text extraction used up multiple GB of memory. The problem can be 
reproduced with PDFBox 4.0.0-SNAPSHOT and PDFBox 3.0.2-SNAPSHOT. Is there room 
for improvement in text extraction in PDFBox for this case or is this just a 
badly generated PDF?

Yeah it's a weird PDF: they have different font objects that point to 
the same font file (See FontFile2). So the font is opened each time and 
all tables are read and stored. And since 3.0 we read many more tables 
than in 2.0.
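
To illustrate why this multiplies the memory use: if every font object triggers its own parse of the font file, a file referenced by N font objects is parsed and stored N times. The following is only a sketch with hypothetical names (FontFileCache, ParsedFont, fontFileId). It is not PDFBox code and not necessarily how PDFBox handles such files. It parses each font file once and shares the result.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a fully parsed font with all of its tables.
class ParsedFont {
    final String fontFileId;

    ParsedFont(String fontFileId) {
        this.fontFileId = fontFileId;
    }
}

// Hypothetical cache: one parse per font file, however many font objects point to it.
class FontFileCache {
    private final Map<String, ParsedFont> cache = new HashMap<>();
    private int parseCount = 0;

    ParsedFont get(String fontFileId) {
        ParsedFont font = cache.get(fontFileId);
        if (font == null) {
            parseCount++; // the expensive step: reading and storing all tables
            font = new ParsedFont(fontFileId);
            cache.put(fontFileId, font);
        }
        return font;
    }

    int getParseCount() {
        return parseCount;
    }
}
```

With such a cache, two font objects that reference the same font file share one ParsedFont instance, so the parse count grows with the number of distinct font files rather than with the number of font objects.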

Tilman




Re: PDFBox 3.0.1 compile dependency on junit-jupiter

2023-12-03 Thread Andreas Lehmkühler


solved, see [1] for further details.

Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5722

On 02.12.23 at 09:05, Andreas Lehmkühler wrote:

Hi,

On 01.12.23 at 17:14, Dan Rabe wrote:
It looks like a compile dependency on junit-jupiter snuck into the 
3.0.1 release.


If I look at the maven page for 3.0.0 at 
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.0, 
junit-jupiter is listed as a test dependency.
If I look at the maven page for 3.0.1 at 
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.1, 
junit-jupiter is listed as a compile dependency.


As a result, the war file that I build would contain the junit 
libraries. I’m assuming it’s a mistake of some sort that it got 
reclassified as “compile” rather than “test”?
Your assumption is correct, it's a mistake. It was introduced with 
PDFBOX-5699 which rearranged some parts of the maven build. My bad :-(


I'm going to fix that and double-check all the other components.

Thanks for the report

Andreas


Re: PDFBox 3.0.1 compile dependency on junit-jupiter

2023-12-02 Thread Andreas Lehmkühler


Hi,

On 01.12.23 at 17:14, Dan Rabe wrote:

It looks like a compile dependency on junit-jupiter snuck into the 3.0.1 
release.

If I look at the maven page for 3.0.0 at 
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.0, 
junit-jupiter is listed as a test dependency.
If I look at the maven page for 3.0.1 at 
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/3.0.1, 
junit-jupiter is listed as a compile dependency.

As a result, the war file that I build would contain the junit libraries. I’m 
assuming it’s a mistake of some sort that it got reclassified as “compile” 
rather than “test”?
Your assumption is correct, it's a mistake. It was introduced with 
PDFBOX-5699 which rearranged some parts of the maven build. My bad :-(


I'm going to fix that and double-check all the other components.

Thanks for the report

Andreas


[ANNOUNCE] Apache PDFBox 3.0.1 released

2023-11-30 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 3.0.1. The release is available for download at:

https://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 3.0.1

Introduction


The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is an incremental bugfix release based on the earlier 3.0.0 
release. It contains a couple of fixes and small improvements.


A migration guide is available at 
https://pdfbox.apache.org/3.0/migration.html. It is still a work in 
progress and we are happy to include any valuable feedback from our 
community.


For more details on these changes and all the other fixes and 
improvements included in this release, please refer to the following 
issues on the PDFBox issue tracker at 
https://issues.apache.org/jira/browse/PDFBOX.


Sub-task
[PDFBOX-5663] - Implement "about" dialog

Bug
[PDFBOX-5350] - Regression unicode mapping in Korean document
[PDFBOX-5649] - NPE in DomXmpParser.parseLiDescription
[PDFBOX-5654] - Avoid NPE when processing CFF2 based fonts
[PDFBOX-5658] - IllegalArgumentException: Dimensions (width=458477041 height=26) are too large
[PDFBOX-5662] - Can not see checkbox check
[PDFBOX-5665] - NPE when converting pdf to image.
[PDFBOX-5666] - error encountered in splitting pdf using ver 3.0.0
[PDFBOX-5668] - NullPointerException in XMPMetadata.getSchema()
[PDFBOX-5672] - PDFToImage might not correctly detect unsupported image formats
[PDFBOX-5673] - Refactor Stream operations and operations on collections
[PDFBOX-5681] - ConcurrentModificationException in getObjectsByType() in 3.x
[PDFBOX-5682] - Long/permanent hang in PDFBox 3.x
[PDFBOX-5684] - Font cache isn't effective on my machine, always rebuilds
[PDFBOX-5687] - PDFBox 3.0 OSGi bundle requires sun.java2d.cmm.kcms package
[PDFBOX-5689] - Many new warnings "newGlyph ... newValue: ... is trying to override the oldValue" after upgrade to V3.0.0
[PDFBOX-5694] - PDF to Image conversion results in different converted image
[PDFBOX-5696] - COSStream lost, becomes a COSDictionary
[PDFBOX-5702] - Text in a certain font is lost when converting pdf to image
[PDFBOX-5706] - Incorrect colors in image from PDFs (DCTDecode)
[PDFBOX-5707] - Avoid NPE when accessing the elements of a COSArray
[PDFBOX-5712] - Stackoverflow in split
[PDFBOX-5713] - PfbParser fails to parse PFB font with multiple binary records.
[PDFBOX-5718] - java.lang.IllegalArgumentException: Provided dictionary is not of type 'COSName{OCG}'

New Feature

[PDFBOX-5670] - Allow repeatable subcommands in the command line tools
[PDFBOX-5683] - Inconsistent/incomplete PDF rendering

Improvement

[PDFBOX-4892] - Improve code quality (4)
[PDFBOX-5664] - 3.0.0: PDFCloneUtility needs a protected constructor to be useable outside of PDFBox when using Java 9 JPMS
[PDFBOX-5685] - Reduce number of copies to lower memory footprint
[PDFBOX-5693] - Consolidate bouncycastle configuration
[PDFBOX-5699] - Consistent scm.url values for pom.xml
[PDFBOX-5703] - use comparison operators for enums
[PDFBOX-5705] - update log4j dependency to 2.21.0
[PDFBOX-5711] - Loader: add support for java.nio.file.Path

Test

[PDFBOX-5667] - Can't create test for ExtractText command line tool

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA512 checksums and a PGP signature
that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://www.apache.org/dist/pdfbox/KEYS.

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit https://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit https://www.apache.org/


Re: Odd OCG error

2023-11-21 Thread Andreas Lehmkühler





On 21.11.23 at 21:26, John Lussmyer wrote:

Ugh, formatting mess.
For more info, this is the "addOCGs:OCG" log line just before the error 
message:


10:53:09.765 [etrix SwingWorker[0]] DEBUG ImposedPDFEngine - addOCGs: 
OCG 
COSDictionary{COSName{Name}:COSObject{COSNull{}};COSName{Type}:COSObject{COSName{OCG}};}
The value for the type is an indirect object. Usually such values are 
direct objects. The type check fails as it expects a direct object as 
the type value.
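
The difference can be sketched with a small stand-alone model. The classes below (PdfValue, PdfName, PdfIndirect, TypeCheck) are hypothetical stand-ins and not the PDFBox COS classes. They only show why comparing the raw dictionary value fails when that value is an indirect reference, and why resolving the reference first succeeds.

```java
import java.util.Objects;

// Hypothetical model of a PDF value; not the PDFBox COS classes.
interface PdfValue {
}

// A direct name object such as /OCG.
final class PdfName implements PdfValue {
    final String text;

    PdfName(String text) {
        this.text = text;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof PdfName && ((PdfName) other).text.equals(text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(text);
    }
}

// An indirect reference that points to another value.
final class PdfIndirect implements PdfValue {
    final PdfValue target;

    PdfIndirect(PdfValue target) {
        this.target = target;
    }
}

class TypeCheck {
    static final PdfName OCG = new PdfName("OCG");

    // Compares the raw value: fails for an indirect reference to /OCG.
    static boolean isOcgStrict(PdfValue type) {
        return OCG.equals(type);
    }

    // Follows indirect references before comparing.
    static boolean isOcgResolved(PdfValue type) {
        PdfValue current = type;
        while (current instanceof PdfIndirect) {
            current = ((PdfIndirect) current).target;
        }
        return OCG.equals(current);
    }
}
```

The dictionary in the log line above corresponds to the indirect case: the strict comparison rejects it although the referenced value is the name OCG.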






On 11/21/2023 10:56 AM, John Lussmyer wrote:
I'm using PDFBox 3.0.0 to combine some PDF files.  One of the files 
uses an Optional Content Group.
Note that this code has been working just fine for many other files 
both with and without OCG's.


For this file, I get this exception:

java.lang.IllegalArgumentException: Provided dictionary is not of type 
'COSName{OCG}'


    at org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup.<init>(PDOptionalContentGroup.java:48) ~[pdfbox-3.0.0.jar:3.0.0]


Code:

    if (obj instanceof COSDictionary) {
        COSDictionary dict = (COSDictionary) obj;
        COSName dType = dict.getCOSName(COSName.TYPE);
        if (dType == null) {
            continue;
        }
        if (dType.equals(COSName.OCG)) {
            log.debug("addOCGs: OCG {}", dict);
            PDOptionalContentGroup grp = new PDOptionalContentGroup(dict);
            ocProps.addGroup(grp);
            ocProps.setGroupEnabled(grp, layersON.contains(grp.getName()));
            changed = true;
        }
    }

 It's failing on the "new PDOptionalContentGroup(dict)" call.
Any ideas on why?




[ANNOUNCE] Apache PDFBox 2.0.30 released

2023-11-05 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.30. The release is available for download at:

https://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.30

Introduction


The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is an incremental bugfix release based on the earlier 2.0.29 
release. It contains a couple of fixes and small improvements.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-5350] - Regression unicode mapping in Korean document
[PDFBOX-5359] - Operators "q" and "Q" should also preserve text matrices
[PDFBOX-5623] - Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided
[PDFBOX-5627] - Fonts are not subsetted when saving incrementally
[PDFBOX-5628] - Bug in PDFMergerUtility#mergeFields
[PDFBOX-5639] - Password protected PDF opens in GUI apps but PDFbox says invalid password
[PDFBOX-5642] - Wrong error message "2.4.1 : Invalid Color space, The operator "rg" can't be used with CMYK Profile"
[PDFBOX-5644] - Make FDF annotations more compliant with the specification
[PDFBOX-5649] - NPE in DomXmpParser.parseLiDescription
[PDFBOX-5651] - Regression: NoSuchElementException in PDFXrefStreamParser
[PDFBOX-5653] - The PageDrawer.strokePath method is blocked, and cpu100%
[PDFBOX-5654] - Avoid NPE when processing CFF2 based fonts
[PDFBOX-5658] - IllegalArgumentException: Dimensions (width=458477041 height=26) are too large
[PDFBOX-5662] - Can not see checkbox check
[PDFBOX-5665] - NPE when converting pdf to image.
[PDFBOX-5668] - NullPointerException in XMPMetadata.getSchema()
[PDFBOX-5672] - PDFToImage might not correctly detect unsupported image formats
[PDFBOX-5684] - Font cache isn't effective on my machine, always rebuilds
[PDFBOX-5694] - PDF to Image conversion results in different converted image
[PDFBOX-5702] - Text in a certain font is lost when converting pdf to image
[PDFBOX-5706] - Incorrect colors in image from PDFs (DCTDecode)

New Feature

[PDFBOX-5683] - Inconsistent/incomplete PDF rendering

Improvement

[PDFBOX-4892] - Improve code quality (4)
[PDFBOX-5630] - Add PDRectangle#TABLOID paper size
[PDFBOX-5631] - Support version 0.5 of MaximumProfileTable
[PDFBOX-5632] - loca-table isn't mandatory for TTF/OTF-fonts using CFF outlines
[PDFBOX-5636] - Implement PDF 2.0 dash phase clarification
[PDFBOX-5637] - Add getter and setter for the CO array under PDAcroForm
[PDFBOX-5645] - Make UTC timezone static
[PDFBOX-5650] - Facilitate migration to PDFBox 3.0
[PDFBOX-5693] - Consolidate bouncycastle configuration
[PDFBOX-5699] - Consistent scm.url values for pom.xml
[PDFBOX-5703] - use comparison operators for enums

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by a SHA512 checksum and a PGP signature
that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://www.apache.org/dist/pdfbox/KEYS.

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit https://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit https://www.apache.org/


Re: PII data

2023-10-16 Thread Andreas Lehmkühler

PDFBox doesn't send any information anywhere. Everything is done locally 
on your machine.


On 16.10.23 at 23:14, Ward Dixon wrote:

Hello, does anyone know if PDFBox sends any information outside of my network 
from the PDF it is creating? I'm concerned about Personally Identifiable 
Information (PII) being inadvertently sent outside of my organization.





Re: empty/missing pdf content

2023-10-16 Thread Andreas Lehmkühler





On 16.10.23 at 23:43, Pados Attila wrote:
I fixed the issue with the missing input pdf file, and also reran this test 
project with the latest 3.0.1-SNAPSHOT version (Oct 5th).

So far, the character distortion remains,
That isn't the most recent version. The ticket was created on Oct 7th 
and the last change was committed on Oct 13th. Please retry with a more 
recent version.






original text to print is site-1-1, I added this on purpose, as it was 
failing on the real application too.
Last time I tried to reproduce the "missing content" problems, and that 
didn't work, so I don't have test code that reproduces it either 
(yet).



On Sat, Oct 14, 2023 at 5:04 AM Tilman Hausherr wrote:


Hi,

That has now been fixed as well so you don't need that call if you're
using the snapshot.

Another small thing I noticed, which didn't play a role but is weird and
you should fix it: you didn't close the content stream before
flattening, you did so AFTER. This may or may not bring weird effects.

Tilman

On 12.10.2023 05:03, Tilman Hausherr wrote:
 > Hi,
 >
 > That one has been solved by now but it turns out I had discovered a
 > different bug than the one you have. Yours is similar to
 > PDFBOX-5489.
 > Please call
 >
 >

targetDoc.getDocument().setHighestXRefObjectNumber(sourceDoc.getDocument().getHighestXRefObjectNumber());
 >
 >
 > and then it works.
 >
 > @Andreas should the call that is in importPage() also be added to
 > addPage() ?
 >
 > Tilman
 >
 > On 07.10.2023 12:44, Tilman Hausherr wrote:
 >> I was able to reduce your test even further, and created an issue in
 >> JIRA:
 >> https://issues.apache.org/jira/browse/PDFBOX-5696
 >>
 >> Tilman
 >>
 >>
 >> On 07.10.2023 11:24, Tilman Hausherr wrote:
 >>> The file "/pdf/Template.pdf" is missing in both projects.
 >>>
 >>> So it produces only one file. There is a difference, 29 has the
 >>> Poppins_Semibold font embedded and the newer one doesn't.
 >>>
 >>>
Root/Pages/Kids/[0]/Resources/XObject/Form4/Resources/Font/Poppins-SemiBold
 >>>
 >>>
 >>> Tilman
 >>>
 >>> On 03.10.2023 21:47, Pados Attila wrote:
  Hi, here is the repository with test/reproduce code:
  https://github.com/padisah/pdfboxtests

 
  Here I am reproducing a character displacement problem: text that
  includes
  '-' sign, they are shifted from position.
  There will be more cases added, with missing content.
 
 
 
  On Tue, Sep 26, 2023 at 3:04 PM Pados Attila wrote:
 
 > Hi, so far the team delayed swapping pdfbox version, so I can
only
 > work on this on my own.
 >
 > I will make a simple command line application, or a unit
test, that
 > would imitate what the webapp does, using pdfbox 3, and first
 > reproduce the error there.
 > But it may take several weeks, as I have little free time left.
 >
 > On Sun, Sep 24, 2023 at 1:33 PM Tilman Hausherr wrote:
 >> Please share the smallest possible code to reproduce the
problem,
 >> and
 >> additional files if needed. (Please make our life easy and test
 >> whether
 >> it can be reproduced without extra files)
 >>
 >> The AB_Manuel_Test.pdf file has the font file missing, this
does
 >> look
 >> similar to the problem fixed recently.
 >>
 >> Tilman
 >>
 >> On 20.09.2023 20:23, Pados Attila wrote:
 >>>
https://drive.google.com/file/d/1LD0joGW9OnrXFPaY-HXZkwyKfFoCIe5L/view 

 >>>
 >>>
 >>> sorry, I was in a hurry
 >>>
 >>>
 >>> On Wed, Sep 20, 2023 at 5:35 PM sahy...@fileaffairs.de wrote:
 >>>
  Dear Attila,
 
  both links point to the same file. The link to the PDFBox
  generated
 > one
  is missing.
 
  BR
  Maruan
 
  On Tuesday, 19.09.2023 at 20:43 +0200, Pados Attila wrote:
 > Template pdf
 >
 >
 >

https://drive.google.com/file/d/1mbvN9RDKoesy0tJbj3GCO4VkMPjxYw5c/view?usp=sharing

Re: Looking for a Debugger that can show which incremental save an object belongs to

2023-10-07 Thread Andreas Lehmkühler




On 07.10.23 at 06:43, John Lussmyer wrote:

I doubt there is a way.
It's most likely that the signing code makes a MD5 checksum (or similar) 
of the file when it is signed.
If the file is changed, checking the signing will re-calculate the 
checksum and find that it is different.  There isn't any info on what 
changed, just that SOMETHING changed.

IMHO there are two possible cases of manipulation ...

First, someone changed the signed part of a pdf so that the checksum is 
altered and doesn't match with the checksum when signing the pdf. In 
such cases it is hard to say which object was altered without doing a diff.


Second, someone adds some content to a signed pdf using incremental 
save. In such cases the signed part itself is still intact w.r.t. the 
signature, but the appended part isn't covered by it unless the pdf is 
signed a second time. In such cases the objects in question are at the 
end of the pdf, simply appended to the original pdf.


I guess Marc's question is about the second one.

PDFBox doesn't store the information about the origin of the xref entry 
so that we are not able to mark objects added by an incremental update.


For now, Tilman's suggestion to use an editor of your choice to inspect 
the pdf is the way to go. As I said, the objects you are looking for 
are at the end of the pdf, right after the end of the original pdf.
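
As a rough aid for that manual inspection, the end of each revision can be located by searching for the "%%EOF" markers. The following is a minimal sketch and not part of PDFBox. The class name RevisionOffsets is made up for illustration, and the approach is a simplification: a "%%EOF" byte sequence could also occur inside stream data, and the ByteRange of the signature remains the authoritative description of what was signed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the offsets just past each "%%EOF" marker.
class RevisionOffsets {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    static List<Integer> revisionEnds(byte[] pdf) {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i + EOF_MARKER.length <= pdf.length; i++) {
            if (matchesAt(pdf, i)) {
                ends.add(i + EOF_MARKER.length);
                i += EOF_MARKER.length - 1; // continue after the marker
            }
        }
        return ends;
    }

    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[pos + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }
}
```

If the list has more than one entry, everything after the second to last offset was appended by the last incremental save, which is where to start looking in the editor.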



Andreas


On 10/6/2023 8:50 PM, Tilman Hausherr wrote:

On 06.10.2023 19:50, Marc Kaufman wrote:
I find myself debugging PDF files where Acrobat claims "Document has 
been altered or corrupted since it was signed." I would dearly love 
to see which objects belong to the last xref (color code is OK). Has 
anyone added that feature to PDF Debugger, or know where I can find 
one? Just comparing revisions is not enough, since sometimes the 
"changed" object is identical to the same object in the previous 
revision. 


I don't know of any. I research such questions the hard way, with 
NOTEPAD++.





Re: how to replace MemoryUsageSetting.setupMixed(100mb) ?

2023-10-07 Thread Andreas Lehmkühler





On 06.10.23 at 00:07, Pados Attila wrote:

I am using something like this:

PDDocument a1doc = Loader.loadPDF(
        new RandomAccessReadBuffer(resourceAsStream),
        () -> new ScratchFile(MemoryUsageSetting.setupMixed(100)));

(I use it with tempFileOnly, but the rest are the same)


Be aware that all of this doesn't have any impact on the memory 
footprint if you are simply reading the pdf, such as rendering or text 
extraction.
Starting with 3.0.0 the stream cache is only used for operations such as 
writing, creating or manipulating a pdf.




On Thu, Oct 5, 2023 at 9:50 PM John Lussmyer  wrote:


I'm trying to update to the latest PDFBox 3.0.0.
The code was using a call to
loadPDF(file, MemoryUsageSetting.setupMixed(MB100)); // 100 MB

I see that that no longer exists, but the only mention of it doesn't
seem to provide any info on how to configure an equivalent replacement?

Any suggestions?



Re: RandomAccessReadBuffer performance issues with inputStreams in 3.0

2023-09-17 Thread Andreas Lehmkühler




On 28.08.23 at 13:30, bnncdv wrote:

When migrating from 2.0 to 3.0 I noticed some operations were very slow,
mainly the Splitter tool. With a big-ish file it would take *a lot* more
memory/cpu (jdk8).
What exactly are you doing? I've tried to reproduce the issue and I've 
been successful with regard to the memory footprint but I can't confirm 
the higher cpu usage.


I've split the PDF spec, a 32 MB file with more than 1,300 pages, into 
2-page pdfs and I can't see any difference with regard to the cpu usage 
whether I use a file or an input stream.


However, I was able to reproduce the regression with regard to the 
memory consumption and fixed/optimized it in [1]




I believe the culprit is RandomAccessReadBuffer with inputstreams. This
fully reads the stream in 4KB chunks (not a problem), however every time

We have to do that as we need random access to the file. 2.0.x does the same.


createView(..) is called (on every PDPage access I think) it call a clone
RARB constructor, and all its ByteArray chunks are duplicate()'d which for
bigger files with many pages means *tons* of wasted objects + calls (even
if the underlying buf is the same). Simplifying that, for example by
reusing the parent bufferList rather than duplicating it uses the expected
cpu/memory (I don't know the implications though).

From simple observations Splitter seems to take 4x more cpu/heap. For
example I'd assume with a 100MB file of 300 pages (normal enough if you
deal with scanned docs) + inputstream: 100MB = 25600 chunks of 4KB * 300
pages = 7,680,000 objects created+gc'd in a short time, at least.
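
Working the numbers through: 100 MB in 4 KB chunks is 25600 chunks, and duplicating every chunk once per page view for 300 pages gives 25600 * 300 = 7,680,000 buffer objects. The sketch below reproduces that arithmetic and shows what duplicating a chunk means. It uses java.nio.ByteBuffer.duplicate() as a stand-in, and it assumes one view per page with every chunk duplicated per view, as described above. The class ChunkViews and its methods are hypothetical and not PDFBox code.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the estimate above; not PDFBox code.
class ChunkViews {
    // Number of fixed-size chunks needed to hold fileBytes (rounded up).
    static long chunkCount(long fileBytes, long chunkBytes) {
        return (fileBytes + chunkBytes - 1) / chunkBytes;
    }

    // Buffer objects created if every view duplicates every chunk.
    static long duplicatesCreated(long fileBytes, long chunkBytes, long views) {
        return chunkCount(fileBytes, chunkBytes) * views;
    }

    // Duplicates each chunk: new ByteBuffer objects over the same underlying bytes.
    static List<ByteBuffer> duplicateAll(List<ByteBuffer> chunks) {
        List<ByteBuffer> copies = new ArrayList<>(chunks.size());
        for (ByteBuffer chunk : chunks) {
            copies.add(chunk.duplicate());
        }
        return copies;
    }
}
```

The duplicates share the content of the original chunks, so the cost in this model is the number of short-lived objects and calls rather than copied bytes.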

With smaller files (few pages) this isn't very noticeable, nor with
RandomAccessReadBufferedFile (different handling). Passing a pre-read
byte[] file to RandomAccessReadBuffer works ok (minimal dupes).
RandomAccessReadBufferedFile has a builtin cache to avoid too many 
copies, see [1]



RandomAccess.createBuffer(inputStream) in alpha3 was also ok but removed in
beta1. Either way, I don't think code should be copying/duping so much and
could be restructured, specially since the migration guide hints at using
RandomAccessReadBuffer for inputStreams.
Alpha3 did the same as final version 3.0.0. The removed method was 
redundant.



Also, for RARB it'd make more sense to read chunks as needed in read()
rather than all at once in the constructor I think (faster metadata
query'ing). Incidentally, may be useful to increase the default chunk size
(or allow users to set it) to reduce fragmentation, since it's going to
read the whole thing and PDFs < 4kb aren't that common I'd say.
We have to read all data as we need random access to the pdf. In many cases 
one of the first steps is to jump to the end of the pdf to read the cross 
reference table/stream.
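
To make that first step concrete: the offset of the cross reference table/stream is stated after the "startxref" keyword near the end of the file, so a reader needs the end of the data before it can resolve anything else. The following is a simplified sketch and not the PDFBox parser. The class name StartXrefFinder is made up for illustration, and a real parser would only scan the tail of the file instead of converting everything to a string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: finds the offset stated after the last "startxref" keyword.
class StartXrefFinder {
    private static final String KEYWORD = "startxref";

    // Returns the stated offset, or -1 if no keyword with a number is found.
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf(KEYWORD);
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + KEYWORD.length();
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        return pos > start ? Long.parseLong(text.substring(start, pos)) : -1;
    }
}
```

With an input stream the keyword only becomes available once the last bytes have arrived, which is why the data is read completely up front.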



(I don't have a publishable example at hand but can be easily replicated by
using the PDFMergerUtility and joining the same non-tiny PDF xN times, then
splitting it).
There has to be something special about your use case and/or pdf as I 
can't reproduce the cpu issue, see above.



Andreas


Thanks.




[1]  https://issues.apache.org/jira/browse/PDFBOX-5685


[ANNOUNCE] Apache PDFBox 1.8.x End-Of-Life (EOL) Announcement

2023-08-19 Thread Andreas Lehmkühler


The Apache PDFBox Team would like to inform you that PDFBox 1.8.17
is the last release of the 1.8 branch, which has reached its end of life 
and will no longer be officially supported.


The current community mainly maintains the 2.0.x branch and the brand 
new 3.0.x branch. We recommend that everyone upgrade at least to the 2.0.x 
branch for the best experience; see the migration guides [1] and [2].


[1] https://pdfbox.apache.org/2.0/migration.html
[2] https://pdfbox.apache.org/3.0/migration.html


Thanks,
The Apache PDFBox Team


Re: [ANNOUNCE] Apache PDFBox 3.0.0 released

2023-08-19 Thread Andreas Lehmkühler


Hi,

@Erik thanks for the report but I guess there is a misunderstanding, see 
inline


On 18.08.23 at 11:32, Brangs, Erik wrote:

Hi,


-Original Message-
From: Andreas Lehmkühler [mailto:andr...@lehmi.de.INVALID]
Sent: Friday, 18 August 2023 07:42
To: users@pdfbox.apache.org
Subject: [ANNOUNCE] Apache PDFBox 3.0.0 released

The Apache PDFBox community is pleased to announce the release of Apache
PDFBox 3.0.0. It is available for download at:

https://pdfbox.apache.org/download.html

[...]

A migration guide is available at

https://pdfbox.apache.org/3.0/migration.html.

It is still a work in progress and we are happy to include any valuable
feedback from our community.


I was going to suggest to update the documentation to say that you can use the 
streamCache field of MemoryUsageSetting rather than using IOUtils. However, 
I've looked at the code of MemoryUsageSetting and I'm not actually sure if 
that's correct.

I think there's a bug in MemoryUsageSetting: The comment for streamCache says that it 
creates "an instance of ScratchFile using the current settings". However, the 
line

public final StreamCacheCreateFunction streamCache = () -> new ScratchFile(this);

This is a functional interface. No instance of ScratchFile is created 
when creating an instance of MemoryUsageSetting. It is created once the 
functional interface is used.



is executed at the start of the constructor of MemoryUsageSetting before the 
instance variables have been set. At least that's what the bytecode output from 
javap -c -p says:

   private org.apache.pdfbox.io.MemoryUsageSetting(boolean, boolean, long, long);
     Code:
        0: aload_0
        1: invokespecial #1   // Method java/lang/Object."<init>":()V
        4: aload_0
        5: aload_0
        6: invokedynamic #2,  0   // InvokeDynamic #0:create:(Lorg/apache/pdfbox/io/MemoryUsageSetting;)Lorg/apache/pdfbox/io/RandomAccessStreamCache$StreamCacheCreateFunction;
       11: putfield      #3   // Field streamCache:Lorg/apache/pdfbox/io/RandomAccessStreamCache$StreamCacheCreateFunction;
       14: iload_2

I can't read the byte code but I've double-checked the behaviour when 
debugging one of our test cases


org.apache.pdfbox.multipdf.PDFMergerUtilityTest.testJpegCcitt()

Fortunately everything is fine ;-)
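
The behaviour can also be reproduced with a small stand-alone sketch. Settings and Cache are hypothetical stand-ins for MemoryUsageSetting and ScratchFile, and java.util.function.Supplier stands in for StreamCacheCreateFunction. The field initializer runs before the constructor body, but it only stores the lambda; the lambda body reads the settings when it is invoked, by which time the constructor has finished.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a settings class with a factory field.
class Settings {
    // Assigned before the constructor body runs, but only the lambda is stored here.
    // No Cache is created and no setting is read until get() is called.
    final Supplier<Cache> cacheFactory = () -> new Cache(this);

    private final long maxMainMemoryBytes;

    Settings(long maxMainMemoryBytes) {
        this.maxMainMemoryBytes = maxMainMemoryBytes;
    }

    long getMaxMainMemoryBytes() {
        return maxMainMemoryBytes;
    }
}

// Hypothetical stand-in for the cache that copies a value from the settings.
class Cache {
    final long limit;

    Cache(Settings settings) {
        this.limit = settings.getMaxMainMemoryBytes();
    }
}
```

Calling new Settings(100L).cacheFactory.get() yields a Cache that sees the limit passed to the constructor, even though the field holding the lambda was assigned first.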

Andreas


I think the initialization of ScratchFile needs to happen at the end of the 
constructor if the settings are supposed to be used.







[ANNOUNCE] Apache PDFBox 3.0.0 released

2023-08-17 Thread Andreas Lehmkühler

The Apache PDFBox community is pleased to announce the release of Apache 
PDFBox 3.0.0. It is available for download at:


https://pdfbox.apache.org/download.html

The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is the new major release 3.0.0 of PDFBox. This release contains a 
lot of improvements, fixes and refactorings. The API is supposed to be 
stable.


A migration guide is available at

https://pdfbox.apache.org/3.0/migration.html.

It is still a work in progress and we are happy to include any valuable 
feedback from our community.


For more details on these changes and all the other fixes and 
improvements included in this release, please refer to the following 
issues on the PDFBox issue tracker at


https://issues.apache.org/jira/browse/PDFBOX.

The full release notes are available at:

https://www.apache.org/dist/pdfbox/3.0.0/RELEASE-NOTES.txt


The Apache PDFBox website can be found at:

https://pdfbox.apache.org/


Re: Border / Box around images and form elements with backgrounds

2023-08-06 Thread Andreas Lehmkühler


Please provide the source pdf you used for rendering as well.

Thanks in advance
Andreas

On 01.08.23 at 22:30, JJ Blodgett wrote:

It looks like the attachments were stripped out of the email. I'll try to 
include Google doc links and hope these work:

Example of bad behavior: 
https://drive.google.com/file/d/1ZU-vvZ1uTTDM0LTRhDJPwqVX5nY2dBL_/view?usp=drive_link

ARGB render image: 
https://drive.google.com/file/d/1ZwyZejehc6AdiQJHxdJ5QrsvfJbgSq9S/view?usp=drive_link
RGB render image: 
https://drive.google.com/file/d/1m7Ikf1G65HoGJSHt9PLt6TVgT5qMhpMa/view?usp=drive_link

ARGB output PDF: 
https://drive.google.com/file/d/1kb-SHEE8xS2PYTWrAgfYgmuKJMF6YUql/view?usp=drive_link
RGB output PDF: 
https://drive.google.com/file/d/1PpHVEsSGcUltKZY0Gi-Kk1kLIx9XPLIW/view?usp=drive_link



From: JJ Blodgett 
Sent: Tuesday, August 1, 2023 11:49 AM
To: users@pdfbox.apache.org 
Subject: Border / Box around images and form elements with backgrounds



We're working on converting large batches of text-based PDF documents into 
images and then back to PDF (partly to avoid font issues with certain print 
processes down the line). But we've come across an issue that's preventing us 
from moving forward.

Both with version 2.0.29 and 3.0.0, we can generate clean images with 
"PDFRenderer" and renderImageWithDPI() or similar methods. With RGB output, we 
get solid images but the size is larger than we'd like. So we try to use ARGB which 
creates a smaller / transparent background image except for 2 items we've found. Any form 
field with a transparent background and any embedded image have a non-transparent 
background. The images look clean and presumably are exactly what we need out of the 
render process.

But as soon as we try to convert the images back into a PDF by drawing the 
image to a blank document page, we end up with a border around all images and 
form fields that are non-transparent. I've included examples of both the raw 
images and the resulting PDF (as well as the source PDF). We've tried all kinds 
of things from render settings to draw settings and can't find a combination 
that changes this at all. We could address all of the form fields by removing 
backgrounds in our templates. However, we can't actually do anything to get rid 
of company logos or other images that need to appear in the documents.

Because we can't figure out how to get around this issue, we're unable to use 
ARGB and file sizes are too large to work with. If we can get ARGB to write to 
documents without the border, I think we can move forward. Any ideas on how or 
why this happens and whether there is a workaround or not?  If it matters, 
we're using Adobe Coldfusion to access java objects from a programming 
standpoint. But I'm pretty sure that's not a limiting factor. But I did notice 
that the built-in CF functions for working with PDF's do the same thing. So it 
may not have a workaround.

If there's another way to accomplish the same thing (ie end up with image-based 
pdf rather than text to avoid text interpretation issues), that would also be a 
possible solution. We can't embed fonts in the documents because the file sizes 
would then be too large to work with over the 1,000's of individual documents.





Re: TextToPDF function removes the first char since 2.0.28

2023-07-27 Thread Andreas Lehmkühler

I ran your shell script and got the same result: the first char is 
missing in the pdf.


It seems to be related to the way you are calling TextToPDF. You 
simply print the text to the console and redirect it to TextToPDF.


I changed that by echoing the text to a file and using that file as 
input for TextToPDF. Voilà, everything works fine.


PDFBOX-5554 added support for a charset parameter, and a leading UTF-8 
BOM is now removed automatically. I assume the latter is the issue here: 
TextToPDF reads the input twice, and somehow this doesn't work with a 
redirected input on Linux.
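
For illustration, here is a minimal sketch of how a leading UTF-8 BOM could be skipped in a single pass, so that a non-seekable input such as a pipe or a process substitution never has to be read twice. This is not the TextToPDF code; the class and method names (`BomSkipper`, `skipUtf8Bom`, `stripBom`) are made up for this example, and only JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class BomSkipper
{
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    private BomSkipper()
    {
    }

    // Peeks at up to three bytes and pushes them back unless they form a
    // UTF-8 BOM. The underlying stream is consumed only once.
    static InputStream skipUtf8Bom(InputStream in) throws IOException
    {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int count = 0;
        while (count < head.length)
        {
            int n = pushback.read(head, count, head.length - count);
            if (n < 0)
            {
                break; // input is shorter than a BOM
            }
            count += n;
        }
        boolean isBom = count == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!isBom && count > 0)
        {
            pushback.unread(head, 0, count); // not a BOM: give the bytes back
        }
        return pushback;
    }

    // Convenience wrapper for in-memory data: returns the text without a BOM.
    static String stripBom(byte[] data)
    {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data)))
        {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) >= 0)
            {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```

Until a fix is available, writing the text to a regular file first, as described above, avoids the problem on the command line.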


Andreas

On 25.07.23 at 08:10, michael.a...@universa.de wrote:

the question is, where does the char get lost, when creating the pdf or when 
extracting the text?


Sorry if I was not precise enough. The created pdf misses the first char. So 
the TextToPDF function has a problem.


Did you check the created pdf? Does it contain the whole text?


I tested/viewed it. The first char is missing.




Re: TextToPDF function removes the first char since 2.0.28

2023-07-25 Thread Andreas Lehmkühler


Hi,

the question is, where does the char get lost, when creating the pdf or 
when extracting the text?


Did you check the created pdf? Does it contain the whole text?

Andreas

On 25.07.23 at 07:52, michael.a...@universa.de wrote:

Hi,

the TextToPDF function worked without problems from 2.0.24 (the first version 
I used) to 2.0.27.
I use command-line only.

Here is a test:

#!/bin/bash

jar=/usr/share/java/pdfbox-app.jar # adjust

text_in='hello'

java -jar $jar TextToPDF test.pdf <(echo "$text_in") 2>/dev/null
text_out=$(java -jar $jar ExtractText test.pdf >(cat) 2>/dev/null)

echo -e "text_in : $text_in\ntext_out: $text_out"

if [ "$text_in" != "$text_out" ]; then
   echo 'uat failed'
   exit 1
fi

echo 'uat passed'

Kind regards
Michael





[ANNOUNCE] Apache PDFBox 3.0.0-beta1 released

2023-07-14 Thread Andreas Lehmkühler

The Apache PDFBox community is pleased to announce the release of the 
first beta release for Apache PDFBox 3.0.0. It is available for download at:


https://pdfbox.apache.org/download.html

The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is the first beta release candidate for the upcoming major release 
3.0.0 of PDFBox. This release contains a lot of improvements, fixes and 
refactorings. The API is supposed to be stable.


A migration guide is available at 
https://pdfbox.apache.org/3.0/migration.html. It is still a work in 
progress and we are happy to include any valuable feedback from our 
community.


For more details on these changes and all the other fixes and 
improvements included in this release, please refer to the following 
issues on the PDFBox issue tracker at 
https://issues.apache.org/jira/browse/PDFBOX.



The full release notes are available at:

https://www.apache.org/dist/pdfbox/3.0.0-beta1/RELEASE-NOTES.txt


The Apache PDFBox website can be found at:

https://pdfbox.apache.org/



[ANNOUNCE] Apache PDFBox 2.0.29 released

2023-07-01 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.29. The release is available for download at:

https://pdfbox.apache.org/download.html

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.29

Introduction


The Apache PDFBox library is an open source Java tool for working with 
PDF documents.


This is an incremental bugfix release based on the earlier 2.0.28 
release. It contains a couple of fixes and small improvements.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-4010] - A (rotated) barcode is missing from a pdf when printed
[PDFBOX-5587] - NullPointerException in PDTrueTypeFont.java getPath( )
[PDFBOX-5591] - Parsing of XMP metadata without optional xmpmeta element
[PDFBOX-5593] - Avoid division by 0 in shading function interpolation
[PDFBOX-5596] - MyPageDrawer#getPaint may produce UnsupportedOperationException

[PDFBOX-5601] - Barcode corrupted when printing document
[PDFBOX-5604] - The text in some fonts is lost when converting pdf to image
[PDFBOX-5606] - PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
[PDFBOX-5609] - all values in the signature dictionary shall be direct objects

[PDFBOX-5611] - Glyphs not rendered
[PDFBOX-5612] - PDF with mangled font rendering in some environments
[PDFBOX-5614] - RadioButtons disappear when printing PDF
[PDFBOX-5620] - BitsPerComponent 16 not allowed in PDF/A-1b
[PDFBOX-5621] - NullPointerException in PDFStreamEngine.showText
[PDFBOX-5624] - Infinte loop when parsing Type1 font

Improvement

[PDFBOX-5571] - Add duplex and tray parameters to PrintPDF
[PDFBOX-5598] - Create command line utility to extract XMP data
[PDFBOX-5605] - Improve Opaque PDFRenderer example

Task

[PDFBOX-4932] - Implement /RunLengthDecode encoder
[PDFBOX-5595] - Slight regression on corrupt bug tracker file
[PDFBOX-5625] - move and update bc from jdk15on to jdk15to18

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by a SHA512 checksum and a PGP signature
that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://www.apache.org/dist/pdfbox/KEYS.

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit https://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit https://www.apache.org/


Re: When will the next version from the 3.x line be available?

2023-06-27 Thread Andreas Lehmkühler


Hi,


On 27.06.23 at 15:10, Brangs, Erik wrote:

Hi,

version 2.0.28 of PDFBox was released recently. Will there also be a new 
version from the 3.x line in the near future?


First of all, there will be another 2.0 release, hopefully tomorrow.



Andreas Lehmkühler mentioned a possible beta1 release last month ( 
https://lists.apache.org/thread/0bgg6pd4d48qd49bxsdgvb9vsxr9r3v6 ).


Yes, due to some personal issues I had to postpone the 3.0 release. Once 
2.0.29 is out, I'm going to target the first beta of 3.0.



Andreas




We are interested in the 3.x line because of the reduced memory usage and 
because it is tested on newer Java versions.


Kind regards
Erik Brangs
*** Suchen. Finden. Entdecken. Deutsche Nationalbibliothek ***




Fwd: Apache in 2018 - By The Digits

2019-01-01 Thread Andreas Lehmkühler

Hi, 

Sally prepared some digits for 2018 and I was surprised to see one of our 
fellow PDFBox committers among the Top 5 committers as we are a small community 
compared to other ASF projects.

Thanks Tilman for your ongoing efforts to improve PDFBox in the last year, the 
time before that and hopefully in the future!!!

A happy new year to everyone

Cheers, Andreas 


 Original Message 
From: Sally Khudairi 
Sent: 1 January 2019 08:22:25 CET
To: Apache Announce List 
Subject: Apache in 2018 - By The Digits

[this announcement is available online at https://s.apache.org/Apache2018Digits 
]

It's been a great year for the Apache community at-large. With nearly 200M 
lines of code under the ASF's stewardship, our ongoing success is the result of 
community-led development "The Apache Way", executed through the collaborative 
efforts of more than 300 Apache projects and their communities. Highlights 
include:

Apache Projects —https://projects.apache.org/
- Total number of projects + sub-projects - 328 (not including Apache Labs 
initiatives)
- Top-Level Projects - 198
- Podlings in the Apache Incubator - 51
- Other groups, including operations/support - 62

Community/People —http://home.apache.org/
- Apache Committers - 7,032 (6,693 active)
- ASF Members (individuals) - 730
- New Members elected - 44


Apache Projects/Code —https://projects.apache.org/statistics.html

3,208 Apache Committers changed 78,493,228 lines of code over 201,220 commits. 
We also  welcomed 4,638 new code contributors and 15,861 new issue/pull request 
contributors. 

Top 5 Apache Code Committers 
- Andrea Cosentino (2,508 commits; 237,224 lines changed)
- Jean-Baptiste Onofré (2,098 commits; 1,208,851 lines changed)
- Duo Zhang (1,956 commits; 809,085 lines changed)
- Mark Thomas (1,823 commits; 179,883 lines changed)
 - Tilman Hausherr (1,736 commits; 81,940 lines changed)

Top 5 Apache Project Repositories by Commits
 - Hadoop
 - HBase
 - Beam
 - Camel
 - Flink

Top 5 Apache Project Repositories by Size (Lines of Code)
 - OpenOffice (7,822,699)
 - NetBeans (7,741,506)
 - Flex (whiteboard: 5,233,722; SDK 3,933,522)
 - Mynewt (documentation: 4,381,072)
 - Hadoop (3,881,797)

"If it didn't happen on-list, it didn't happen." —https://lists.apache.org/

 - Total number of mailing lists 1,131
 - 19,435 authors sent 1,497,005 emails on 505,793 topics

Top 5 most active Apache user@ mailing lists
 - Flink
 - Lucene
 - Ignite
 - Cassandra
 - Kafka

Top 5 most active Apache dev@ mailing lists
 - Beam
 - Ignite
 - Kafka
 - Tomcat
 - James

Contributor License Agreements and Software Grants 
—https://www.apache.org/licenses/

We welcomed an average of 387 new code contributors and 1,250 new people filing 
issues each month. Individuals who are granted write access to the Apache 
repositories must submit an Individual Contributor License Agreement (ICLA). 
Corporations that have assigned employees to work on Apache projects as part of 
an employment agreement may sign a Corporate CLA (CCLA) for contributing 
intellectual property via the corporation. Individuals or corporations donating 
a body of existing software or documentation to one of the Apache projects need 
to execute a formal Software Grant Agreement (SGA) with the ASF. 

 - ICLAs signed - 831
 - CCLAs signed - 35
 - Software Grants submitted - 25

Sponsorship and Individual Support 
—http://apache.org/foundation/contributing.html

Thank you to our hundreds of individual donors and Sponsors whose generous 
support helps offset the ASF's day-to-day operating expenses that include 
Infrastructure, Accounting, Fundraising, Marketing & Publicity, and more.

 - Platinum: Cloudera, Comcast, Facebook, Google, LeaseWeb, Microsoft, Oath, 
Pineapple Fund, and Tencent Cloud.

 - Gold: Anonymous, ARM, Bloomberg, Handshake, Hortonworks, Huawei, IBM, 
Indeed, Pivotal, and Union Investment.

 - Silver: Aetna, Alibaba Cloud Computing, Baidu, Budget Direct, Capital One, 
Cerner, Inspur, ODPi, Private Internet Access, Red Hat, and Target.

 - Bronze: Airport Rentals, Best VPN, The Blog Starter, Bookmakers, Cash Store, 
Casino Bonus, Casino2k, Cloudsoft, Emerio, Footprints Recruiting, 
HostChecka.com, HostingAdvice.com, HostPapa Web Hosting, The Linux Foundation, 
Mobile Slots, Mutuo Kredit AG, Online Holland Casino, RX-M, SCAMS.info, Site 
Builder Report, Talend, The Best VPN, Twitter, and Web Hosting Secret Revealed.

ASF Targeted Sponsors provide the Foundation with contributions for specific 
activities or programs.

 - Targeted Platinum: DLA Piper, Microsoft, Oath, OSU Open Source Labs, and 
Sonatype.

 - Targeted Gold: Atlassian, The CrytpoFund, Datadog, PhoenixNAP, and Quenda.

 - Targeted Silver: Amazon Web Services, HotWax Systems, and Rackspace.

 - Targeted Bronze: Bintray, Education Networks of America, Google, Hopsie, 
No-IP, PagerDuty, Peregrine Computer Consultants Corporation, Sonic.net, 
SURFnet, and Virtru.


Together, our Members, Committers, contributors,

Re: Regarding retrieving COSName.getPDFName(PreflightConstants.DICTIONARY_KEY_LINEARIZED

2017-07-25 Thread Andreas Lehmkühler

> karthick g wrote on 25 July 2017 at 10:34:
> 
> 
> Hi team,
> 
> Based on the analysis I have found one thing regarding Linearized PDF in
> 2.0 and above versions of PDFBox.
> 
> COSDocument cDoc = pdDoc.getDocument();
> List lObj = cDoc.getObjects();
> for (COSObject object : lObj)
> {
> System.out.println(object.getObjectNumber());
>}
> 
> Based on the code am retrieving  cosobject numbers of PDFDocument
> which prints COSObjects sequentially...
> PDF 1.8.2 and 2.0.6 works same  except the fact that COSObject pointing to
> Linearized dictionary is not added.
> 
> 748 0 obj < 824]>> endobj
> 
> The 748, 0 which is present in 1.8.2 is not present in 2.0.6. Is the
> finding is correct and can you guide me to fix it.
There is no fix as it isn't a real problem.

The standard procedure to read a pdf is to start at the end of the document and 
to read the trailer information including the xref table. The linearized 
dictionary is optional and not needed if a parser follows the default procedure.
2.0.x follows the default way which omits the optional linearized dictionary. 
All older versions are using some kind of brute force search (the sequential 
parser only), which reads all objects from the beginning to the end which 
includes the linearized dictionary.

The dictionary won't be read as long as PDFBox 2.0.x doesn't support 
"linearized parsing". AFAIK none of the devs is working on that feature as we 
simply don't need it. But we are happy to accept patches to add support for it.
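
To illustrate the "start at the end" procedure described above, here is a minimal sketch that looks for the last `startxref` keyword in the tail of a file and returns the offset that follows it. It is not PDFBox's parser; the class name `StartXrefFinder` and the 1024-byte search window are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefFinder
{
    // Assumption: the keyword sits within the last 1024 bytes of the file.
    private static final int TAIL_SIZE = 1024;
    private static final String KEYWORD = "startxref";

    private StartXrefFinder()
    {
    }

    // Returns the byte offset given after the last "startxref", or -1 if none is found.
    static long findStartXref(byte[] pdf)
    {
        int from = Math.max(0, pdf.length - TAIL_SIZE);
        // ISO-8859-1 maps every byte to exactly one char, so indexes stay aligned
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);
        int keyword = tail.lastIndexOf(KEYWORD);
        if (keyword < 0)
        {
            return -1;
        }
        int pos = keyword + KEYWORD.length();
        while (pos < tail.length() && Character.isWhitespace(tail.charAt(pos)))
        {
            pos++; // skip the line break after the keyword
        }
        int start = pos;
        while (pos < tail.length() && tail.charAt(pos) >= '0' && tail.charAt(pos) <= '9')
        {
            pos++; // collect the decimal offset
        }
        return pos > start ? Long.parseLong(tail.substring(start, pos)) : -1;
    }
}
```

A parser that works this way follows the xref table to the objects it needs and never has to touch the linearization dictionary at the start of the file, which matches the behaviour observed with 2.0.x.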

Andreas

> If it is fixed I can able to retrieve Linearized dictionary without going
> for preflight jar,
> 
> PDFBox 1.8.2
> ===
> COSObject{1, 0}
> -
> --
> ---
> COSObject{747, 0}
> COSObject{748, 0}
> COSObject{749, 0}
> -
> 
> PDFBox 2.0.6
> ===
> COSObject{1, 0}
> -
> --
> ---
> COSObject{747, 0}
> COSObject{749, 0}
> -
> 
> Regard,
> Karthick G
> 
> On Thu, Jul 13, 2017 at 12:02 PM, karthick g 
> wrote:
> 
> > Hi Team,
> >
> > In our project we want to take the Linearised dictionary. Before these 2.0
> > versions,
> > We can able to get that dictionary by normal workarounds that without
> > loading preflight document. Now after 2.0 versions we have to load the
> > preflight document to get the linearized property. Which resulting in
> > additional work around and which cost the project performance. Will their
> > be a workaround in next release, Such that linearized property can be
> > retrieved without loading Preflight document.
> >
> > Regards,
> > Karthick G
> >


Re: PDFBox JPEG2000 and Tomcat

2017-07-25 Thread Andreas Lehmkühler


> Chris Gamache wrote on 25 July 2017 at 03:10:
> 
> 
> I also recall one thread on SO where the developer had kept the scope on the 
> imageio jars set to `test` as it is in PDFbox's pom. I wish it were a 
> contributing factor here because it is an easy fix.
> 
> What do you know about SPI? Can I prophylactically re-add the SPI for 
> JPEG2000 in a safe way? I don't think the visibility of that registry is 
> available way way up the call stack. Maybe there's a way I haven't found?
> 
According to [1], java.util.ServiceLoader is the class you are looking for.

Andreas
[1] https://docs.oracle.com/javase/tutorial/ext/basics/spi.html
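
A minimal sketch of how the available ImageIO readers could be inspected, using only JDK classes. The class name `ImageReaderCheck` is made up for this example, and the format name "JPEG2000" is an assumption about what a JPEG 2000 plug-in registers. Whether a rescan actually helps inside a Tomcat class loader setup has not been verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.imageio.ImageIO;
import javax.imageio.spi.ImageReaderSpi;

// Hypothetical helper, not part of PDFBox.
final class ImageReaderCheck
{
    private ImageReaderCheck()
    {
    }

    // Asks ImageIO to rescan the class path for plug-ins, then checks
    // whether any registered reader claims the given format name.
    static boolean hasReaderFor(String formatName)
    {
        ImageIO.scanForPlugins();
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    // Lists the reader providers that ServiceLoader can discover through
    // META-INF/services entries visible to the current class loader.
    static List<String> discoverableReaderSpis()
    {
        List<String> names = new ArrayList<>();
        for (ImageReaderSpi spi : ServiceLoader.load(ImageReaderSpi.class))
        {
            names.add(spi.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args)
    {
        System.out.println("JPEG2000 reader available: " + hasReaderFor("JPEG2000"));
        System.out.println("Discoverable reader SPIs: " + discoverableReaderSpis());
    }
}
```

Running this from within the web application (rather than from a standalone JVM) shows what the application's own class loader can see, which is the relevant view for the problem discussed here.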

> 
> > On Jul 24, 2017, at 3:46 PM, Tilman Hausherr  wrote:
> > 
> > http://markmail.org/ offers a search engine for the user mailing list, but 
> > I haven't been able to find it either. One person had a problem but the 
> > cause was a bad pom file. The one you posted didn't have that problem. 
> > Maybe my memory was from a stackoverflow question.
> > 
> > Tilman
> > 
> > 
>


Re: AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

2017-07-14 Thread Andreas Lehmkühler

You are looking in the wrong place. pdfbox-app is just a meta project to create 
a convenience binary of all relevant subprojects. It doesn't contain any source 
code.

The source code you are looking for is here:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/2.0.7-SNAPSHOT/

Andreas

> d.ham...@aurenz.de wrote on 14 July 2017 at 11:05:
> 
> 
> Hi,
> 
> I'm talking about the snapshot versions provided here:
> 
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
> 
> Can you tell me where to download jars containing source files? The source 
> jars there just contain the META-INF directory but nothing else.
> 
> Thank you!
> 
> -Original Message-
> From: Gilad Denneboom [mailto:gilad.denneb...@gmail.com] 
> Sent: Friday, 14 July 2017 11:03
> To: users@pdfbox.apache.org
> Subject: Re: Splitter.createNewDocument() always uses main memory only - this 
> leads to out of memory when splitting large documents
> 
> You don't need a decompiler... PDFBox is an open-source library. All the code 
> is available online.
> 
> On Fri, Jul 14, 2017 at 10:39 AM,  wrote:
> 
> > Hi Tilman,
> >
> > I used a decompiler to have a look at the sources.
> >
> > Perhaps it would be a good idea to set Splitter() deprecated
> >
> > @Deprecated
> > public Splitter() {}
> >
> > public Splitter(MemoryUsageSetting memoryUsageSetting) {
> > this.memoryUsageSetting = memoryUsageSetting;
> > }
> >
> >
> > to point people to the improvement before they fall into the out of 
> > memory hole themselves.
> >
> >
> > Please add a program argument to PDFSplit.split() like so:
> >
> >if (args[i].equals("-memory")) {
> > if (++i >= args.length) {
> > PDFSplit.usage();
> > }
> > if (args[i].equals("tempFile")) {
> >   memoryUsageSetting = .
> > } else if (args[i].equals("mainMemory")) {
> >   memoryUsageSetting = .
> > } else if (args[i].equals("mixed")) {
> >   memoryUsageSetting = .
> > } else {
> >   PDFSplit.usage();
> > }
> > continue;
> > }
> >
> > Perhaps it would be a good idea to even make "maxMainMemoryBytes" and 
> > "maxStorageBytes" configurable, too.
> >
> > Thanks a lot - I really appreciate your great work and support!
> >
> > Cheers,
> >
> > Daniel
> >
> >
> > -Original Message-
> > From: Tilman Hausherr [mailto:thaush...@t-online.de]
> > Sent: Thursday, 13 July 2017 21:21
> > To: users@pdfbox.apache.org
> > Subject: Re: Splitter.createNewDocument() always uses main memory only 
> > - this leads to out of memory when splitting large documents
> >
> > See
> > https://issues.apache.org/jira/browse/PDFBOX-3869
> >
> > and try a snapshot from
> > https://repository.apache.org/content/groups/snapshots/org/
> > apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
> > (at the bottom)
> >
> > Please give feedback whether this is what you wanted. Please do it 
> > quickly because a new version will be built on monday so either I'd 
> > have to revert before or we'll be stuck with this API.
> >
> > Re: a global configuration - maybe at a later time. I'm not THAT 
> > convinced that it is needed.
> >
> > Tilman
> >
> >
> > On 13.07.2017 at 09:20, d.ham...@aurenz.de wrote:
> > > Hi dear contributors to pdfbox,
> > >
> > > I just would like to report that Splitter.createNewDocument() should 
> > > be
> > able to consider different MemoryUsageSetting configurations.
> > >
> > > In version 2.0.6 this method is implemented as
> > >
> > >
> > > protected PDDocument createNewDocument() throws IOException
> > >  {
> > >  PDDocument document = new PDDocument();
> > >  document.getDocument().setVersion(getSourceDocument()
> > .getVersion());
> > >  document.setDocumentInformation(getSourceDocument().
> > getDocumentInformation());
> > >  document.getDocumentCatalog().setViewerPreferences(
> > >  getSourceDocument().getDocumentCatalog().
> > getViewerPreferences());
> > >  return document;
> > >  }
> > >
> > >
> > >
> > > I would suggest to introduce a member variable "MemoryUsageSetting
> > memSetting" that can be set for each instance of "Splitter".
> > >
> > > This way createNewDocument() could be implemented as
> > >
> > >
> > > protected PDDocument createNewDocument() throws IOException
> > >  {
> > >  PDDocument document = new PDDocument(this. memSetting);
> > >  document.getDocument().setVersion(getSourceDocument()
> > .getVersion());
> > >  document.setDocumentInformation(getSourceDocument().
> > getDocumentInformation());
> > >

Re: UTF16 encoded string to PDFDocEncoding

2017-07-11 Thread Andreas Lehmkühler


> Andreas Lehmkühler <andr...@lehmi.de> wrote on 11 July 2017 at 12:17:
> 
> 
> 
> > Andrea Vacondio <andrea.vacon...@gmail.com> wrote on 10 July 2017 at 19:22:
> > 
> > 
> > Hi, we came across this case where we are basically cloning outline items
> > where the original outline title is a UTF16BE encoded text string
> > containing the value 00A0 (non break space). We later use the string to
> > assign the title in a new outline item and the A0 is recognised as a € sign.
> > Here is a simple test:
> > 
> > COSString victim = COSString
> > .parseHex("FEFF004300680061007000740065007200A0");
> > PDOutlineItem node = new PDOutlineItem();
> > node.setTitle(victim.getString());
> > 
> > If you look at the node dictionary you'll see that the title value is
> > Chapter€
> How do you look at the dictionary?
> 
> The following code:
> COSString victim = COSString.parseHex( "FEFF004300680061007000740065007200A0" 
> );
>   System.out.println( victim.toHexString() );
>   System.out.println( victim.getString() );
Oops, something was missing.

The output looks good to me:
FEFF004300680061007000740065007200A0
Chapter 
Note that the second line ends with a non-breaking space (U+00A0).
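
The decoding step can be reproduced without PDFBox. The following is a minimal sketch using only JDK classes; the class name `Utf16TitleDemo` is made up for this example, and it handles only strings that start with the UTF-16BE byte order mark FE FF. Note that in PDFDocEncoding the single byte 0xA0 stands for the Euro sign, which may be where the € in the report comes from if the title is later written as a one-byte string.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class Utf16TitleDemo
{
    private Utf16TitleDemo()
    {
    }

    // Converts a hex string such as "FEFF0043" into raw bytes.
    static byte[] parseHex(String hex)
    {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++)
        {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    // Decodes a text string that starts with the UTF-16BE BOM (FE FF).
    static String decode(String hex)
    {
        byte[] bytes = parseHex(hex);
        boolean hasBom = bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
        if (!hasBom)
        {
            throw new IllegalArgumentException("only UTF-16BE strings with a BOM are handled here");
        }
        // skip the two BOM bytes and decode the remainder
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args)
    {
        String title = decode("FEFF004300680061007000740065007200A0");
        // print the code point of the last character instead of the glyph
        System.out.println(title.length() + " chars, last is U+"
                + String.format("%04X", (int) title.charAt(title.length() - 1)));
    }
}
```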


Andreas


Re: catch(IOException | COSVisitorException e)

2017-06-26 Thread Andreas Lehmkühler


> Steve Carr wrote on 26 June 2017 at 11:41:
> 
> 
>   import java.io.IOException;
> import org.apache.pdfbox.exceptions.COSVisitorException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> 
> /**
>  *
>  * @author Azeem
>  * @Email az...@radixcode.com
>  */
> 
> 
> When I compile the following code in netbeans I get
>  Uncompilable source code - package org.apache.pdfbox.exceptions does not 
> exist in relation to catch(IOException | COSVisitorException e)
> 
> I downloaded pdfbox-1.6.0-src.zip
Please, update to a more recent version like 2.0.6, yours is quite ancient.

Andreas

> help
> steve
> public class Main {
> 
> 
> public static void main(String[] args) {
> 
> System.out.println("Create Simple PDF file with blank Page");
> 
> String fileName = "EmptyPdf.pdf"; // name of our file
> 
> try{
> 
> PDDocument doc = new PDDocument(); // creating instance of pdfDoc
> 
> doc.addPage(new PDPage()); // adding page in pdf doc file
> 
> doc.save(fileName); // saving as pdf file with name perm 
> 
> doc.close(); // cleaning memory 
> 
> System.out.println("your file created in : "+ 
> System.getProperty("user.dir"));
> 
> 
> }
> catch(IOException | COSVisitorException e){
> System.out.println(e.getMessage());
> }
> 
> }
> 
> }


Re: PDPageContentStream#close() vs PDDocument#close()

2017-06-22 Thread Andreas Lehmkühler


> Thad Humphries wrote on 21 June 2017 at 23:30:
> 
> 
> Is it necessary to call PDDocument#close() after calling
> PDPageContentStream#close()? Does the answer apply all cases or only
> certain cases? If the latter, what certain cases?
> 
> For example, in the following code snippet:
> 
> PDDocument  document = new PDDocument();
> PDPageContentStream cos = new PDPageContentStream(document, 0);
> cos.drawImage(... , etc.
> cos.close();
> document.close();
> 
> 
> Is the last line, `document.close()`, necessary, or has that been handled
> sufficiently by the `cos.close()` immediately before it?
You have to close both; close the PDDocument last.

> Finally, I'm assuming that it is safe to call #close() a second (or third?)
> time on a PDDocument or PDPageContentStream. Is that correct? My use case
> would be in the finally block where an exception might have left PDDocument
> or PDPageContentStream open.
It shouldn't be a problem to close PDDocument several times. I'm not sure about 
PDPageContentStream.
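
One way to get this ordering without a finally block is try-with-resources, which closes resources in the reverse order of their declaration. The sketch below uses a stand-in class (`Resource`, made up for this example) instead of the real PDFBox types so that it runs with the JDK alone. In PDFBox 2.x both PDDocument and PDPageContentStream implement Closeable, so the same pattern should apply to them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; Resource stands in for PDDocument and PDPageContentStream.
final class CloseOrderDemo
{
    private CloseOrderDemo()
    {
    }

    static final class Resource implements AutoCloseable
    {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log)
        {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close()
        {
            log.add("close " + name); // record the order in which close() is called
        }
    }

    // Opens the "document" first and the "content stream" second;
    // try-with-resources closes them in the opposite order.
    static List<String> run()
    {
        List<String> log = new ArrayList<>();
        try (Resource document = new Resource("document", log);
             Resource contentStream = new Resource("contentStream", log))
        {
            log.add("draw");
        }
        return log;
    }

    public static void main(String[] args)
    {
        System.out.println(run());
    }
}
```

Both resources are closed even if the body throws, and each close() is called exactly once, so the question of repeated close() calls does not arise.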

Andreas
> 
> -- 
> "Hell hath no limits, nor is circumscrib'd In one self-place; but where we
> are is hell, And where hell is, there must we ever be" --Christopher
> Marlowe, *Doctor Faustus* (v. 121-24)


Re: Help identifying hair-lines in PDFs using PDFBox and tabula

2017-05-23 Thread Andreas Lehmkühler

> Gilad Denneboom wrote on 22 May 2017 at 22:07:
> 
> 
> Hi all,
> 
> So I'm trying to identify hair-lines in my PDFs. I came across tabula,
> which seems to be able to do it, but I can't get it to quite work with my
> files in the way I need it to, so any help is greatly appreciated!
> 
> Here's what I've been doing so far: I used the Ruling object from tabula to
> extract both the horizontal and vertical rules from a stripped version of
> the PDF page (ie, after removing all the text in it).
> I'm getting results but now I want to relate them back to the original PDF
> page, and that's proving difficult. If I add a text field using the
> coordinates of the Ruling objects they are way off then where I would
> expect them to be. I think it has to do with the DPI setting used to
> convert the PDF page to an image, which is necessary for the rulings
> extraction.
> So my question is: How can I take these Ruling objects and convert them
> back to the original coordinates of the PDF?
> I would also like to be able to only identify lines of a certain width and
> height, but if I get the rectangles to work correctly I think I can do that
> in post-processing.
Sounds like a question for the tabulapdf community ...

Andreas
> 
> Thanks in advance!
> Gilad

Re: Linearized dictionary

2017-05-22 Thread Andreas Lehmkühler

> karthick g wrote on 22 May 2017 at 06:17:
> 
> 
> Hi team,
> 
> Here is the code; I am using COSName.getPDFName("Linearized"). The problem
> is described below the code.
> 
> PDDocument pdDoc = PDDocument.load(new File(""));
> COSDocument cosDoc = pdDoc.getDocument();
> List lObj = cosDoc.getObjects();
> for (Object object : lObj) {
>     COSBase curObj = ((COSObject) object).getObject();
>     if (curObj instanceof COSDictionary) {
>         COSDictionary cOSDictionary = (COSDictionary) curObj;
>         if (cOSDictionary.keySet().contains(COSName.getPDFName("Linearized"))) {
>             //System.out.println("Linearized");
>         }
>     }
> }
> 
> While using 1.8.2 Linearized is working properly. But in 2.0.5 I can not
> get the linearized and I can't check the linearized as it is not in the
> dictionary keyset. Please let me know if you need more details.
I can confirm the behaviour. The object is read but not dereferenced as it 
isn't needed. Consequently that dictionary isn't part of the object pool.
I have no solution yet.
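
Until then, one possible workaround outside of PDFBox is to look at the raw
file: according to the PDF specification (annex on linearized PDF), the
linearization parameter dictionary has to sit within the first 1024 bytes of
the file. A minimal sketch, assuming a plain byte search for the /Linearized
key is good enough for your use case. The class and method names are made up
for illustration, and the check does not verify that the file is still validly
linearized (e.g. after an incremental update).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LinearizedCheck {
    /** Returns true if "/Linearized" occurs within the first 1024 bytes. */
    public static boolean looksLinearized(Path pdf) throws IOException {
        byte[] head = new byte[1024];
        int total = 0;
        try (InputStream in = Files.newInputStream(pdf)) {
            int n;
            // read() may return fewer bytes than requested, so loop
            while (total < head.length
                    && (n = in.read(head, total, head.length - total)) > 0) {
                total += n;
            }
        }
        // ISO-8859-1 maps every byte to one char, so binary data is harmless
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1);
        return text.contains("/Linearized");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(looksLinearized(Paths.get(args[0])));
    }
}
```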

Andreas
> 
> 
> 
> 
> Regards,
> Karthick G
> 
> On Fri, May 19, 2017 at 9:27 AM, karthick g  wrote:
> 
> > Hi,
> > * I need to Check whether my PDF file is Linearized or not, for fast view
> > web. *
> > In the previous version (1.8.2) of PDFBox Linearized is in the COSName. I
> > will get the COSDictionary and check whether Linearized is available in the
> > COSName and conclude the PDF is suited for fast web view. Now Linearized
> > keyword is not in
> > the List of COSName. How can I get the Linearized dictionary in PDFBox.
> > Please let me know if you need more details.
> >
> > Regards,
> > Karthick G
> >
> >
> >
> > On Thu, May 18, 2017 at 9:17 AM, karthick g 
> > wrote:
> >
> >> Hi team,
> >>
> >> I am a long time user of PDFBox. We started to migrate PDFBox from 1.8.2
> >> to 2.0.5.
> >> During migration I found that the Linearized dictionary moved to the preflight
> >> jar.
> >> I created the PDDocument based on the preflight context, which is returning
> >> null.
> >> Since the PDDocument is null I can't proceed further. What is the right
> >> way to
> >> get the Linearized dictionary in the current version of PDFBox? Please guide
> >> me.
> >> Please let me know if you need more details.
> >>
> >> Regards,
> >> Karthick G
> >>
> >>
> >>
> >>
> >

RE: creating fillable forms, possibly in/from existing PDF file?

2017-05-18 Thread Andreas Lehmkühler


> Gary Grosso wrote on 18 May 2017 at 05:09:
> 
> 
> Thanks for your reply, Tilman.
> 
> I see PDFBox allows for text field/area (single or multi-line), list box, 
> combo box, check box, push button, and radio button.
> 
> Would it be reasonable to say that implementing a date picker in an acroform 
> should be possible, but would require JavaScript (e.g., using 
> PDFormFieldAdditionalActions class)?
> 
> By the way, it seems it is no longer possible to search the archives. For 
> example, this:
> 
> http://www.mail-archive.com/search?a=1=pdfbox-users%40incubator.apache.org=PDDocument=16=9===1y=2016-05-15==relevance
> 
> results in:
> 
> "No matches were found for PDDocument date:[20150516 TO 20170515]"
You are using the wrong ml-address, replace "pdfbox-us...@incubator.apache.org" 
with "users@pdfbox.apache.org". The former one was deprecated in 2009 when 
pdfbox graduated to a top level project.

Andreas
> 
> which I know not to be true.
> 
> Am I searching incorrectly or in the incorrect location?
> 
> Thanks,
> Gary
> 
> -Original Message-
> From: Tilman Hausherr [mailto:thaush...@t-online.de] 
> Sent: Wednesday, May 17, 2017 1:15 PM
> To: users@pdfbox.apache.org
> Subject: Re: creating fillable forms, possibly in/from existing PDF file?
> 
> On 17.05.2017 at 18:34, Gary Grosso wrote:
> > Hi,
> >
> > I am exploring a requirement to generate PDF fillable forms.
> >
> > A major decision is whether to start with our current PDF (created with a 
> > derivative of PStill), or build it with PDFBox from scratch. In the past I 
> > added bookmarks to our existing PDF files using PDFBox.
> >
> > I see in the examples 
> > (https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/interactive/)
> >  that a PDAcroForm instance is being added to a newly created empty 
> > PDDocument.
> >
> > Is it worth considering modifying an existing PDF document to add fillable 
> > form fields? Or should I not waste time with that approach and plan to 
> > create a new document from scratch?
> 
> IMHO it doesn't make much difference because acroform is separate from the 
> rest. Creating fields with PDFBox is always tricky.
> 
> Tilman
> 

Re: OTFParser how to

2017-04-24 Thread Andreas Lehmkühler


> clifford wrote on 19 April 2017 at 18:12:
> 
> 
> When doing..
> java.io.FileInputStream fis = new java.io.FileInputStream(file1);
> OTFParser p = new OTFParser();
> OpenTypeFont otf = p.parse(fis);
> 
> and otf.isPostScript() is true, how do I embed the font? Using
> 
> PDType0Font.load(doc, otf, true); causes an error later on:
> "java.lang.UnsupportedOperationException: OTF fonts do not have a glyf
> table", which is true as it does not have a glyf table.
PDFBox doesn't support embedding such otf fonts.

Andreas
> 
> 
> -- 
> 
> *Kind regards*
> 
> *Clifford Dann
> Paprika*
> 
> 
> T +44 (0)1732 811603
> www.paprika-software.com 
> Latters House, High Street, Hadlow, Tonbridge, Kent, TN11 0EF, United 
> Kingdom
> 
> Agency Software Worldwide Ltd.Registered in England and Wales 01665695
>

Re: converting hex to PDColor

2017-03-13 Thread Andreas Lehmkühler


> chitgoks wrote on 13 March 2017 at 11:27:
> 
> 
> hi again
> 
> a little assistance regarding converting hex to PDColor.
> 
> please take this example #ff8000
> 
> and this is my code
> 
> String colorStr = "#ff8000";
> java.awt.Color rgb = new java.awt.Color(
> Integer.valueOf(colorStr.substring(1, 3), 16),
> Integer.valueOf(colorStr.substring(3, 5), 16),
> Integer.valueOf(colorStr.substring(5, 7), 16))
> 
> PDColor pdcolor = new PDColor(new float[] { rgb.getRed() / 255,
> rgb.getGreen() / 255, rgb.getBlue() / 255}, PDDeviceRGB.INSTANCE);
You can omit the PDColor step as those int values are already the values you 
are looking for
red = Integer.valueOf(colorStr.substring(1, 3), 16) and so on.

> 
> the result is pink-ish (the wrong color), instead of orange-ish (the
> correct color).
Where do you see that wrong color? In the resulting PDF? If the latter, please 
share the doc with us.
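
One thing worth checking in the quoted snippet: `rgb.getRed() / 255` divides
two ints, so each component is truncated to 0 or 1 before it is converted to
float. A minimal sketch of the conversion with float division, using only the
JDK. The class and method names are made up for illustration; the resulting
array is what would be handed to the PDColor constructor together with
PDDeviceRGB.INSTANCE, as in the quoted code.

```java
final class HexColor {
    /** Converts "#rrggbb" into RGB components in the range 0..1. */
    public static float[] toComponents(String hex) {
        int r = Integer.parseInt(hex.substring(1, 3), 16);
        int g = Integer.parseInt(hex.substring(3, 5), 16);
        int b = Integer.parseInt(hex.substring(5, 7), 16);
        // 255f forces float division; r / 255 would be integer division
        return new float[] { r / 255f, g / 255f, b / 255f };
    }

    public static void main(String[] args) {
        float[] c = toComponents("#ff8000");
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```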

BR
Andreas

[ANNOUNCE] Apache PDFBox 2.0.2 released

2016-06-09 Thread Andreas Lehmkühler


The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.2. The release is available for download at:

http://pdfbox.apache.org/download.cgi

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.2

Introduction


The Apache PDFBox library is an open source Java tool for working with PDF 
documents.


This is an incremental bugfix release based on the earlier 2.0.1 release. It
contains a couple of fixes and small improvements.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-3267] - Using threads results in different images
[PDFBOX-3326] - Issue in RenderingMode.isStroke method
[PDFBOX-3327] - IndexOutOfBoundsException when retrieving kerning information
[PDFBOX-3332] - Apache PDFBox Form Fill TrueType text spacing issue
[PDFBOX-] - Wrong appearance generation for rotated AcroForms fields
[PDFBOX-3336] - several errors in the incremental save
[PDFBOX-3338] - CCITT Fax decoder fails
[PDFBOX-3341] - currentAccessPermission.setReadOnly() not set in 
StandardSecurityHandler
[PDFBOX-3346] - Create example with empty signature
[PDFBOX-3347] - COSName parsing doesn't handle ISO-8859-1 encoded bytes
[PDFBOX-3348] - NPE in Type1Parser.parseBinary
[PDFBOX-3351] - NPE when drawing annotation with empty border color array
[PDFBOX-3354] - PDCIDFont.getAverageFontWidth always returns 0
[PDFBOX-3355] - PDPageLabels.getLabelsByPageIndices() returns Uppercase letters 
for style a
[PDFBOX-3360] - java.lang.IllegalArgumentException: dash lengths all zero
[PDFBOX-3362] - PageLayout.TwoColumnRight was Illegal
[PDFBOX-3363] - Leftover file in temp directory when signing
[PDFBOX-3368] - ContainsKey don't work for the Map returned by 
PDStructureTreeRoot.getRoleMap
[PDFBOX-3369] - Error expected floating point number actual='0.00-35095424'

Improvement

[PDFBOX-3089] - Investigate why glyph path caching does not always cache glyph 
accesses
[PDFBOX-3316] - Add comment to PDF
[PDFBOX-3329] - Create PDFMergerUtility example with improved metadata handling
[PDFBOX-3342] - Add example to jump to a local page to AddAnnotations
[PDFBOX-3352] - Calendar values are parsed with unknown timezones
[PDFBOX-3364] - PDModel.getSignatureFields() only returns top level signature 
fields


Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://svn.apache.org/repos/asf/pdfbox/KEYS.

About Apache PDFBox
---

Apache PDFBox is an open source Java library for working with PDF documents.
This project allows creation of new PDF documents, manipulation of existing
documents and the ability to extract content from documents. Apache PDFBox
also includes several command line utilities. Apache PDFBox is published
under the Apache License, Version 2.0.

For more information, visit http://pdfbox.apache.org/

About The Apache Software Foundation


Established in 1999, The Apache Software Foundation provides organizational,
legal, and financial support for more than 100 freely-available,
collaboratively-developed Open Source projects. The pragmatic Apache License
enables individual and commercial users to easily deploy Apache software;
the Foundation's intellectual property framework limits the legal exposure
of its 2,500+ contributors.

For more information, visit http://www.apache.org/


Re: Extracting ".pam" image files

2016-06-07 Thread Andreas Lehmkühler

> "OYEBISI, Daniel" wrote on 7 June 2016 at 10:41:
> 
> 
> Hello,
> 
> I have a PDF document containing images of the format type ".pam".  I have
> checked the API doc but I haven't seen anything related to ".pam" files.
> Please can anyone guide me on how to do extract the ".pam" image files from my
> document?
> 
Where do you get that type information from? Is it possible to get hold of the
PDF in question?

> Thanks in advance

BR
Andreas

Re: Numbers get reversed sometimes during conversion

2016-06-02 Thread Andreas Lehmkühler

> Shyam Sundar wrote on 2 June 2016 at 09:18:
> 
> 
> Hi,
> 
> Wondering if you got a chance to check this ...

The first thing to do in such cases is the "Adobe Reader test". It fails: the
text can't be extracted using Acrobat Reader, so we are better ;-) Anyway,
mixed LTR/RTL text is always tricky to handle, see PDFBOX-2252.

No solution so far, if there is any at all.

BR
Andreas
> 
> Thanks.
> 
> On Wed, Jun 1, 2016 at 1:48 AM, Shyam Sundar  wrote:
> 
> > Hi Andreas,
> >
> > I have just uploaded the files at below location -
> >
> >
> > https://ftp.emc.com/action/login?domain=ftp.emc.com=7rdPxvIJU=mKymXA2KyB
> >
> > I tried both, but whether I sort or not doesn't make any difference in the
> > output.
> >
> > Thanks,
> > Shyam
> >

Re: Numbers get reversed sometimes during conversion

2016-05-31 Thread Andreas Lehmkühler

Hi,

> Shyam Sundar wrote on 31 May 2016 at 12:00:
> 
> 
> Hi,
> 
> I have come across an issue wherein while trying to convert PDFs (mainly of
> RTL languages) into TXT, the numbers get reversed.
> 
> Please check the attached file, '2005' in heading has become '5002'.
The file didn't make it due to some restrictions.

> It happens with the latest version too. Is this a bug ?
Do you use the sorting option?

BR
Andreas
> This is a PDF/A format by the way. Hope it is fully supported.
> 
> Thanks in advance.
> 

Re: PDFBox*.tmp files not deleted by PDFParser

2016-05-25 Thread Andreas Lehmkühler

> Damien Butaye wrote on 25 May 2016 at 10:01:
> 
> 
> Hello Tilman,
> 
>  Yes I did it. I verified in debug mode and this method (close() on
> SignatureOption) is well reached but the close() method of the object
> "RandomAccessBufferedFileInputStream" is never called, so the tmp file is
> never deleted.
The RandomAccessBufferedFileInputStream is used for the whole document and for
the SignatureOptions. You already confirmed that you close the latter one. What
about the document itself, did you close it as well? 

> The patch [PDFBOX-2723] adds a line in the method parseXrefObjStream of the
> COSParser but this method is not called. It seems this method is called
> only in case of "xref stream" (line 286 COSParser) , why not in other case
> -> xref (line 218 COSParser)?
Neither an xref stream nor an xref table uses its own
RandomAccessBufferedFileInputStream, so this question doesn't seem to be
related to your problem.

BR
Andreas

> Damien.
> 
> 2016-05-24 20:10 GMT+02:00 Tilman Hausherr :
> 
> > did you close the options object?
> >
> > Tilman
> >
> >
> > On 24.05.2016 at 15:51, Damien Butaye wrote:
> >
> >> Dear all,
> >>
> >>I'm trying to add a signature to a PDF using PDFBOX 2.0.1. During the
> >> process, a tmp file (e.g: tmpPDFBoxXXX.pdf) is stored inside the /tmp
> >> directory (RedHat server). This file is not deleted after completion.
> >> After some checks, it seems that the object responsible for the file
> >> creation is  "RandomAccessBufferedFileInputStream(InputStream is)". This
> >> object is used by the PDFParser object which doesn't close the stream
> >> after
> >> completion.
> >>
> >> The release note 2.0.0 [PDFBOX-2723] seems to handle this bug by adding
> >> the
> >> following line (see https://issues.apache.org/jira/browse/PDFBOX-2723) in
> >> the COSParser :
> >>
> >> xrefStream.close(); // <--- *** NEW LINE ***
> >>
> >> But, in debug mode, I saw this line is never reached so the stream is not
> >> closed and the tmp file is not deleted. Has anybody a workaround to handle
> >> this ?
> >> Thanks for your help!
> >>
> >>
> >

New mail archives interface

2016-05-19 Thread Andreas Lehmkühler

Hi,

at the recent ApacheCon NA conference, the new mail archives interface was
unveiled.

See https://lists.apache.org/


BR
Andreas

Re: PdfParser giving garbage character

2016-05-13 Thread Andreas Lehmkühler

> Mohit Goyal wrote on 13 May 2016 at 08:28:
> 
> 
> Hi,
> 
> I have one PDF which has data in Malayalam (Indian language). I tried to parse
> this data using Apache Tika but I got the garbage character '?' in the output.
> 
> 
> I checked the PDF using the pdffonts utility; it seems like some ToUnicode
> table is missing.
> Output of pdffonts:
> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                 type         emb sub uni object ID
> -------------------- ------------ --- --- --- ---------
> YTLJPR+AnjaliOldLipi CID TrueType yes yes yes   1671  0
> Times-Roman          Type 1       no  no  no    1672  0
> Times-Bold           Type 1       no  no  no     127  0
> 
> 
> Please find attached pdf.
The pdf didn't make it due to some restrictions to the mailing list. You have to
provide a link to a public download.
> 
> Code:
> 
> BufferedWriter writer=  Files.newWriter(new
> File("file-output.txt"), Charset.forName("UTF-8"));
> BodyContentHandler handler = new BodyContentHandler(writer);
> ParseContext pcontext = new ParseContext();
> Metadata metadata = new Metadata();
>PDFParser pdfparser = new PDFParser();
>pdfparser.parse(inputstream, handler, metadata,pcontext);
> 
> Any suggestions??
Are you sure that you are using PDFBox? The code doesn't look like ours.
> 
> Thanks
> Mohit Goyal

BR
Andreas

[ANNOUNCE] Apache PDFBox 2.0.1 released

2016-04-26 Thread Andreas Lehmkühler

The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.1. The release is available for download at:

http://pdfbox.apache.org/download.cgi

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.1

Introduction


The Apache PDFBox library is an open source Java tool for working with PDF
documents.

This is an incremental bugfix release based on the earlier 2.0.0 release. It
contains a couple of fixes and small improvements.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-3272] - Loaded fonts file descriptors open after closing document
[PDFBOX-3273] - Fonts not rendered correctly
[PDFBOX-3276] - Double encryption dictionary for files with XRef stream
[PDFBOX-3279] - PDDocument.importPage creates two inputstreams
[PDFBOX-3281] - HTML output wrongly specifies UTF-16 in header
[PDFBOX-3286] - Think I found a bad constant (TTF) value and constant use in
PDFBox source
[PDFBOX-3292] - Error reading stream, expected='endstream' actual='' in
non-truncated files
[PDFBOX-3297] - Infinite loop
[PDFBOX-3299] - TIFF-files with FillOrder=2 can't be converted to PDF
[PDFBOX-3301] - NPE in PDAcroForm.flatten if a widget doesn't contain a /P entry
[PDFBOX-3303] - setWidgets should set connection to parent
[PDFBOX-3308] - Missing endOfName chars
[PDFBOX-3312] - NPE in saveIncremental() / fix javadoc
[PDFBOX-3317] - Merged PDF/A files no longer valid PDF/A
[PDFBOX-3319] - Chinese character overlap other chinese character

Improvement

[PDFBOX-3275] - Show glyph bounds in DrawPrintTextLocations
[PDFBOX-3289] - Wrong unit MM_PER_INCH in PDRectangle
[PDFBOX-3295] - Improve parsing performance of object streams
[PDFBOX-3305] - PDPageContentStream should allow drawing images at current
position
[PDFBOX-3307] - Enable AES128 encryption
[PDFBOX-3323] - Cannot set destination meta data in PDFMergerUtility

Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://svn.apache.org/repos/asf/pdfbox/KEYS.


[ANNOUNCE] Apache PDFBox 1.8.12 released

2016-04-26 Thread Andreas Lehmkühler

The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 1.8.12. The release is available for download at:

http://pdfbox.apache.org/download.cgi

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 1.8.12

Introduction


The Apache PDFBox library is an open source Java tool for working with PDF
documents.

This is an incremental bugfix release based on the earlier 1.8.11 release. It 
contains a couple of fixes and small improvements.

For more details on all fixes included in this release, please refer to the
following issues on the PDFBox issue tracker at
https://issues.apache.org/jira/browse/PDFBOX.

Bug

[PDFBOX-1995] - AdobePDFSchema.getProducer() returns empty string
[PDFBOX-2428] - An error occured when reading table hmtx
[PDFBOX-3024] - Preflight validation call PDType0Font.clear at the wrong time
[PDFBOX-3116] - COSNumber NumberFormatException for large number
[PDFBOX-3201] - Skip zlib-header and checksum to avoid DataFormatException
[PDFBOX-3204] - JVM crashes on PDFRenderer.renderImageWithDPI
[PDFBOX-3217] - PdfaExtensionHelper.populatePDFAPropertyType
[PDFBOX-3226] - No such Element Exception processing File
[PDFBOX-3229] - Decryption fails when Metadata not encrypted but EncryptMetadata
is true/default.
[PDFBOX-3235] - ColorSpace validation fails for inlined image
[PDFBOX-3237] - ASCII85Filter does not use or recognize the correct end-of-data
terminator
[PDFBOX-3254] - Corrupted XMP causes java.lang.StringIndexOutOfBoundsException
[PDFBOX-3257] - XMPSchemaBasic setCreateDate and setModifyDate don't work if
already set
[PDFBOX-3258] - XMPBox XMPBasicSchema setters don't work if already set
[PDFBOX-3259] - ClassCastException in PDTilingPattern.getContents
[PDFBOX-3285] - All lines that use a given font stop rendering if 'ö' is
inserted - ArrayIndexOutOfBoundsException in TTFSubFont.buildPostTable
[PDFBOX-3297] - Infinite loop
[PDFBOX-3299] - TIFF-files with FillOrder=2 can't be converted to PDF
[PDFBOX-3308] - Missing endOfName chars
[PDFBOX-3321] - ASCII stream data size is increased when written

Improvement

[PDFBOX-1840] - Automatically load isartor for preflight tests
[PDFBOX-3196] - Update maven plugins and apache parent pom
[PDFBOX-3231] - Update PDPropBuildDataDict
[PDFBOX-3251] - Improve parsing and validation of ColorSpace for inline image
[PDFBOX-3295] - Improve parsing performance of object streams

Wish

[PDFBOX-3241] - return original PDF Header


Release Contents


This release consists of a single source archive packaged as a zip file.
The archive can be unpacked with the jar tool from your JDK installation.
See the README.txt file for instructions on how to build this release.

The source archive is accompanied by SHA1 and MD5 checksums and a PGP
signature that you can use to verify the authenticity of your download.
The public key used for the PGP signature can be found at
https://svn.apache.org/repos/asf/pdfbox/KEYS.


Re: Cannot comment on Jira issues anymore

2016-04-22 Thread Andreas Lehmkühler

Hi,

> alexander.kriegi...@extern.sdv-it.de wrote on 22 April 2016 at 09:50:
> 
> 
> Sorry to bother everyone here on the mailing list, but something seems to 
> be wrong in Jira: I cannot comment on 
> https://issues.apache.org/jira/browse/PDFBOX-3323 and other issues 
> anymore, the comment button has vanished.
Infra changed the auth settings for JIRA due to a lot of spam. According to a
discussion on infra@ they are working on a solution to be able to revert that
change. They expect to get this done within the next 24 - 48 hours at most.

I've added your JIRA-account to the contributor-group so that you should be able
to comment again.

BR
Andreas

Re: pdfbox-android

2016-04-20 Thread Andreas Lehmkühler

> Paul Mitchell wrote on 20 April 2016 at 10:47:
> 
> 
> Hi
> 
> I’m not sure if I’ve come to the right spot for my question. Hopefully you can
> help me or direct me to someone who can help me
> 
> I’m currently using pdfbox-android with android studio
> compile ‘org.apache: pdfbox-android:1.8.9.0’
We, the Apache PDFBox community, don't provide such a piece of software. There
is no official android version of PDFBox.

> My questions are
> A) any metadata I try and get from a PDF is returning null ie
> getDocumentInformation.getTitle()
> I know the document is being read as getNumberOfPages() returns the correct
> amount of pages.
> Is this an known issue with V 1.8.9.0
> I also know that the document has metadata as I can see it in the properties
> when I view it through adobe
Impossible to say without the pdf in question.

> B) is there a later android version I can reference? i.e compile ‘org.apache:
> pdfbox-android:1.8.11.0’
You might ask the original author of pdfbox-android.

> Thanks for your time
> 
> Paul

BR
Andreas

Fwd: The Apache® Software Foundation announces Apache PDFBox™ v2.0

2016-03-21 Thread Andreas Lehmkühler




-Original Message-
From: Sally Khudairi <s...@apache.org>
Sent: 21 March 2016 12:44:18 CET
To: Apache Announce List <annou...@apache.org>
Subject: The Apache® Software Foundation announces Apache PDFBox™ v2.0

>> this announcement is available online at https://s.apache.org/Ly9B

Milestone release of Open Source Java tool for working with PDF documents 
features dozens of improvements and enhancements

Forest Hill, MD —21 March 2016— The Apache Software Foundation (ASF), the 
all-volunteer developers, stewards, and incubators of more than 350 Open Source 
projects and initiatives, announced today the availability of Apache® PDFBox™ 
v2.0, the Open Source Java tool for working with Portable Document Format (PDF) 
documents. 

PDF was first released by Adobe Systems in 1993, and became an ISO 
International Standard - ISO 32000-1 in 2008. Apache PDFBox allows for the 
creation of new PDF documents, manipulation, rendering, signing of existing 
documents and the ability to extract content from documents. In addition, 
PDFBox includes several command line utilities. In February 2015, the project 
became the first Open Source Partner Organization of the PDF Association. 

"PDF is a very popular and easy to use format for document exchange. It is used 
by millions of people every day, however the format itself is quite complicated 
and a real challenge to write a piece of software to work with it," said 
Andreas Lehmkühler, Vice President of Apache PDFBox. "This new major release of 
PDFBox includes a lot of improvements, fixes and new features which should make 
the life easier for our users." 

Under The Hood 
The Apache PDFBox library enables users to create new PDF documents, manipulate 
existing documents, extract content, digitally sign, print, and validate files 
against the PDF/A-1b standard. Its command line utilities include encrypt, 
decrypt, overlay, debugger, merger, PDFToImage, and TextToPDF. 

PDFBox v2.0 reflects 1,167 solved issues, 418 of which were back-ported to 
v1.8, as well as dozens of improvements and enhancements. Highlights include: 

 - improved rendering and text extraction 
 - Unicode support for PDF creation 
 - overhauled interactive forms support 
 - extended signing and encryption support 
 - overhauled parser including a self-healing mechanism for malformed or 
corrupted PDFs 
 - reduced memory/resources footprint including fine grained control of memory 
usage 
 - enhanced preflight module for PDF/A-1b conformance checking 
 - rearranged package structure to allow smaller runtime environments 

A guide to migrating to v2.0 is available at 
http://pdfbox.apache.org/2.0/migration.html , with community support at 
http://pdfbox.apache.org/mailinglists.html 

"We thank all the people from our small but fine community for their support," 
explained Lehmkühler. "Special thanks also goes to our fellow colleagues from 
the Apache Tika project for their cooperation in stress-testing with a corpus 
of 250,000 PDF files." 

"We are grateful for the Google Summer of Code program," said PDFBox committer 
Tilman Hausherr. "The project allowed us to hire students to improve 3D 
rendering and the PDFDebugger stand-alone application, which also sped up our 
own bug finding." 

"Apache PDFBox v2.0 is a significant milestone as it took us several years to 
complete," added Lehmkühler. "This long-awaited release is the collective 
achievement of more than 150 individuals who have contributed code to date. 
Without their frequent contributions it wouldn't be possible to drive a project 
like PDFBox." 

Availability and Oversight 
Apache PDFBox software is released under the Apache License v2.0 and is 
overseen by a self-selected team of active contributors to the project. A 
Project Management Committee (PMC) guides the Project's day-to-day operations, 
including community development and product releases. For downloads, 
documentation, and ways to become involved with Apache PDFBox, visit 
http://pdfbox.apache.org/ 

About The Apache Software Foundation (ASF) 
Established in 1999, the all-volunteer Foundation oversees more than 350 
leading Open Source projects, including Apache HTTP Server --the world's most 
popular Web server software. Through the ASF's meritocratic process known as 
"The Apache Way," more than 550 individual Members and 5,300 Committers 
successfully collaborate to develop freely available enterprise-grade software, 
benefiting millions of users worldwide: thousands of software solutions are 
distributed under the Apache License; and the community actively participates 
in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's 
official user conference, trainings, and expo. The ASF is a US 501(c)(3) 
charitable organization, funded by individual donations and corporate sponsors 
including Alibaba Cloud Computing, ARM, Blo

Re: Spaces are ignored when reading a PDF file

2016-03-19 Thread Andreas Lehmkühler

Hi,

> Frank van der Hulst wrote on 17 March 2016 at 08:34:
> 
> 
> Spaces don't exist as characters in PDFs. To identify spaces, you have to
> compare the X coordinates of adjacent characters against their widths.
That's not correct. Spaces do exist, but in most cases PDF producers omit them
and instead split the text into pieces with appropriate positioning.

BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:

   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
-384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
(the) -383 (right) ] TJ

The text is between the parentheses, and the numbers are used for horizontal
positioning.
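
To illustrate how those numbers relate to spaces, here is a minimal sketch in plain Java (no PDFBox involved). It relies on the rule that a number in a TJ array is given in thousandths of a text space unit and is subtracted from the current position, so a negative number moves to the right. The class name, the font size of 10 and the threshold of 2.0 are assumptions made up for this example, and horizontal scaling is ignored.

```java
class TjGapSketch {
    /** Horizontal shift, in text space units, caused by one numeric TJ entry. */
    static double shiftFor(double adjustment, double fontSize) {
        // TJ numbers are thousandths of a unit and are subtracted from the
        // current position, so a negative adjustment moves to the right.
        return -adjustment / 1000.0 * fontSize;
    }

    /** Joins the strings of a TJ array, adding a space at large rightward shifts. */
    static String join(Object[] tjArray, double fontSize, double minGap) {
        StringBuilder sb = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                sb.append((String) item);
            } else if (item instanceof Number) {
                double shift = shiftFor(((Number) item).doubleValue(), fontSize);
                if (shift >= minGap) {
                    sb.append(' ');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First entries of the excerpt above; font size and threshold are assumed.
        Object[] tj = {"W", 55, "ith", -383, "due", -384, "r", 18, "egar", 18, "d", -383, "to"};
        System.out.println(join(tj, 10.0, 2.0));
    }
}
```

The small positive numbers (55, 18) are kerning corrections and stay below the threshold, while the values around -383 act as word gaps.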

BR
Andreas

> 
> On Thu, Mar 17, 2016 at 7:12 PM, Hesham G.  wrote:
> 
> > Hello ,
> >
> > I have a PDF file created using Latex. I am trying to read and print all
> > letters in that file using PDFBox, but when doing this all spaces in that
> > file are ignored. Here is the code I am using:
> > PDPage page = (PDPage)allPages.get( 0 );
> > PDStream contents = page.getContents();
> > if ( contents != null ) {
> > PDFTextStripperProcessor pdfTextStripperProcessor = new
> > PDFTextStripperProcessor();
> > pdfTextStripperProcessor.processStream( page, page.findResources(),
> > contents.getStream() );
> > }
> >
> > public class PDFTextStripperProcessor extends PDFTextStripper {
> > @Override
> > public void processTextPosition( TextPosition text )  {
> > System.out.println( text.getCharacter() );
> > }
> > }
> >
> > And you can check a one page file sample here to test it:
> >
> > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
> >
> > What is the cause of this issue please?
> >
> >
> > Best regards ,
> > Hesham


Re: Spaces are ignored when reading a PDF file

2016-03-18 Thread Andreas Lehmkühler

> "Hesham G." wrote on 17 March 2016 at 11:20:
> 
> 
> Andreas,
> 
> That is very helpful.
> 
> I can get the x location of each character using TextPosition.getX(), ex:
> W: 102.88399
> i: 114.18165
> t: 117.660614
> h: 121.55801
> d: 133.09477
> u: 140.3994
> e: 147.60838
> 
> So to detect the space between the 2 words "With" & "due" should I make 
> subtraction calculations between X of the last letter(h) and the X of the 
> first letter (d) and if the number is large than normal then this is a 
> space? I think this way might be risky in the detection, or what?
That's the short story. Deciding what is normal can be quite tricky. You have
to take the following facts into account:

- different fonts have different widths (important if the font before the space
isn't the same as the font after the space)
- keep in mind that you have to take scaling and sometimes rotation into
account
- the "space" between characters may vary if the text is justified

There are certainly some other details which may be important as well, so
you end up with a more or less heuristic approach.
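
A minimal sketch of such a heuristic in plain Java, assuming the x position and the width of each character are already known and that all characters share one font, size and orientation. The x values are the ones quoted above; the widths, the class name and the factor 0.3 are assumptions for this example only.

```java
class SpaceHeuristicSketch {
    /**
     * Rebuilds a line from single characters and inserts a space wherever the
     * gap to the previous character exceeds factor * (narrower neighbour width).
     */
    static String withSpaces(char[] chars, double[] xs, double[] widths, double factor) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0) {
                // distance between the end of the previous glyph and the start of this one
                double gap = xs[i] - (xs[i - 1] + widths[i - 1]);
                double reference = Math.min(widths[i - 1], widths[i]);
                if (gap > factor * reference) {
                    sb.append(' ');
                }
            }
            sb.append(chars[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] chars = {'W', 'i', 't', 'h', 'd', 'u', 'e'};
        double[] xs = {102.88399, 114.18165, 117.660614, 121.55801, 133.09477, 140.3994, 147.60838};
        // Hypothetical widths; real code would take them from the font or TextPosition.
        double[] widths = {11.85, 3.48, 3.90, 6.96, 7.30, 7.21, 5.57};
        System.out.println(withSpaces(chars, xs, widths, 0.3));
    }
}
```

A fixed factor like this will misfire on justified text or mixed fonts, which is exactly why the result stays a heuristic.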

BR
Andreas

> Best regards ,
> Hesham

Re: PrintTextLocations 1.8 vs 2.0

2016-03-16 Thread Andreas Lehmkühler

Hi,

> Peter Prusinowski wrote on 16 March 2016 at 09:52:
> 
> 
> Good morning,
> 
> thank you for the hints, now I am overwriting showGlyph() and trying to 
> get the value with
> 
>  PDSimpleFont sf = (PDSimpleFont) font;
>  String name = sf.getEncoding().getName(code);
>  sf.getPath(name).getBounds()
> 
> but I am getting the same height, no matter which font size is set. This 
> happens with type1 and truetype fonts. What am I doing wrong ?
The font always provides the same unscaled shapes. You have to take the text
transformation matrix and the font matrix into account. Have a look at
PageDrawer#showFontGlyph to see how to do so.
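
The general idea can be sketched with java.awt.geom alone. This is not the code from PageDrawer; the class and method names are made up, and the font matrix of 0.001 is the usual value for Type 1 and TrueType fonts, assumed here.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class GlyphScaleSketch {
    /**
     * Maps unscaled glyph bounds to text space: the font matrix is applied
     * first, then the font size, then the text matrix.
     */
    static Rectangle2D scaledBounds(Rectangle2D glyphBounds, AffineTransform fontMatrix,
                                    AffineTransform textMatrix, double fontSize) {
        AffineTransform at = new AffineTransform(textMatrix);
        at.scale(fontSize, fontSize);
        at.concatenate(fontMatrix);
        return at.createTransformedShape(glyphBounds).getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical glyph bounds in glyph units (1000 units per em).
        Rectangle2D glyph = new Rectangle2D.Double(0, 0, 500, 700);
        AffineTransform fontMatrix = AffineTransform.getScaleInstance(0.001, 0.001);
        AffineTransform textMatrix = new AffineTransform(); // identity for this example
        System.out.println(scaledBounds(glyph, fontMatrix, textMatrix, 14).getHeight());
        System.out.println(scaledBounds(glyph, fontMatrix, textMatrix, 28).getHeight());
    }
}
```

With the same unscaled shape, the height now grows with the font size, which is the piece missing when only getPath(name).getBounds() is used.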

HTH
Andreas
> 
> On 07.03.2016 at 18:16, Tilman Hausherr wrote:
> > On 07.03.2016 at 11:46, Peter Prusinowski wrote:
> >> Okay, thank you for information. I tried to get the height with 
> >> getPath(). If its one of the 14 standard fonts, I can get the height 
> >> with PDType1Font.fontName.getPath(text.getUnicode()).getBounds()). 
> >> But I dont know how to get the information from other fonts in a 
> >> generic way. Do you have a hint for me ?
> >
> > It is not available for all fonts. It is available for all 
> > PDSimpleFont objects, except for PDType3Font (which doesn't draw just 
> > vectors).
> >
> > The best would be to look at the source code, at PageDrawer.java
> >
> > createGlyph2D() returns a Glyph2D for the font. That one you can use 
> > for glyph2D.getPathForCharacterCode(code);
> >
> > See also showFontGlyph(), you can override that one in a subclass.
> >
> > Have also a look at showGlyph(), this makes a difference between type3 
> > fonts and others. See also CustomGraphicsStreamEngine.
> >
> > Tilman
> >
> >
> >
> >>
> >> Peter
> >>
> >> On 06.03.2016 at 17:40, Tilman Hausherr wrote:
> >>>
> >>> In 1.8, for Standard 14 fonts (yours is) it uses the bounding box of 
> >>> each glyph. In a string, it uses a maximum which it keeps for the 
> >>> string, that results in the weird effect that the "d" is slightly 
> >>> higher. If the string is changed so that another glyph is appended, 
> >>> the larger height is kept.
> >>>
> >>> In 2.0 (and in 1.8 for non standard 14 fonts), it uses 1/2 of the 
> >>> bounding box from the font descriptor. The not-halved bounding box 
> >>> is usually too high.
> >>>
> >>> Anyway, the 1.8 logic would work for you for standard 14 fonts, but 
> >>> not for all other fonts.
> >>>
> >>> So there is no bug in 1.8 nor in 2.0.
> >>>
> >>> Tilman
> >>>
> >>> On 03.03.2016 at 19:05, Tilman Hausherr wrote:
>  On 03.03.2016 at 09:11, Peter Prusinowski wrote:
> > Okay, I am trying to replace some words in documents and use 
> > text.height to "delete" these words. Here is an example document : 
> > http://workupload.com/file/G8ipDe8j
> 
>  The getHeightDir() is not the best strategy, for the reason I 
>  mentioned yesterday. In your case, you should call getPath() on the 
>  glyphs and get the bounding box. Or just get the font bounding box 
>  (there's a method) height, however that one is often too high, so 
>  there's a risk that you blank the line above.
> 
>  But thanks for the file, I'll try to find out why it is different. 
>  The heights in 1.8 are surprising, usually they are never so 
>  "perfect" (as I said yesterday). And for some reason, in 1.8 the 
>  height of the last glyph is slightly different although it is all 
>  in one string.
> 
>  1.8:
>  String[100.0,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=10.108002]H
>  String[110.108,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=7.784004]e
>  String[117.892006,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=3.8919983]l
>  String[121.784004,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=3.8919983]l
>  String[125.676,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=8.553993]o
>  String[134.23,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=3.8919983]
>  String[138.122,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=13.216003]W
>  String[151.338,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=8.554001]o
>  String[159.892,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=5.445999]r
>  String[165.338,92.0 fs=14.0 xscale=14.0 height=10.052001 
>  space=3.8920004 width=3.8919983]l
>  String[169.23,92.0 fs=14.0 xscale=14.0 *height=10.248001* 
>  space=3.8920004 width=8.554001]d  <= ???
> 
>  2.0:
>  String[100.0,92.0 fs=14.0 xscale=14.0 height=8.33 space=3.8920004 
>  width=10.108002]H
>  String[110.108,92.0 fs=14.0 xscale=14.0 height=8.33 space=3.8920004 
>  width=7.7839966]e
>  String[117.892,92.0 fs=14.0 xscale=14.0 height=8.33

Re: Fields and "]" + Checkboxes

2016-03-09 Thread Andreas Lehmkühler

Hi,

> Al Grant wrote on 8 March 2016 at 18:57:
> 
> 
> Morning All,
> 
> I have been writing some Java with PDFBox for a few weeks now. It's been
> very good so far.
> 
> My goal is to loop through all the fields in a form, grab the values and
> write the value to a corresponding field in a DB. By and large I have this
> working.
> 
> I however have two questions:
> 
> 1. When importing the value of a combobox I am getting the value enclosed
> in square braces. Anyone know why - or do I need to handle this
> programmatically?
getValue returns a list of strings, and printing that list produces the string
representation with square brackets you describe. The list contains the selected
value, or several values if multiselect is allowed and more than one value is
selected.
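
The brackets come from the way Java prints a list, as the following plain Java sketch shows. It does not call PDFBox; the list stands in for the value returned for a choice field, and the class name and the ";" separator are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;

class ChoiceValueSketch {
    /** Turns the selected values into one string suitable for a single DB column. */
    static String toDbValue(List<String> selected) {
        return String.join(";", selected);
    }

    public static void main(String[] args) {
        List<String> single = Arrays.asList("Option A");
        List<String> multi = Arrays.asList("Option A", "Option C");

        System.out.println(String.valueOf(single)); // [Option A]
        System.out.println(toDbValue(single));      // Option A
        System.out.println(toDbValue(multi));       // Option A;Option C
    }
}
```

For a field that allows only one selection, taking the first element of the list is enough.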

> 2. The code so far loops through all the fields and grabs strings - but I
> am not sure how to handle exclusive checkboxes (ie only one value selected
> allowed).
I didn't get your point, but hopefully my answer to your first question answers
this one too? ;-)

> Cheers
> 
> -Al
> 
> 
> -- 
> "Beat it punk!"
> - Clint Eastwood


BR
Andreas


Re: memory consumption PDFBox 2.0.0

2016-03-01 Thread Andreas Lehmkühler

Hi,

> Felix Benz-Baldas wrote on 1 March 2016 at 12:35:
> 
> 
> Hello,
> 
> we plan to use PDFBox 2.0.0 for converting PDFs to JPEG. We want to convert a
> very large number of documents (more than one million).
> 
> One question: Is it possible to control the memory-consumption? When I start
> my java program with "-Xmx2g" I ran into a "java.lang.OutOfMemoryError: Java
> heap space" after about 40 minutes.
> 
> With "-Xmx4g" the error did not occur.
> 
> Is there a way to reduce the memory-consumption?
It depends on the cause of that exception:

- your code has a memory leak
- PDFBox has a memory leak
- one or more of the PDFs you are processing is malformed or otherwise critical
- something else I'm missing

As a start, please post the relevant part of your code. Run your app in a
profiler to analyze the memory consumption and garbage collection. Try to detect
critical pdfs and share them with us.
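
As a lightweight first step before using a profiler, the heap in use can be logged every few documents to see whether it keeps growing or jumps at a particular file. This is a plain Java sketch; the class name, the batch size and the logging interval are assumptions, and the loop body is only a placeholder for the real conversion.

```java
class HeapLogSketch {
    /** Heap currently in use by this JVM, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Formats a byte count as whole megabytes. */
    static String formatMb(long bytes) {
        return (bytes / (1024L * 1024L)) + " MB";
    }

    public static void main(String[] args) {
        int documents = 1000; // hypothetical batch size
        for (int i = 1; i <= documents; i++) {
            // convert document i here (placeholder for the real work)
            if (i % 100 == 0) {
                System.out.println("after " + i + " documents: " + formatMb(usedHeapBytes()));
            }
        }
    }
}
```

A steadily rising value points to a leak, while a sudden jump points to one of the critical PDFs.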

BR
Andreas

> Kind regards from the CAS Campus
> 
> Felix Benz-Baldas
> 
> CAS Ecosystems - eine SmartCompany der CAS Software AG -
> www.cas-ecosystems.de
> 
> 
> CAS Software AG · CAS-Weg 1 - 5 · 76131 Karlsruhe, Germany · Phone: +49
> 721-9638-0
> Successful relationships. www.cas.de/en ·
> linkedin
> Legals
> 
> Executive board: Martin Hubschneider (CEO) · Ludwig Neer
> Supervisory board: Dr. Dr. Jörg Maurer (chair) · Prof. Dr. Peter Lockemann ·
> Kurt Sibold
> Head Office: Karlsruhe · County Court Mannheim · Register of Companies Number:
> HRB 108751 · VAT Identification Number: DE143593148
>


Re: Rotating a new annotation to match the page's rotation

2016-02-24 Thread Andreas Lehmkühler

Hi,

> Gilad Denneboom wrote on 24 February 2016 at 09:34:
> 
> 
> No one has any ideas? ...
> 
> On Sun, Feb 21, 2016 at 12:30 AM, Gilad Denneboom wrote:
> 
> > Hi all,
> >
> > Hoping someone can help me with this issue...
> > I have a tool that adds new highlight annotations to a page. It works very
> > well, except for when the page is rotated. I know I need to apply a
> > transformation to my rect and/or quads to get them to match the rotated
> > user space, but I just can't get it to work.
> > Is there a utility in PDFBox (I'm using 1.8.11 at the moment) that can
> > help me perform this transformation so I can place my annotations at the
> > right location on these pages?
> >
> > Thanks a lot in advance for any helpful tips...
I'm not an annotation expert, but according to the spec both the Rect and the
QuadPoints values are specified in default user space, which doesn't include any
rotation or scaling. But I have no clue where to put this information instead.
Can you create a sample PDF with such an annotation using Acrobat or something
similar so that we can have a look at what it looks like?
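
Because the values live in default user space, one possible approach is to convert coordinates measured on the rotated view back to unrotated page coordinates before building the annotation. The following plain Java sketch is untested against real PDFs and rests on several assumptions: the media box starts at (0,0), there is no crop box offset, the visual coordinates have their origin at the bottom-left corner of the page as displayed, and the rotation is a multiple of 90. All names are made up for this example.

```java
class RotationSketch {
    /**
     * Converts a point measured on the rotated (displayed) page into unrotated
     * default user space. w and h are the width and height of the unrotated page.
     */
    static double[] toDefaultUserSpace(double xv, double yv, double w, double h, int rotation) {
        switch (((rotation % 360) + 360) % 360) {
            case 90:
                return new double[] { w - yv, xv };
            case 180:
                return new double[] { w - xv, h - yv };
            case 270:
                return new double[] { yv, h - xv };
            default:
                // 0 degrees (values that are not multiples of 90 are not handled)
                return new double[] { xv, yv };
        }
    }

    public static void main(String[] args) {
        // Bottom-left corner of a letter-size page displayed with a rotation of 90
        double[] p = toDefaultUserSpace(0, 0, 612, 792, 90);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

Each corner of the Rect and each of the quad points would have to be converted this way, and for the Rect the lower-left and upper-right corners need to be re-sorted afterwards.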

> > Gilad

BR
Andreas


[ANNOUNCE] Apache PDFBox 2.0.0 RC3 released

2016-01-15 Thread Andreas Lehmkühler

The Apache PDFBox community is pleased to announce the release of
Apache PDFBox version 2.0.0 RC3. The release is available for download at:

http://pdfbox.apache.org/download.cgi

The extensive feedback on our second release candidate helped us to make
this release candidate better again, e.g. an optimized font cache and improved
text extraction. A lot of bug fixes are included as well.
We'd like to thank everybody who helped us to take a step forward.
Please have a look at the new release candidate as well, so that the next
release can hopefully be the final one.

See the full release notes below for details about this release.

Release Notes -- Apache PDFBox -- Version 2.0.0-RC3

Introduction


The Apache PDFBox library is an open source Java tool for working with PDF
documents.

This is the third release candidate for the upcoming major release 2.0.0 of
PDFBox.
This release contains a lot of improvements, fixes and refactorings. The API is
supposed to be stable, but we can't guarantee that there won't be any last
changes to it before providing the final release candidate.

For more details on these changes and all the other fixes and improvements
included in this release, please refer to the following issues on the
PDFBox issue tracker at https://issues.apache.org/jira/browse/PDFBOX.

Sub-task

[PDFBOX-1869] - Implementation for ShadingType 1
[PDFBOX-1870] - PDFunctionType0 incorrect
[PDFBOX-2117] - AxialShadingContext is slow
[PDFBOX-2279] - Text with gradient not shown
[PDFBOX-2529] - Preflight: mention the page on which a problem has been found
[PDFBOX-2531] - better error message on not yet read stream
[PDFBOX-2535] - mention subtype in COSStream IOException
[PDFBOX-2536] - More specific TIFFFaxDecoder exceptions
[PDFBOX-2537] - do not discard underlying cause when creating validation error
[PDFBOX-2611] - possibly incorrect error message "Hexa String must have only
Hexadecimal Characters" in preflight
[PDFBOX-2612] - error "Destination contains invalid page reference 'null'" is
not detected by preflight
[PDFBOX-2613] - Conflicting /N information for OutputIntent not detected by
preflight
[PDFBOX-2614] - missing /Type/FontDescriptor not detected by preflight
[PDFBOX-2619] - XMP dates contain time zone, while document info dates do not,
and this isn't detected by preflight
[PDFBOX-2625] - Preflight error: The character with CID 0 should have a width
equals to 57.0, but has 57.78
[PDFBOX-2627] - Add block composer to handle multiline text
[PDFBOX-2630] - "loop in destinations" not detected by preflight
[PDFBOX-2647] - Check thumbnails in XMP metadata
[PDFBOX-2718] - Allow to create new AcroForm fields from scratch
[PDFBOX-2783] - Remove getCOSDictionary() method, adjust getCOSObject() return
type
[PDFBOX-2849] - fix problems with setting existing AcroForm buttons
[PDFBOX-2863] - Support the comb flag for PDF forms
[PDFBOX-2877] - Wrong text placement for autosize fields compared to Adobe
generated
[PDFBOX-2889] - Support appearance generation for choice fields
[PDFBOX-2900] - PDF Debugger doesn't print inline images correctly
[PDFBOX-2993] - Create a PDTransparencyGroup for added code clarity
[PDFBOX-2994] - Rename PDGroup to PDTransparencyGroupAttributes
[PDFBOX-3051] - COSArray.getObject() incorrect handling of indirect reference to
COSNull
[PDFBOX-3052] - NPE in PDFStreamEngine.ShowText when no font set
[PDFBOX-3053] - Text extraction fails with type 3 fonts
[PDFBOX-3057] - NPE in CFFParser.parseType1Dicts()
[PDFBOX-3060] - Catalog cannot be found
[PDFBOX-3061] - Word concatenation in 2.0 not in 1.8
[PDFBOX-3062] - Text extraction and height different in 2.0
[PDFBOX-3068] - Null metadata in 2.0 in some files that had metadata in 1.8.10
with old parser
[PDFBOX-3112] - Avoid crazy /Length1 values in font descriptor
[PDFBOX-3123] - Text extraction garbled in this file, was OK in 1.8
[PDFBOX-3125] - IndexOutOfBoundsException in PDFont.getWidth()
[PDFBOX-3126] - IndexOutOfBoundsException in PfbParser.parsePfb
[PDFBOX-3127] - Text with vertical font not extracted correctly
[PDFBOX-3129] - NullPointerException in PDFStreamEngine.showText()
[PDFBOX-3186] - Parsing fails when XRef stream object is 1 byte later

Bug

[PDFBOX-31] - bug with the Type3 font
[PDFBOX-37] - Text Extraction Weirdness
[PDFBOX-40] - Font problem when setting form value
[PDFBOX-53] - Problem getting value from PDRadioCollection
[PDFBOX-54] - please correct the SetField example
[PDFBOX-62] - Incorrect (zero) character widths returned in some docs
[PDFBOX-101] - ImportXFDF results in PDF with larger text fields
[PDFBOX-123] - too many space made in extracted text file
[PDFBOX-129] - Error when setting the value of a combo box to " "
[PDFBOX-159] - Field renaming character set problem
[PDFBOX-161] - java.util.EmptyStackException from PDFTextStripper.writeText
[PDFBOX-166] - ConvertColorSpace RGB to CMYK
[PDFBOX-198] - Tiff image problems
[PDFBOX-205] - Miscellaneous errors on valid files
[PDFBOX-239] - PDFToImage prints every word

Re: Shell Can't Find pdfbox

2015-11-03 Thread Andreas Lehmkühler

Hi,

> Jonathan Levi wrote on 3 November 2015 at 03:25:
> 
> 
> I'm finding that shell commands to use pdfbox-app-1.8.10.jar won't work unless
> the full path is used. Example:
> 
> drj-air:Desktop jonathan$ ls /usr/local/bin/pdfb*
> /usr/local/bin/pdfbox-app-1.8.10.jar
> drj-air:Desktop jonathan$ java -jar pdfbox-app-1.8.10.jar ExtractText
> Contact-Office-Letter.pdf 
> Error: Unable to access jarfile pdfbox-app-1.8.10.jar
> drj-air:Desktop jonathan$ java -jar /usr/local/bin/pdfbox-app-1.8.10.jar
> ExtractText Contact-Office-Letter.pdf 
> drj-air:Desktop jonathan$ 
> 
> Is there a shell variable that has to be set for Java to know where to look
> for jars?
I guess the issue might be that your user doesn't have sufficient rights to
access the jar.

Check the output of 

ls -al /usr/local/bin/pdfb*

and change the access permissions if necessary

> Using Mac OS X 10.11.1.
> 
> TIA,
> 
> Jonathan

BR
Andreas


Re: Failure to close files on parse error

2015-11-02 Thread Andreas Lehmkühler

Hi,

> Jesse Long wrote on 2 November 2015 at 12:26:
> 
> 
> Hi All,
> 
> The changes to PDDocument in eb83a299bbe39c2e59735aca2b39bca312c1ddc4 
> were insufficient, please include attached patch.
Please provide a JIRA ticket number or a svn revision as a reference

TIA,
Andreas

> Thanks,
> Jesse
> 

Re: Anyone know how to set up a bouncycastle?

2015-09-24 Thread Andreas Lehmkühler

Hi,

> Eric Douglas wrote on 18 September 2015 at 16:53:
> 
> 
> I'm trying to read a PDF using pdfbox, and on one system I get this error:
> 
> cannot create instance of
> org.bouncycastle.jcajce.provider.digest.GOST3411$Mappings
> : java.security.AccessControlException: access denied
> ("java.security.SecurityPermission"
> "putProviderProperty.BC")
> java.lang.InternalError: cannot create instance of org.bouncycastle.jcajce.
> provider.digest.GOST3411$Mappings : java.security.AccessControlException:
> access denied ("java.security.SecurityPermission" "putProviderProperty.BC")
> org.bouncycastle.jce.provider.BouncyCastleProvider.loadAlgorithms(Unknown
> Source)
> org.bouncycastle.jce.provider.BouncyCastleProvider.setup(Unknown Source)
> org.bouncycastle.jce.provider.BouncyCastleProvider.access$000(Unknown
> Source)
> org.bouncycastle.jce.provider.BouncyCastleProvider$1.run(Unknown Source)
> java.security.AccessController.doPrivileged(Native Method)
> org.bouncycastle.jce.provider.BouncyCastleProvider.(Unknown Source)
> org.apache.pdfbox.pdmodel.encryption.SecurityHandlerFactory.(
> SecurityHandlerFactory.java:44)
> org.apache.pdfbox.pdmodel.encryption.PDEncryption.
> (PDEncryption.java:96)
> org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:436)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:321)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:890)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:821)
> 
> I looked up where and how to put this grant stuff.  This sounds
> complicated.
> https://docs.oracle.com/javase/8/docs/technotes/guides/security/PolicyFiles.html#FileSyntax
> 
> 
> I'm running webstart.  This should work for my client and server?  I just
> have to update the server?  This goes in the
> jdk1.8.0_60\jre\lib\security\java.security
> file?  I have to manually put this here for each server, and do it again
> any time we install a new Java version?
> I put this in my jnlp file.  It apparently didn't help this issue.
> 
>  
> 
> Is there a way to make this work without manually editing files we'll have
> to worry about later?  This is just for an application, server and clients
> are all on local network.
I'm not an expert, but there are several possible reasons.

Did you sign the jars you are using? Starting with 1.7.0_45 (I hope I remember
the correct version) signing the jars is mandatory. And there are some other
restrictions for the JNLP file itself.

Do you use the pdfbox-app jar? This could be problematic, as the repacking of the
jar destroys the signature of the Bouncy Castle jar, which is needed for JNLP
usage.

> 
> Is there a way to call PDDocument.load without using BouncyCastle, or
> without installing it in java security?
The bouncy castle stuff is needed as long as your pdfs are using encrypted data
and according to the stack trace it looks like you do.

BR
Andreas


Re: Something weird with PDFMergerUtility?

2015-08-14 Thread Andreas Lehmkühler



 Magnus Evensberget magnus.evensber...@gmail.com wrote on 14 August 2015 at 10:21:
 
 
 Rolled back to the commit c343a3f and then it works.
We are working with svn and I guess you are referring to the PDFBox github
mirror, aren't you? That git commit refers to rev 1693855.

BR
Andreas

 
 On Fri, 14 Aug 2015 at 09:04 Magnus Evensberget 
 magnus.evensber...@gmail.com wrote:
 
  files are here: https://github.com/magnusev/PdfBoxMergeProblem
 
  I use the newest version of the 2.0.0-SNAPSHOT
 
  hope you can help :)
 
 
  On Thu, 13 Aug 2015 at 22:56 Tilman Hausherr thaush...@t-online.de
  wrote:
 
  Hi,
 
  Can you tell what version you are using? And can you upload these files
  somewhere?
 
  Tilman
 
  On 13.08.2015 at 21:10, Magnus Evensberget wrote:
   I have the following example:
  
   PDDocument document = new PDDocument();
   PDDocument d = PDDocument.load(documentBytes.get());
  
   PDFMergerUtility merger = new PDFMergerUtility();
   merger.appendDocument(document, d);
   merger.appendDocument(document, d);
  
   document.save("c:/test/blankDocWithFields.pdf");
  
   the document has one textfield
  
   The PDF saved has both pages, but the page content is gone, leaving only
   the textfields to write in on both pages.
  
   It worked before I went on Holiday but when I came back one week later
  it
   did not work.
  
 
 

Re: Last commit in SVN HEAD broke the PDFWriter

2015-07-27 Thread Andreas Lehmkühler



 Andreas Lehmkühler andr...@lehmi.de wrote on 27 July 2015 at 09:37:
 
 
 Hi Roberto,
 
  Roberto Nibali rnib...@gmail.com wrote on 27 July 2015 at 09:28:
  
  
  Dear developers
  
  The last commit 1692730 by lehmi, 18:36, broke the PDFWriter. The result
  is: The file xx.pdf cannot be open; It may be damaged or use a file format
  that Preview doesn’t recognize. The same when using Acrobat Professional.
  
  Reverting the commit to the old one makes everything work again. What is
  the reason for this change?
  
  // use previous startXref value as new PREV value
  trailer.setLong(COSName.PREV, doc.getStartXref());
  //trailer.removeItem(COSName.PREV);
  
  The trailer.removeItem(COSName.PREV) works, the new
  trailer.setLong(COSName.PREV, doc.getStartXref()) code does not.
 Thanks for your hint.
 
 Can you be a little bit more specific, please? What kind of pdf did you try to
 sign? Can you provide us with a sample pdf?

I just had another idea. Can you please check the following change

if (trailer.getItem(COSName.PREV != null))
{
trailer.setLong(COSName.PREV, doc.getStartXref())
}

TIA
Andreas

 
  Best regards
  
  Roberto
 
 BR
 Andreas
 

Re: Last commit in SVN HEAD broke the PDFWriter

2015-07-27 Thread Andreas Lehmkühler

Hi Roberto,

 Roberto Nibali rnib...@gmail.com wrote on 27 July 2015 at 09:28:
 
 
 Dear developers
 
 The last commit 1692730 by lehmi, 18:36, broke the PDFWriter. The result
 is: The file xx.pdf cannot be open; It may be damaged or use a file format
 that Preview doesn’t recognize. The same when using Acrobat Professional.
 
 Reverting the commit to the old one makes everything work again. What is
 the reason for this change?
 
 // use previous startXref value as new PREV value
 trailer.setLong(COSName.PREV, doc.getStartXref());
 //trailer.removeItem(COSName.PREV);
 
 The trailer.removeItem(COSName.PREV) works, the new
 trailer.setLong(COSName.PREV, doc.getStartXref()) code does not.
Thanks for your hint.

Can you be a little bit more specific, please? What kind of pdf did you try to
sign? Can you provide us with a sample pdf?

 Best regards
 
 Roberto

BR
Andreas


Re: Last commit in SVN HEAD broke the PDFWriter

2015-07-27 Thread Andreas Lehmkühler



 Roberto Nibali rnib...@gmail.com wrote on 27 July 2015 at 10:29:
 
 
 Hi Andreas
 
 Thanks for the quick reply.
 
 On Mon, Jul 27, 2015 at 9:55 AM, Andreas Lehmkühler andr...@lehmi.de
 wrote:
 
 
 
   Andreas Lehmkühler andr...@lehmi.de wrote on 27 July 2015 at 09:37:
  
  
   Hi Roberto,
  
Roberto Nibali rnib...@gmail.com wrote on 27 July 2015 at 09:28:
   
   
Dear developers
   
The last commit 1692730 by lehmi, 18:36, broke the PDFWriter. The
  result
is: The file xx.pdf cannot be open; It may be damaged or use a file
  format
that Preview doesn’t recognize. The same when using Acrobat
  Professional.
   
Reverting the commit to the old one makes everything work again. What
  is
the reason for this change?
   
// use previous startXref value as new PREV value
trailer.setLong(COSName.PREV, doc.getStartXref());
//trailer.removeItem(COSName.PREV);
   
The trailer.removeItem(COSName.PREV) works, the new
trailer.setLong(COSName.PREV, doc.getStartXref()) code does not.
   Thanks for your hint.
  
   Can you be a little bit more specific, please? What kind of pdf did you
  try to
   sign? Can you provide us with a sample pdf?
 
 
 As far as I can tell, I'm not trying to sign any document. I'm working on a
 tool that migrates form fields from a source document to a new template
 document (containing the same fields, however some CI/CD changes), and
 subsequently saves the document as a new PDF with suffix -migrated.
 
 In fact, I can trigger this behaviour with the following simple code (which
 does nothing else than open the source PDF, the template PDF, removes the
 security, sets the need for auto-generated appearances, and saves the
 template into a new PDF):
 
 private static PDDocument srcDoc;
 private static PDDocument tplDoc;
 
 @Test
 public static void SimpleTest() throws IOException {
     String ownerPassword = "limitedHappiness";
     srcDoc = PDDocument.load(new File("./ccalt.pdf"), ownerPassword);
     tplDoc = PDDocument.load(new File("./cctemp.pdf"), ownerPassword);
     tplDoc.setAllSecurityToBeRemoved(true);
     srcDoc.close();
     tplDoc.getDocumentCatalog().getAcroForm().setNeedAppearances(true);
     tplDoc.save("ccmig.pdf");
     tplDoc.close();
 }
 
  Due to signed NDAs, I cannot send you the PDF. If we can't solve the
 issue, I'll try to generate a stripped down version for you, however that's
 going to take a day or so, and I have other pending issues which I'd like
 to address first, as the deadline for the final delivery of the tool is now
 definitely coming up.
OK, I see you are not using the incremental update feature. The name of the
modified method, doWriteXRefInc, suggests that it is only used for incremental
updates, but it isn't.

 
  I just had another idea. Can you please check the following change
 
  if (trailer.getItem(COSName.PREV != null))
  {
  trailer.setLong(COSName.PREV, doc.getStartXref())
  }
 
 
 I modified to the following code, so it compiles:
 
 COSDictionary trailer = doc.getTrailer();
 // use previous startXref value as new PREV value
 //trailer.setLong(COSName.PREV, doc.getStartXref());
 //trailer.removeItem(COSName.PREV);
 if (trailer.getItem(COSName.PREV) != null)
 {
 trailer.setLong(COSName.PREV, doc.getStartXref());
 }
 
 No change, it still breaks my simple test case.
I'm not at home, so I can't check that myself, but I guess the following
should do the trick:
if (incrementalUpdate)
{
    trailer.setLong(COSName.PREV, doc.getStartXref());
}

 
 On a side note (not understanding anything about PDFBox internals): Your
 change seems pretty invasive just from an outsider's perspective
 interpreting the method's name. Before your change, you basically seem to
 have removed the COS entry PREV, after your change, you set it to the
 position of the xref section. I'm sure you know what you're doing, it just
 does not look minimally invasive in what I would call the hotpath of
 PDFWriter ;).
 
 Cheers
 Roberto

BR
Andreas


Re: How to configure Maven POM to include latest SNAPSHOT of PDFbox

2015-07-07 Thread Andreas Lehmkühler

Hi,


 Roberto Nibali rnib...@gmail.com wrote on 7 July 2015 at 11:43:
 
 
 Hi
 
 How do I properly set the dependencies in my Maven POM, so I can use the
 latest SNAPSHOT of pdfbox?
 
 I tried the following (https://pdfbox.apache.org/2.0/getting-started.html),
 which does not work at all:
 
 To use the latest 2.0 snapshot release from the SVN trunk, you'll need to
 add the following dependency:
 
 <dependency>
   <groupId>org.apache.pdfbox</groupId>
   <artifactId>pdfbox-app</artifactId>
   <version>2.0.0-SNAPSHOT</version>
 </dependency>
 
 You'll also need to add the following repository:
 
 <repository>
   <id>ApacheSnapshot</id>
   <name>Apache Repository</name>
   <url>https://repository.apache.org/content/groups/snapshots/</url>
   <snapshots>
     <enabled>true</enabled>
   </snapshots>
 </repository>
 
 And why did the project change its name from pdfbox to pdfbox-app?
That was wrong. It has to be pdfbox. I've already fixed that.

 Then I tried to download all the 2.0.0-SNAPSHOT JARS I could find at:
 
 https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/
 
 Adding them manually to the POM didn't work:
 
 <dependency>
   <groupId>org.apache.pdfbox</groupId>
   <artifactId>pdfbox</artifactId>
   <version>2.0.0-SNAPSHOT</version>
   <scope>system</scope>
   <systemPath>${project.basedir}/extLib/pdfbox-2.0.0-20150707.080520-1509.jar</systemPath>
 </dependency>
 
 Any ideas? Do I need the pdfbox or pdfbox-app artifact? How do I add
 potential dependencies like fontbox or jempbox (which hasn't been updated
 since 2014)?
The pdfbox-dependency should be sufficient.

 Also, it seems that the following packages do not form part of pdfbox
 anymore:
 
 import org.apache.pdfbox.exceptions.CryptographyException;
 import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
 import org.apache.pdfbox.util.PDFOperator;
2.0.0 contains improvements, fixes and refactorings which lead to changes in
the API. The new version isn't binary compatible with 1.8.x.

 Or this could be the result of my inability to properly set the POM
 dependencies.
 
 Best regards
 Roberto

BR
Andreas


Re: PDFRenderer, PDDocument memory issue

2015-07-02 Thread Andreas Lehmkühler



 John Hewson j...@jahewson.com wrote on 2 July 2015 at 06:10:
 
 
 
  On 1 Jul 2015, at 07:52, Tilman Hausherr thaush...@t-online.de wrote:
  
  On 01.07.2015 at 10:16, Alex Sviridov wrote:
  In my application I have real time memory graphs and they show that memory
  is very fast filled.
  When there is no more free memory getPageThumbImage hangs - no exception,
  nothing. But the code stops.
  When I do pdfDocument=null,pdfRenderer=null I get about 400mb free memory.
  How to solve this problem?
  
  If you're building from source, try this: in PDImageXObject.java, remove the
  line cachedImage = image;. This will consume less space if you have large
  PDFs with many images.
 
 We don't retain XObjects across pages (anymore), so that shouldn't be the
 cause of his gradual memory increase?
IMHO, it's quite simple to explain. During the initial parse all streams are
read and all the data is stored in COSStream (see COSParser#parseCOSStream).
That isn't new behaviour, and I'm working on a better solution (it's my last
TODO in PDFBOX-2301).

  Tilman
  
  
  

BR
Andreas


Re: Re[8]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler



 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:59:
 
 
  Ok. Thank you very much for explanation. Could you say where this scratch
 file is located linux/windows?
java.io.File.createTempFile is used to create that file. It uses the default
temp directory, which is /tmp on Linux. I'm not sure about Windows, as different
environment variables (TMP, TEMP, USERPROFILE, ...) are used to search for such
a directory.

You may define your own temp directory using the following parameter when
starting your application

-Djava.io.tmpdir=PATH-TO-YOUR-TEMP
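
If you are unsure which directory is actually used on your machine, a small JDK-only check like the following may help. It is only a sketch: the class name TempDirCheck and the method defaultTempDir are made up for this example and have nothing to do with PDFBox itself.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of PDFBox: reports where temp files end up.
class TempDirCheck {

    /** Returns the directory in which File.createTempFile places new files. */
    static File defaultTempDir() throws IOException {
        // Create a throwaway file the same way a scratch file would be created.
        File probe = File.createTempFile("probe", ".tmp");
        try {
            return probe.getParentFile();
        } finally {
            probe.delete(); // clean up the probe file again
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("java.io.tmpdir   = " + System.getProperty("java.io.tmpdir"));
        System.out.println("temp files go to = " + defaultTempDir());
    }
}
```

Running it once without and once with -Djava.io.tmpdir=PATH-TO-YOUR-TEMP should show whether the parameter is picked up.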


 
 
 Wednesday, 1 July 2015, 13:54 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
  Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:38:
  
  
   The file is here  https://yadi.sk/i/Y0fTuvHmhbZiE
 Ah, that explains a lot. The pdf is a scanned document, every page holds a
 color
 image, consuming a lot of memory when processed
 
  I tried with load (fileName,true). The result - now I don't have memory
  problems. However now I have 2 problems:
 
  1) All the thumbnail images are loaded. However, the speed is VERY SLOW.
  One
  thumbnail image is loaded about 4 seconds! 
 If it comes to huge pdfs, you have to die one death. Either you provide
 enough
 memory to do all the stuff in memory (fast) or you use a scratch file to save
 memory (slow)
 
 And yes, there is room for an improvement of the memory handling (read on
 demand, remove after usage) in PDFBox, but that is some future feature.
 Patches
 are welcome.
 
  2) Besides, as you see thumbnail images are loaded in separate thread.
  While
  this thread is running and I try to
  get big image for main content using   BufferedImage
  bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB); I get the
  following exception:
  
  java.io.IOException: java.util.zip.DataFormatException: unknown compression
  method
      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
      at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
      at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
      at
  org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
      at
  org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
      at org.apache.pdfbox.pdfparser.BaseParser.init(BaseParser.java:146)
      at
  org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:78)
      at
  org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
      at
  org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
      at
  org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
      at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
      at
  org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
      at
  org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
      at
  org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
    
      at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.lang.Thread.run(Thread.java:745)
  Caused by: java.util.zip.DataFormatException: unknown compression method
      at java.util.zip.Inflater.inflateBytes(Native Method)
      at java.util.zip.Inflater.inflate(Inflater.java:259)
      at java.util.zip.Inflater.inflate(Inflater.java:280)
      at
  org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
      at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
      ... 20 more
  
  How to solve these problems?
 PDFBox isn't supposed to be thread safe.
 
  
  
  Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
  
   Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:
   
   
I decided to show all the code. I also send the pdf file - some file
   from
   internet I use for testing.
  The attachment didn't make it due to some restrictions to the mailing
  list.
  Please post a link to the origin source or another place where we can
  download
  the pdf in question.
  
   
   Task task = new Task() {
       @Override protected Integer call() throws Exception {
       for (int i=0;imodel.getTotalPages();i++){
       System.out.println(Point a:+i);
       WritableImage writableImage=model.getPageThumbImage(i);
       System.out.println(Point b:+i);
       ImageView imageView=new ImageView(writableImage);
       System.out.println(Point c:+i);
       Label label=new Label(Integer.toString(i+1));
       System.out.println(Point d:+i);
       VBox vBox=new VBox(imageView,label);
       System.out.println(Point e:+i);
       vBox.setAlignment

Re: Re[6]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler

 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:38:
 
 
  The file is here  https://yadi.sk/i/Y0fTuvHmhbZiE
Ah, that explains a lot. The PDF is a scanned document; every page holds a color
image, which consumes a lot of memory when processed.

 I tried with load (fileName,true). The result - now I don't have memory
 problems. However now I have 2 problems:

 1) All the thumbnail images are loaded. However, the speed is VERY SLOW. One
 thumbnail image is loaded about 4 seconds! 
When it comes to huge PDFs, you have to pick your poison: either you provide
enough memory to do everything in memory (fast), or you use a scratch file to
save memory (slow).

And yes, there is room for improvement in the memory handling (read on
demand, remove after usage) in PDFBox, but that is a future feature. Patches
are welcome.

 2) Besides, as you see thumbnail images are loaded in separate thread. While
 this thread is running and I try to
 get big image for main content using   BufferedImage
 bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB); I get the
 following exception:
 
 java.io.IOException: java.util.zip.DataFormatException: unknown compression
 method
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
     at
 org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
     at
 org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
     at org.apache.pdfbox.pdfparser.BaseParser.init(BaseParser.java:146)
     at
 org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:78)
     at
 org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
     at
 org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
     at
 org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
     at
 org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
     at
 org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
     at
 org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
   
     at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.lang.Thread.run(Thread.java:745)
 Caused by: java.util.zip.DataFormatException: unknown compression method
     at java.util.zip.Inflater.inflateBytes(Native Method)
     at java.util.zip.Inflater.inflate(Inflater.java:259)
     at java.util.zip.Inflater.inflate(Inflater.java:280)
     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
     ... 20 more
 
 How to solve these problems?
PDFBox isn't supposed to be thread safe.
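
One common way to live with that is to make sure that only one thread at a time touches a given document and its renderer, for example by sending all rendering calls through a single worker thread. The following is a minimal JDK-only sketch of that idea. SerializedRenderer is a hypothetical helper name, not a PDFBox class, and whether this fits your application is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of PDFBox: runs all submitted jobs on one thread.
class SerializedRenderer implements AutoCloseable {

    // A single worker thread, so two jobs can never run concurrently.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a job; jobs are executed one at a time, in submission order. */
    public <T> Future<T> submit(Callable<T> job) {
        return worker.submit(job);
    }

    /** Stops accepting new jobs; already queued jobs still finish. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

In the application above, both the thumbnail loop and the full-size rendering would then go through the same instance, e.g. submit(() -> pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB)), and wait for the returned Future instead of calling the renderer directly from two threads.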

 
 
 Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
 
  Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:
  
  
   I decided to show all the code. I also send the pdf file - some file from
  internet I use for testing.
 The attachment didn't make it due to some restrictions to the mailing list.
 Please post a link to the origin source or another place where we can
 download
 the pdf in question.
 
  
  Task task = new Task() {
      @Override protected Integer call() throws Exception {
      for (int i=0;imodel.getTotalPages();i++){
      System.out.println(Point a:+i);
      WritableImage writableImage=model.getPageThumbImage(i);
      System.out.println(Point b:+i);
      ImageView imageView=new ImageView(writableImage);
      System.out.println(Point c:+i);
      Label label=new Label(Integer.toString(i+1));
      System.out.println(Point d:+i);
      VBox vBox=new VBox(imageView,label);
      System.out.println(Point e:+i);
      vBox.setAlignment(Pos.CENTER);
      vBox.setStyle(-fx-padding:5px 5px 5px
  5px;-fx-background-color:red);
      System.out.println(Point f:+i);
      Platform.runLater(new Runnable() {
      @Override
      public void run() {
   thumbFlowPane.getChildren().add(vBox);
      }
      });
      }
      return null;
      }
  };
  new Thread(task).start();
  
  And here is the tail of the output
  
  Point a:30
  Point b:30
  Point c:30
  Point d:30
  Point e:30
  Point f:30
  Point a:31
  
  What is scratch file? Sorry, I don't understand you.
 
 PDFBox holds a lot of temporary data in the memory. To reduce the memory
 footprint one can choose to use a scratch file instead, so that some/most

Re: Re[10]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler

 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 14:15:

 Ok. Thank you again. I just don't understand one thing. What is the reason to
 keep so much data if I only need to take page images and, most importantly, I
 DO IT PAGE BY PAGE?
PDFBox doesn't know that you are doing it page by page.

 Is there no way not to keep data for previous pages if I need only data for
 page N?
As I said, we don't have a read-on-demand mechanism yet. It is on our agenda, but
it will take a while: the PDF format isn't easy to work with, so the code that
has to be extended is fairly complex.

Wednesday, 1 July 2015, 14:08 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:59:

Ok. Thank you very much for explanation. Could you say where this scratch
file is located linux/windows?
java.io.File.createTempFile is used to create that file. It uses the default
temp directory. It's /tmp on linux. I'm not sure for windows as different
environment variables (TMP, TEMP, USERPROFILE, ) are used to search for
such
a directory.

You may define your own temp directory using the following parameter when
starting your application

-Djava.io.tmpdir=PATH-TO-YOUR-TEMP

Wednesday, 1 July 2015, 13:54 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:38:

The file is here https://yadi.sk/i/Y0fTuvHmhbZiE
Ah, that explains a lot. The pdf is a scanned document, every page holds a
color
image, consuming a lot of memory when processed

I tried with load (fileName,true). The result - now I don't have memory
problems. However now I have 2 problems:

1) All the thumbnail images are loaded. However, the speed is VERY SLOW.
One
thumbnail image is loaded about 4 seconds!
If it comes to huge pdfs, you have to die one death. Either you provide
enough
memory to do all the stuff in memory (fast) or you use a scratch file to
save
memory (slow)

And yes, there is room for an improvement of the memory handling (read on
demand, remove after usage) in PDFBox, but that is some future feature.
Patches
are welcome.

2) Besides, as you see thumbnail images are loaded in separate thread.
While
this thread is running and I try to
get big image for main content using BufferedImage
bi=pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB); I get the
following exception:

java.io.IOException: java.util.zip.DataFormatException: unknown
compression
method
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
at
org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
at
org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
at
org.apache.pdfbox.pdfparser.BaseParser.init(BaseParser.java:146)
at
org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:78)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
at
org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
at
org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
at
org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
at
org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
at
org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.zip.DataFormatException: unknown compression method
at java.util.zip.Inflater.inflateBytes(Native Method)
at java.util.zip.Inflater.inflate(Inflater.java:259)
at java.util.zip.Inflater.inflate(Inflater.java:280)
at
org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)
... 20 more

How to solve these problems?
PDFBox isn't supposed to be thread safe.

Wednesday, 1 July 2015, 13:17 +02:00 from Andreas Lehmkühler andr...@lehmi.de:

Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:

I decided to show all the code. I also send the pdf file - some file
from
internet I use for testing.
The attachment didn't make

Re: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler



 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:
 
 
  I want to display all page thumbnails. However I came across memory size
 problem with PDFRenderer or PDDocument - I don't know which one. 
 
 I have the following code:
    
     private PDDocument pdfDocument;
 
     private PDFRenderer pdfRenderer;
 
     public WritableImage getPageThumbImage(int page){
         WritableImage result=null;
         try {
             BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12, ImageType.RGB);
             result=SwingFXUtils.toFXImage(bi, null);
         } catch (IOException ex) {
 
         }
         return result;
     }
  .
 The method getPageThumbImage I run in for loop for every page.I set java
 memory heap to 500mb. 
 And I can get about 30 images using getPageThumbImage (if I set more memory I
 get more). 
 In my application I have real time memory graphs and they show that memory is
 very fast filled. 
 When there is no more free memory getPageThumbImage hangs - no exception,
 nothing. But the code stops.
 When I do pdfDocument=null,pdfRenderer=null I get about 400mb free memory. How
 to solve this problem?
There are two possible issues, and maybe both are relevant.

1. PDFBox consumes more or less memory to load a PDF, depending on the size and
the content of the PDF.

- Are you using the latest 2.0.0-SNAPSHOT? There were some improvements
concerning the memory footprint lately.
- Try using a scratch file (there are load methods with a boolean
switch to activate that).

2. Your own implementation consumes more or less memory to process those
thumbnails.

- Check whether you are releasing all the resources (especially the images
you're creating) that you use during your process.

HTH,
Andreas


Re: Re[4]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler



 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 13:09:
 
 
  I decided to show all the code. I also send the pdf file - some file from
 internet I use for testing.
The attachment didn't make it through because of mailing list restrictions.
Please post a link to the original source or another place where we can download
the PDF in question.

 
 Task<Integer> task = new Task<Integer>() {
     @Override protected Integer call() throws Exception {
         for (int i=0;i<model.getTotalPages();i++){
             System.out.println("Point a:"+i);
             WritableImage writableImage=model.getPageThumbImage(i);
             System.out.println("Point b:"+i);
             ImageView imageView=new ImageView(writableImage);
             System.out.println("Point c:"+i);
             Label label=new Label(Integer.toString(i+1));
             System.out.println("Point d:"+i);
             VBox vBox=new VBox(imageView,label);
             System.out.println("Point e:"+i);
             vBox.setAlignment(Pos.CENTER);
             vBox.setStyle("-fx-padding:5px 5px 5px 5px;-fx-background-color:red");
             System.out.println("Point f:"+i);
             Platform.runLater(new Runnable() {
                 @Override
                 public void run() {
                     thumbFlowPane.getChildren().add(vBox);
                 }
             });
         }
         return null;
     }
 };
 new Thread(task).start();
 
 And here is the tail of the output
 
 Point a:30
 Point b:30
 Point c:30
 Point d:30
 Point e:30
 Point f:30
 Point a:31
 
 What is scratch file? Sorry, I don't understand you.

PDFBox holds a lot of temporary data in memory. To reduce the memory
footprint, one can choose to use a scratch file instead, so that some or most of
that data is held in a file.

To do so, simply use another load method, e.g.

load(File file, boolean useScratchFiles)
 
 
 
 
 
 
 Wednesday, 1 July 2015, 13:04 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
 
  Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 12:58:
  
  
   Thank you for answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar
  the
  result is the same.
  
  When I create images I add them to javafx FlowPane. However, the problem is
  not in images because I repeat - I get 400mb when I do
  pdfDocument=null,pdfRenderer=null.
  
  Bedised, when I do pdfDocument = PDDocument.load(new File(fileName)) I
  don't
  have any problems with memory. 
  
  I'm getting problem with memory when I run in for loop getPageThumbImage.
  
  I am sure that the problem is in PdfBox. Please, help me.
 Maybe, but I'm not sure at all.
 
 Try to use the scratch file.
 
  Wednesday, 1 July 2015, 12:48 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
  
   Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:
   
   
I want to display all page thumbnails. However I came across memory
   size
   problem with PDFRenderer or PDDocument - I don't know which one. 
   
   I have the following code:
      
       private PDDocument pdfDocument;
       
       private PDFRenderer pdfRenderer;
   
       public WritableImage getPageThumbImage(int page){
       WritableImage result=null;
       try {
       BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12,
   ImageType.RGB);
       result=SwingFXUtils.toFXImage(bi, null);
       } catch (IOException ex) {
    
       }
       return result;
       }
    .
   The method getPageThumbImage I run in for loop for every page.I set java
   memory heap to 500mb. 
   And I can get about 30 images using getPageThumbImage (if I set more
   memory
   I
   get more). 
   In my application I have real time memory graphs and they show that
   memory
   is
   very fast filled. 
   When there is no more free memory getPageThumbImage hangs - no
   exception,
   nothing. But the code stops.
   When I do pdfDocument=null,pdfRenderer=null I get about 400mb free
   memory.
   How
   to solve this problem?
  There are 2 possible issues and maybe both are relevant.
  
  1. PDFBox consumes more or less memory to load a pdf depending on the size
  and
  the content of the pdf.
  
  - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements
  concerning the memory footprint lately
  - Try to use of a scratch file (there are load methods including a boolean
  switcht ot activate that)
  
  2. Your own implementation consumes more or less memory to process those
  thumbnails
  
  - check if you are releasing all resources (ecspecially those images
  you're
  creating) you are using during your process
  
  HTH,
  Andreas
  
  
  
  
  -- 
  Alex Sviridov
 
 BR
 Andreas

Re: Re[2]: PDFRenderer, PDDocument memory issue

2015-07-01 Thread Andreas Lehmkühler



 Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 12:58:
 
 
 Thank you for the answer. I tried pdfbox-app-2.0.0-20150630.220424-1464.jar; the
 result is the same.
 
 When I create the images I add them to a JavaFX FlowPane. However, the problem is
 not in the images because, I repeat, I get 400 MB back when I do
 pdfDocument=null,pdfRenderer=null.
 
 Besides, when I do pdfDocument = PDDocument.load(new File(fileName)) I don't
 have any problems with memory.
 
 The memory problem appears when I run getPageThumbImage in a for loop.
 
 I am sure that the problem is in PDFBox. Please help me.
Maybe, but I'm not sure at all.

Try to use the scratch file.

 Wednesday, 1 July 2015, 12:48 +02:00 from Andreas Lehmkühler andr...@lehmi.de:
 
  Alex Sviridov ooo_satu...@mail.ru wrote on 1 July 2015 at 10:16:
  
  
   I want to display all page thumbnails. However I came across memory size
  problem with PDFRenderer or PDDocument - I don't know which one. 
  
  I have the following code:
     
      private PDDocument pdfDocument;
      
      private PDFRenderer pdfRenderer;
  
      public WritableImage getPageThumbImage(int page){
      WritableImage result=null;
      try {
      BufferedImage bi=pdfRenderer.renderImageWithDPI(page, 12,
  ImageType.RGB);
      result=SwingFXUtils.toFXImage(bi, null);
      } catch (IOException ex) {
   
      }
      return result;
      }
   .
  The method getPageThumbImage I run in for loop for every page.I set java
  memory heap to 500mb. 
  And I can get about 30 images using getPageThumbImage (if I set more memory
  I
  get more). 
  In my application I have real time memory graphs and they show that memory
  is
  very fast filled. 
  When there is no more free memory getPageThumbImage hangs - no exception,
  nothing. But the code stops.
  When I do pdfDocument=null,pdfRenderer=null I get about 400mb free memory.
  How
  to solve this problem?
 There are 2 possible issues and maybe both are relevant.
 
 1. PDFBox consumes more or less memory to load a pdf depending on the size
 and
 the content of the pdf.
 
 - Are you using the latest 2.0.0-SNAPSHOT? There were some improvements
 concerning the memory footprint lately
 - Try to use of a scratch file (there are load methods including a boolean
 switcht ot activate that)
 
 2. Your own implementation consumes more or less memory to process those
 thumbnails
 
 - check if you are releasing all resources (ecspecially those images you're
 creating) you are using during your process
 
 HTH,
 Andreas
 
 
 
 
 -- 
 Alex Sviridov

BR
Andreas


Re: Scratch files - too many files open

2015-06-05 Thread Andreas Lehmkühler

Hi,

 Jesse Long jesse.long...@gmail.com wrote on 3 June 2015 at 13:20:
 
 
 On 03/06/2015 12:46, Andreas Lehmkühler wrote:
  Hi,
 
  Jesse Long jesse.long...@gmail.com wrote on 3 June 2015 at 08:45:
 
 
  On 02/06/2015 17:48, Andreas Lehmkuehler wrote:
  Hi,
 
  On 02.06.2015 at 16:15, Jesse Long wrote:
  Hi All,
 
  Regarding PDFBOX-2301, and the use of scratch files: right now, each
  COSStream
  uses one or two scratch files.
 
  I recently ran into the problem on Linux where the max number of open
  files
  allowed to the JVM by the OS was reached because of this.
 
  Is there a plan around this?
 
  Is it maybe that my use case is not expected?
  I'm aware of that. The refactoring is still in progress. I expect to
  reduce the number of open files.
 
  My use case is:
  Open PDDocument 1
  Open PDDocument 2
  for a few hundred times
import page 1 of PDDocument 1 into PDDocument 2 and overlay
  some stuff
  ontop.
  save PDDocument 2.
 
  I have written a patch to use one single java.io.RandomAccessFile as
  a scratch
  file per COSDocument, using pages in a doubly linked list to separate
  streams in
  the same file. Would you be interested in adding this to PDFBox?
  To use one file only led to problems when creating pdfs from scratch.
  It is possible to write to 2 COSStreams at the same time which
  corrupts pdf.
  Hi Andreas,
 
  Do you mean at the same time, as in multiple threads, or single thread
  writing a bit to this stream and then a bit to another stream back and
  forth?
  It's about the second case. You can't add fonts and/or images to a page
  while adding content to a content stream at the same time. You have to add
  those before opening a stream, or you have to close the stream before
 
  For the single thread use case, I have solved this in my patch.
  Actually, even multiple thread should be easy to support with
  synchronization. I'll work on some docs and submit and you can see if
  you like it.
  At least it sounds interesting and I'm happy to look at it.
 
 
 Please see patch attached.
I've attached your patch to PDFBOX-2301 so that it can't get lost.
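
For readers who want a picture of the idea discussed above, here is a rough
sketch of a single scratch file shared by several logical streams. It is not
Jesse's patch and not PDFBox code: the names PagedScratchFile and ScratchStream
are hypothetical, and a plain list of page indices stands in for the doubly
linked list he mentions. Each stream claims fixed-size pages from one
java.io.RandomAccessFile, so interleaved writes to two streams don't overwrite
each other:

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one scratch file, split into fixed-size pages. */
final class PagedScratchFile implements AutoCloseable {
    static final int PAGE_SIZE = 16; // tiny on purpose, so small inputs span several pages

    private final File file;
    private final RandomAccessFile raf;
    private int pageCount = 0;

    PagedScratchFile() throws IOException {
        file = File.createTempFile("scratch", ".tmp");
        raf = new RandomAccessFile(file, "rw");
    }

    /** Hands out a new logical stream; all streams share the single file handle. */
    ScratchStream createStream() {
        return new ScratchStream();
    }

    @Override
    public void close() throws IOException {
        raf.close();
        file.delete();
    }

    /** A logical stream: an ordered list of page indices plus the stream length. */
    final class ScratchStream {
        private final List<Integer> pages = new ArrayList<>();
        private long length = 0;

        void write(byte[] data) throws IOException {
            synchronized (PagedScratchFile.this) {
                int written = 0;
                while (written < data.length) {
                    int offsetInPage = (int) (length % PAGE_SIZE);
                    if (offsetInPage == 0) {
                        pages.add(pageCount++); // no page yet or last page full: claim a new one
                    }
                    int lastPage = pages.get(pages.size() - 1);
                    int chunk = Math.min(PAGE_SIZE - offsetInPage, data.length - written);
                    raf.seek((long) lastPage * PAGE_SIZE + offsetInPage);
                    raf.write(data, written, chunk);
                    written += chunk;
                    length += chunk;
                }
            }
        }

        byte[] readAll() throws IOException {
            synchronized (PagedScratchFile.this) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[PAGE_SIZE];
                long remaining = length;
                for (int page : pages) {
                    int chunk = (int) Math.min(PAGE_SIZE, remaining);
                    raf.seek((long) page * PAGE_SIZE);
                    raf.readFully(buffer, 0, chunk);
                    out.write(buffer, 0, chunk);
                    remaining -= chunk;
                }
                return out.toByteArray();
            }
        }
    }
}
```

Only one file descriptor is open no matter how many streams exist, which is the
point with regard to the "too many files open" limit. Freed pages are not
reused in this sketch; a real implementation would need a free list as well.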

 
 Thanks,
 Jesse

BR
Andreas


Re: IllegalArgumentException when using PDType1Font.HELVETICA

2015-05-27 Thread Andreas Lehmkühler

Hi,

 Johanneke Lamberink johanneke.lamber...@onior.com wrote on 27 May 2015 at 10:52:
 
 
 Hi,
 
 When writing a given String to a PDF I am encountering the following
 stacktraces in the logging:
 
 
 Caused by: java.lang.IllegalArgumentException: No glyph for U+000A in font Helvetica
     at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:320)
     at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:282)
     at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:358)
 
 and:
 
 Caused by: java.lang.IllegalArgumentException: This font type only supports 8-bit code points
     at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:311)
     at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:282)
     at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:311)
 
 I am not sure if this is a problem in my choice of font, my use of the api, or
 the encoding done by pdfbox.
 
 Can anyone explain to me what it is that is going wrong here?
U+000A sounds like line feed. Does your string contain any newline characters
like CR or LF? You have to remove those, as you have to manage line breaks
yourself.
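
A small sketch of what "manage line breaks yourself" could look like on the
input side. TextLines is a hypothetical helper, not a PDFBox class: it splits
the text at CR/LF and strips any remaining control characters, which are likely
to fail in the same way as U+000A.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, only for illustration. */
final class TextLines {

    /** Splits text at CR, LF or CRLF and removes other control characters. */
    static List<String> split(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\r\n|\r|\n", -1)) {
            lines.add(line.replaceAll("\\p{Cntrl}", ""));
        }
        return lines;
    }
}
```

Each returned line can then be written with its own showText(line) call,
followed by newLineAtOffset(0, -leading) on the PDPageContentStream to move to
the next line; leading is a line height you choose yourself, e.g. 1.2 times the
font size.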

 Thanks :)
 
 
 Johanneke Lamberink

BR
Andreas Lehmkühler


Re: How to flatedecode and find all acroform fields in a compressed PDF

2015-05-22 Thread Andreas Lehmkühler

Hi,

 Balaji Venkatamohan bvenk...@tibco.com wrote on 20 May 2015 at 03:24:
 
 
 Thank you for your pointers and sorry about the image. I am attaching it
 with this email.
 
 The point I am trying to make is that the PDF, which was decompressed using
 WriteDecodedDoc, is smaller in size than the original PDF given to us by
 our customers.
 Also, the decompressed PDF generated by WriterDecodedDoc of PDFBox did not
 have any PDAcroform fields whereas the decompressed PDF given to us by the
 customers does contain Acroform fields. Hence I wanted to know how to
 properly decompress the PDF using pdfbox APIs. The reason why I was
 analyzing COSStream was to check if the decompression of the compressed PDF
 was happening correctly while using PDFBox APIs.
 I know it would have been difficult for you to help me without the actual
 PDFs. For that, I would like to thank you for your time and pointers.
Maybe it's worth trying to share the files visually with us. Open both files
(compressed and decompressed) with PDFDebugger [1], post a screenshot of both
somewhere (Dropbox etc.) and share the link with us. Maybe that could shed some
light on your issue.
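
If the goal is only to check whether the bytes of a single /FlateDecode stream
decode at all, that can be tried outside PDFBox with the JDK alone, since
FlateDecode data is zlib data. The following is a sketch under that assumption;
FlateCheck is a made-up name, deflate() exists only to produce test input, and
streams using a /Predictor in their decode parameters would need an extra step
that is not shown. Note that AcroForm fields are not stored in page content
streams, so this does not locate any fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper, only for illustration. */
final class FlateCheck {

    /** Inflates zlib-wrapped data such as the body of a /FlateDecode stream. */
    static byte[] inflate(byte[] compressed) throws IOException {
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Compresses data the same way, used here only to build sample input. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(raw);
        }
        return out.toByteArray();
    }
}
```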

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/1.8/commandline.html#pdfDebugger

 
 On Tue, May 19, 2015 at 2:57 PM, Tilman Hausherr thaush...@t-online.de
 wrote:
 
  Hi,
 
  The image doesn't appear in the mailing list.
 
  This is all very confusing... /acroform is in the document catalog. I
  don't see how the page content stream is related to it. The best is that
  you either go through the source code, or read the spec and then look at
  the pdf.
 
  To find out what's going on, you'd have to start from that /acroform entry
  and then compare the two files.
 
  It is really difficult to help you without the files. The cause could be a
  bug in pdfbox, or a malformed pdf...
 
  Some more ideas:
  - use loadNonSeq(file, null) instead of load(file)
  - try the unreleased 2.0 version, that one has some improvements in the
  acroform stuff. Note that the API is different.
  https://pdfbox.apache.org/download.cgi#scm
  https://pdfbox.apache.org/2.0/getting-started.html
 
  If you still need help, one possibility would be 1) post the smallest
  possible code that fails, and 2) post a small part of the raw PDF, i.e. the
  objects relevant to the field in your code.
 
 
  Tilman
 
 
  On 19.05.2015 at 23:03, Balaji Venkatamohan wrote:
 
  Moreover, for every page of the compressed PDF (there are 3 pages), I
  tried getting the COSStream for each of the page :
 
  PDPage firstPage = (PDPage) document.getDocumentCatalog().getAllPages().get(0);
  PDStream pdStream = firstPage.getContents();
  COSStream stream = pdStream.getStream();
 
  In the above code snippet, the object stream, when analyzed in debug
  mode, has the following:
 
 
  The line from the compressed PDF as opened with Notepad++ is :
 
  <</Filter/FlateDecode/Length 5675>>stream
 
  From this point on, using the COSStream object for every page, how can I
  decompress and find out the acroform fields given that the unFilteredStream
  object is null for COSStream?
  
 
  On Tue, May 19, 2015 at 1:38 PM, Balaji Venkatamohan bvenk...@tibco.com
  mailto:bvenk...@tibco.com wrote:
 
  Thank you for your response Tilman.
 
  I had previously tried using the WriteDecodedDoc for my compressed
  PDF and I tried to get the number of acro form fields present in
   the output file generated by WriteDecodedDoc. The API still could
  not find the acro form fields in the generated decompressed file.
   Also the decompressed file generated is 75 KB which is far less
  than the original decompressed file which I have (1.6 MB) though I
  could edit the acro form fields using acrobat reader.
 
  Thanks,
  Balaji
 
 
 
  On Tue, May 19, 2015 at 1:18 PM, Tilman Hausherr
  thaush...@t-online.de mailto:thaush...@t-online.de wrote:
 
  On 19.05.2015 at 21:35, Balaji Venkatamohan wrote:
 
  My question is: how do I flatedecode a PDF so that I can
  find all the
  acroform fields within it. Any help or pointers would be
  highly appreciated.
 
 
  You could try the WriteDecodedDoc option of the command line app
  https://pdfbox.apache.org/1.8/commandline.html#writeDecodeDoc
 
  Maybe you can have further ideas by comparing the two files
  with NOTEPAD++ however the two files might have their
  objects in different order.
 
  Tilman
 
 
 
 
 
 
 
 
 

Re: java source in PDFBox snapshot jars?

2015-04-22 Thread Andreas Lehmkühler

Hi,

 Thomas Chojecki i...@rayman2200.de wrote on 21 April 2015 at 18:19:
 
 
 Hi Andrew,
 this is more a maven style to not include the sources jar in each  
 snapshot. I think the main idea behind this is to help the  
 infrastructure to save storage. At the present time, it makes no sense  
 to spare space because it is not expensive, but maybe someone should  
 ask the apache infra first, before changing it.
IMHO, we don't need to do so. Those SNAPSHOTs are deleted on a regular basis,
so that only the most recent versions are available. Furthermore the PDFBox jars
are quite small compared to other projects, so that infra most likely won't have
any headache if we put those source jars into the repo as well.

I'll try to find out what we have to do to publish the sources as well, see
PDFBOX-2770

BR
Andreas Lehmkühler

 
 BR
 Thomas
 
 Quoting Andrew Munn and...@nmedia.net:
 
  Is it possible that the sources could start being included in the snapshot
  jars so debugging problems is easier?
 
 
 
 
 

Re: Blank page rendered with wrong xref start objid (batch 1.8)

2015-03-26 Thread Andreas Lehmkühler

Hi,

 jg...@e-nautia.com wrote on 25 March 2015 at 15:25:
 
 
 Hello,
 
 bug PDFBOX-2679 entitled Blank page rendered with wrong xref start 
 objid was recently fixed for branch 2.0.0 but this same issue is still 
 affecting NonSequentialParser v 1.8.8 as it is also rendering a blank 
 page with that kind of malformed pdfs (in our case these pdfs are 
 generated by some soho scanners!!).  Do you plan to fix this issue also 
 for branch 1.8 or at least open a jira?
No, we don't backport every fix from the trunk to 1.8 for different reasons.

If someone wants to do so, patches are welcome :-)

BR
Andreas Lehmkühler 

 thank you
 Jerome
 
 
 

Re: Text removal

2015-03-24 Thread Andreas Lehmkühler

Hi,

 a7med shre3y a7med.shr...@gmail.com wrote on 23 March 2015 at 15:03:
 
 
 Hi all,
 
 Currently I am facing a strange problem removing text from some PDFs.
 My program is able to find the text and remove it by calling the
 COSString.reset() method.
 The problem is, when I open the output PDF file, I still see the text but
 not selectable (I mean when I try to highlight it with the mouse to copy
 it, it's not selectable!). When print the content (tokens) of the output
 file, I DO NOT find the text at all!!
 
 I am currently stuck in the PDF specifications 1.5 and really running out
 of time.
 
 I'd so much appreciate any help or any idea on what's going on.
 
 Notes:
 1. I use PDFBox 1.7.1
1.7.1 is more than 2 years old (released in July 2012). I strongly recommend
using a more recent version, such as 1.8.8.

BR
Andreas Lehmkühler

 2. This problem does not occur with all PDFs, only some PDFs cause this
 problem.
 
 Thank you very much.
 a7mad


Re: Problem building the project with Eclipse and m2e

2015-03-17 Thread Andreas Lehmkühler



 Martin Schröder mar...@oneiros.de wrote on 16 March 2015 at 21:26:
 
 
 2015-03-16 20:27 GMT+01:00 Andreas Lehmkuehler andr...@lehmi.de:
  On 13.03.2015 at 13:16, Martin Schröder wrote:
  that gives a lot of projects as expected (after one has the right m2e
  connector for subversion (which is difficult with subclipse)).
 
  There are other alternatives. Check out the trunk/unpack the source zip and
  import it as existing maven project.
 
 I tried that first. Then I get one project. Is that better?
Nope, I guess something is still wrong.

Use File -> Import -> Maven -> Existing Maven Projects and choose the topmost
directory. Eclipse should propose to import several subprojects. Import them all
and that's it.

AFAIK the Maven import feature is only available if m2e is installed.

  IMHO, m2e is a crappy piece of software and I guess I'm not alone.
 
 Agreed. But what's the alternative? :-{
I don't know any.

 Best
Martin

BR
Andreas Lehmkühler


Re: Question about PDDocument.setVersion

2015-03-04 Thread Andreas Lehmkühler

Hi Andrea,

 Andrea Vacondio andrea.vacon...@gmail.com wrote on 4 March 2015 at 14:15:
 
 
 Hi, about 2.0.0-SNAPSHOT I was setting version on an existing document and
 I noticed the version was set on the Catalog but not in the header so I
 took a look at the code and I think there's something odd there (or I'm
 missing something).
 It first makes sure we are not downgrading the version and then we have the
 following code (see my comment):
 
 if (newVersion >= 1.4f)
 {
     getDocumentCatalog().setVersion(Float.toString(newVersion));
     // isn't this always false? We already know newVersion is greater...
     if (getDocument().getVersion() > newVersion)
     {
         getDocument().setVersion(newVersion);
     }
 }
 else
 {
     // versions < 1.4f have a version header only
     getDocument().setVersion(newVersion);
 }
 
 I'm not fully sure what the expected behaviour is, but I guess it's something
 like: if newVersion is less than 1.4, set the header only; otherwise set both
 the header and the catalog. So something like:
 
 if (newVersion >= 1.4f)
 {
     getDocumentCatalog().setVersion(Float.toString(newVersion));
 }
 getDocument().setVersion(newVersion);
 
 Am I missing something?
You're right, there is some room for improvement. I've already reopened the
related ticket PDFBOX-2099.
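
To make the behaviour Andrea proposes easier to reason about, here is a tiny
stand-alone model of it. This is not PDFBox code and not the fix tracked in
PDFBOX-2099; PdfVersionModel and its fields are hypothetical and only mimic a
header version plus an optional catalog /Version entry.

```java
/** Hypothetical model of a PDF's header and catalog version, for illustration only. */
final class PdfVersionModel {
    private float headerVersion;
    private Float catalogVersion; // null while the catalog has no /Version entry

    PdfVersionModel(float headerVersion) {
        this.headerVersion = headerVersion;
    }

    /** The effective version is the higher of header and catalog version. */
    float effectiveVersion() {
        if (catalogVersion == null) {
            return headerVersion;
        }
        return Math.max(headerVersion, catalogVersion.floatValue());
    }

    /** Proposed rule: never downgrade; catalog only from 1.4 on; header always. */
    void setVersion(float newVersion) {
        if (newVersion <= effectiveVersion()) {
            return; // same version or a downgrade: nothing to do
        }
        if (newVersion >= 1.4f) {
            catalogVersion = newVersion;
        }
        headerVersion = newVersion;
    }

    float headerVersion() {
        return headerVersion;
    }

    Float catalogVersion() {
        return catalogVersion;
    }
}
```

With this rule, going from 1.4 to 1.6 updates both places, while going from 1.2
to 1.3 touches the header only.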

Thanks for the pointer

BR
Andreas Lehmkühler


Re: PDFBox 2.0.0 and UTF8 chars

2015-03-02 Thread Andreas Lehmkühler

Hi

 Tilman Hausherr thaush...@t-online.de wrote on 1 March 2015 at 19:54:
 
 
 Heh heh, I wanted to make a similar comment, but then I saw the stack 
 trace showing that he did just that...
Oops, you are right. The stack trace doesn't belong to the listed code. So, most
likely there is an issue with that specific font: either a malformed font or a
fontbox issue.
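
As a side note on the second exception quoted below ("U+010D is not available
in this font's Encoding"): as far as I know the deprecated
PDTrueTypeFont.loadTTF() uses WinAnsiEncoding, which is based on Windows code
page 1252. Under that assumption, a quick way to see which characters of a
string cannot be shown with such a simple font is to ask the JDK's windows-1252
charset. WinAnsiCheck is a made-up helper, and the mapping between the charset
and the PDF encoding is only approximate.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, only for illustration. */
final class WinAnsiCheck {

    /** Lists the code points of text that windows-1252 cannot encode. */
    static List<String> unsupported(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        List<String> missing = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                missing.add(String.format("U+%04X", cp));
            }
        });
        return missing;
    }
}
```

Characters reported here need a font loaded as a Type0 font, as suggested
further down in the thread.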

BR
Andreas Lehmkühler
 
 Tilman
 
 On 01.03.2015 at 18:53, Andreas Lehmkuehler wrote:
  Hi,
 
  On 28.02.2015 at 11:52, Ivan Klaric wrote:
  Hello good PDFBox people,
 
  I am working on a pet project with PDFBox and I encountered what 
  seems to
  be an issue with UTF8 chars. If you take the following standard example:
 
   public static void main(String[] args) {
       try {
           PDDocument document = new PDDocument();
           PDPage page = new PDPage();
           document.addPage(page);
           PDFont font = PDTrueTypeFont.loadTTF(document, new File("res/Roboto-Regular.ttf"));
 
  Try to load the TTF font as a Type0 font
 
  PDFont font = PDType0Font.load(document, new File("res/Roboto-Regular.ttf"));
 
  BR
  Andreas Lehmkühler
 
           PDPageContentStream contentStream = new PDPageContentStream(document, page);
           contentStream.beginText();
           contentStream.setFont(font, 12);
           contentStream.moveTextPositionByAmount(100, 700);
           contentStream.drawString("Hello World čćžšđČĆŽŠĐ");
           contentStream.endText();
           contentStream.close();
           document.save("/tmp/HelloWorld.pdf");
           document.close();

       } catch (IOException e) {
           e.printStackTrace();
       }
   }
 
  (those weird characters in the drawString method are some pretty common
  croatian letters). This is what I get:
  java.io.IOException: Error: Could not find referenced cmap stream Identity-H
      at org.apache.fontbox.cmap.CMapParser.getExternalCMap(CMapParser.java:418)
      at org.apache.fontbox.cmap.CMapParser.parsePredefined(CMapParser.java:84)
      at org.apache.pdfbox.pdmodel.font.CMapManager.getPredefinedCMap(CMapManager.java:54)
      at org.apache.pdfbox.pdmodel.font.PDType0Font.readEncoding(PDType0Font.java:159)
      at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:119)
      at org.apache.pdfbox.pdmodel.font.PDType0Font.load(PDType0Font.java:59)
      at com.company.Main.main(Main.java:20)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:483)
      at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 
 
  Am I doing something wrong? I took the Roboto-Regular font here:
  http://www.fontsquirrel.com/fonts/roboto
 
  If I remove the weird Croatian characters, the error remains the same.
  However, if I use the PDTrueTypeFont.loadTTF() (which seems to be
  deprecated) the same thing works without the Croatian characters. If 
  I put
  the Croatian characters back in (and use PDTrueTypeFont), I get
 
  Exception in thread "main" java.lang.IllegalArgumentException: U+010D is not available in this font's Encoding
      at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:261)
      at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:268)
      at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:316)
      at org.apache.pdfbox.pdmodel.PDPageContentStream.drawString(PDPageContentStream.java:282)
      at com.company.Main.main(Main.java:25)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:483)
      at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 
  I manually looked into the font file and it seems to contain the U+010D
  character. What am I doing wrong here?
 
  Thanks,
  Ivan
 
 
 
 
 
 

Re: [PDFBOX-2.0] PDF Size after Signature

2015-02-27 Thread Andreas Lehmkühler

Hi,

 Maruan Sahyoun sahy...@fileaffairs.de wrote on 27 February 2015 at 09:19:
 
 
 Hi Andreas,
 
 the changes you made were these before or after the ones I did to COSBase wrt
 to immutable objects (PDFBOX-2685)?
My changes (PDFBOX-1822, PDFBOX-2515) were made prior to PDFBOX-2685. That
said, r1659998 could have introduced a regression. I had a quick look and maybe
the changes made to COSWriter are the root cause. But we have to debug that
first to be sure.

BR
Andreas Lehmkühler

 BR
 Maruan
 
 On 27.02.2015 at 08:45, Andreas Lehmkühler andr...@lehmi.de wrote:
 
  Hi,
  
  Tilman Hausherr thaush...@t-online.de wrote on 27 February 2015 at 07:35:
  
  
  Did you just start with signing or is this a recent phenomenon, i.e. 
  didn't happen a month ago?
  
  I looked at both files - in the 1.8 one, only the changed objects appear 
  after EOF. In the 2.0 one, all objects are there ?!
  Correct, something went wrong when appending the changed objects only. It
  worked for me when I fixed the encryption stuff. It seems as if some recent
  change introduced this regression.
  
  @Isaias
  Which exact version/revision of the trunk are you using?
  
  BR
  Andreas Lehmkühler
  
  Tilman
  
  On 27.02.2015 at 05:44, Isaias Barroso wrote:
  Hi all,
  
  I'm using PDFBOX 2.0 to sign some documents and I found that the size of
  signed file is too big if compared with 1.8 version, sometimes those files
  get their sizes  increased in 100% or more. When the same file is signed
  using 1.8 the file is increased in a expected way.
  
  Original File: https://www.dropbox.com/s/s8p40ukorhchtcu/sign_me.pdf?dl=0
  
  Signed With 1.8:
  https://www.dropbox.com/s/ty8axylq8ol6204/sign_me_signed_1.8.pdf?dl=0
  
  Signed With 2.0:
  https://www.dropbox.com/s/ge1x3mdpqlalnvq/sign_me_signed_2.0.pdf?dl=0
  
  There is some option to reduce the signed file size on 2.0?
  
  Best regards
  
  
  
  

Re: [PDFBOX-2.0] PDF Size after Signature

2015-02-26 Thread Andreas Lehmkühler

Hi,

 Tilman Hausherr thaush...@t-online.de wrote on 27 February 2015 at 07:35:
 
 
 Did you just start with signing or is this a recent phenomenon, i.e. 
 didn't happen a month ago?
 
 I looked at both files - in the 1.8 one, only the changed objects appear 
 after EOF. In the 2.0 one, all objects are there ?!
Correct, something went wrong when appending the changed objects only. It
worked for me when I fixed the encryption stuff. It seems as if some recent
change introduced this regression.

@Isaias
Which exact version/revision of the trunk are you using?

BR
Andreas Lehmkühler
 
 Tilman
 
 On 27.02.2015 at 05:44, Isaias Barroso wrote:
  Hi all,
 
  I'm using PDFBOX 2.0 to sign some documents and I found that the size of
  signed file is too big if compared with 1.8 version, sometimes those files
  get their sizes  increased in 100% or more. When the same file is signed
  using 1.8 the file is increased in a expected way.
 
  Original File: https://www.dropbox.com/s/s8p40ukorhchtcu/sign_me.pdf?dl=0
 
  Signed With 1.8:
  https://www.dropbox.com/s/ty8axylq8ol6204/sign_me_signed_1.8.pdf?dl=0
 
  Signed With 2.0:
  https://www.dropbox.com/s/ge1x3mdpqlalnvq/sign_me_signed_2.0.pdf?dl=0
 
  There is some option to reduce the signed file size on 2.0?
 
  Best regards
 
 

Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-26 Thread Andreas Lehmkühler

Hi,

 Steve Antoch sant...@yuzu.com wrote on 25 February 2015 at 00:04:
 
 
 Hi Andreas-
 
 Thanks again.
 
 I downloaded and built the latest from trunk.  
 There was no change for the book I was testing.  I first tried it after taking
 out my if (streamOffset > 0) test, but the null reference exception still
 occurred.
OK, thanks again for testing. I've fixed the issue based on your analysis.

 We are planning on running a large breadth test on approximately 108,000 pdfs
 starting tonight.  I will let you know how this test goes.  It will take about
 4 days to complete.
Cool, I'm looking forward to see the results.

 With respect to the small change I made in my fork:
 https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
 
 The issue was a separate but fairly rare failure that we found in a small
 number (about 10) of our pdfs.
 Adobe and Pdfium (Chrome) were both able to open them but pdfBox was not due
 to disallowing nesting.  I figured that if Pdfium allows 64 levels of nesting,
 we might be able to relax this test from 0 levels to allowing 1 level and see
 if it worked.  Since it did, I wanted to run those changes by you for your
 comments.
Is there any chance of getting hold of a sample pdf? It would be good enough to
send it via private mail to me:

BR
Andreas Lehmkühler

 
 Thanks-
 Steve
 
 
 From: Andreas Lehmkühler andr...@lehmi.de
 Sent: Tuesday, February 24, 2015 3:30 AM
 To: users@pdfbox.apache.org
 Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
 (or variation of it still present)
 
 Hi Steve,
 
  Steve Antoch sant...@yuzu.com wrote on 23 February 2015 at 19:42:
 
 
  @Andreas-
 
  I have downloaded the latest trunk and came close (it got much further)
  before
  failing.
  However, I think I may have a fix for that failure:
 Thanks for the test
 
  The code is returning 0 when the xrefstm fixedOffset is not found.  However,
  the code still tries to load and parse from xref 0, resulting in a null
  reference exception later in parser.parse().
 Your analysis is correct, but I hope that my last improvements should
 eliminate
 such cases, see PDFBOX-2572 for details. Could you give the latest trunk
 (r1661747) a try?
 
  However, thinking about this, I came up with this:
 
  // check for a XRef stream, it may contain some object ids of compressed objects
  if (trailer.containsKey(COSName.XREF_STM))
  {
      int streamOffset = trailer.getInt(COSName.XREF_STM);
      // check the xref stream reference
      fixedOffset = checkXRefStreamOffset(streamOffset, false);
      // <== fixedOffset comes back as 0 => not found
      if (fixedOffset > -1 && fixedOffset != streamOffset)
      {
          streamOffset = (int) fixedOffset;
          // <== streamOffset gets set to 0 here
          trailer.setInt(COSName.XREF_STM, streamOffset);
      }

      if (streamOffset > 0) // <== I added this test because an xref stream starting at
                            //     offset 0 can never happen, so we should simply skip it
      {
          pdfSource.seek(streamOffset);
          skipSpaces();
          // <== this call ultimately throws a null ref exception if streamOffset == 0 on entry
          parseXrefObjStream(prev, false);
      }
  }
 
  Adding that, the file successfully parses.
 
  Also, there was this proposal that I put up on github in a repo that I
  directly forked from pdfbox (it is the only change)
  It relaxes the looping a bit to allow limited recursion.  I would appreciate
  your thoughts on it.
 Is this change related to the discussed issue above?
 
  https://github.com/santoch/pdfbox/commit/75cc32ab8307062709c30f1cfea5e2fdb8c00ddd
 
  Thank you so much!  You have been tremendously helpful.  I wish I could have
  given you the files, but unfortunately, they are proprietary and we cannot
  release them.  :-(
 No need to worry, you are not the only one who is not allowed to share a
 specific pdf.
 
  Best regards-
  Steve
 
 BR
 Andreas Lehmkühler
 
 
  
  From: Andreas Lehmkühler andr...@lehmi.de
  Sent: Monday, February 23, 2015 3:43 AM
  To: users@pdfbox.apache.org
  Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
  (or variation of it still present)
 
  Hi,
 
  I've improved the self-repair mechanism of the trunk based on Steve's report.
 
  @Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue
  still
  persist?
 
  BR
  Andreas Lehmkühler
 
   Steve Antoch sant...@yuzu.com hat am 17. Februar 2015 um 00:05
   geschrieben

Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-23 Thread Andreas Lehmkühler

Hi,

I've improved the self-repair mechanism of the trunk based on Steve's report.

@Steve Please give the newest trunk version/SNAPSHOT a try. Does the issue still
persist?

BR
Andreas Lehmkühler

 Steve Antoch sant...@yuzu.com wrote on 17 February 2015 at 00:05:
 
 
 
 Andreas-
 Thanks for the response.
 Sorry for sending directly.
 
 Yes, it tries to read from offset 112085940, but does not find the xrefstm
 there, so 
 that's when it goes searching.  It seems to be landing in the middle of
 something else (perhaps an image?)
 
 I tried running the preflight command on the file, and this is what it found
 there.
 This is in the middle of a whole series of repetitive byte patterns like
 these, which is interspersed with other sections of content that is also
 binary only.
 
 ?xml version=1.0 encoding=UTF-8 standalone=no?
 preflight name=file.pdf
   executionTimeMS2646/executionTimeMS
   isValid type=false/isValid
   errors count=1
 error count=1
   code1.0/code
   detailsSyntax error, Error: Expected a long type at offset 112085940,
 instead got
 '6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ³fÍ#155;6lÙ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'/details
 /error
   /errors
 /preflight
 
 The patterns seem to be:
 
 lots of these: 6lÙ³fÍ#155;
 interspersed between blocks that are similar to this:
 ±¯Óz·C#156;3Í}#14;y#11;ó#3;£g#130;?1º·Ó#158;-ó#143;VÏ:ë½NsË#142;¸#31;6lÙ³fÅ#ë#147;#29;#31;¨Î÷å.£=#137;ù}ÕsÞÿ'
 
 It just so happens that the offset 112085940 falls right in the middle of a
 big block of those 6lÙ³fÍ#155; repetitive blocks.
 
 Not sure if that's any help. 
 
 Steve
 
 
 From: Andreas Lehmkühler andr...@lehmi.de
 Sent: Monday, February 16, 2015 3:34 AM
 To: users@pdfbox.apache.org
 Subject: Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present
 (or variation of it still present)
 
 Hi,
 
  Steve Antoch sant...@yuzu.com wrote on 13 February 2015 at 23:34:
 
 
 
  Hi Tilman and Andreas--
 Please don't contact developers directly, use our mailing lists instead. I've
 put the users list back into the boat...
 
  I am working with Krasimir on this issue.
 
  Although we asked, we were denied permission to send the document out.
 :-(
 
  The failure is being triggered when we attempt to use the Encrypt() class to
  password protect the pdf.
  We end up with the Expected a long type at offset 113884174, instead got
  'xref' failure.
 
  I have debugged into the PDFBox code and found the offending parts.
 
  PdfBox is  trying to parse an xref table located at 113884174.
 
  The problem we are seeing is that the inside the trailer it finds the
  /XRefStm
  label, and its offset value is returned as 112085940 (which is what is given
  in the file),
  However, the checkXRefOffset() call made to verify it doesn't find the xref
  stream there, so it goes searching and ends up returning the closest xref
  offset it can find, which happens to be that it returns its own offset at
  113884174.
 
 
  I believe that there is an error in PdfBox with respect to this fixup logic,
  even if it had found the 'correct' xref stream.
  That is because the fixup offset can NEVER work.  Every time it fixes up the
  location, it lands on a section which begins with xref.
  The next call is to skip the whitespace, but since there is never any there
  (it's already proven to be 'xref'),  it does not advance the input stream.
  Then, the first call to parse that xrefstm always calls readObjectID(),
  which
  always will throw the exception because the bytes are always 'xref'.
 
  So, my questions are:
 
  1) Are these docs fixable or are they truly corrupt?
 Without having the pdf itself at hand it's hard to give a 100% answer. But I
 guess there has to be a fix, as Adobe is able to open that pdf. I'll try to
 find one, following your description of the pdf.
 
  2) Is this xref issue a known issue with PdfBox?  I would try to create a
  document that displays the error but I honestly don't know how to do so
  (beyond
  sending the ones that we have that DO display it).
 Not until now
 
  3) Do you have any idea how these documents end up in this state if they are
  being edited by tools such as InDesign, Acrobat, etc? Is there something I
  can
  do to identify them?
 There are a lot of more or less corrupt files in the wild. Those are created
 using different tools.
 
  4) If this is a truly corrupted document, why would Acrobat be able to open
  these files but pdfBox cannot?  Are these streams somehow ignorable?  I ask
  this because I saw this statement on a web page
   (http://resources.infosecinstitute.com/pdf-file-format-basic-structure/)
  when
  I

Re: How to attach files to messages sent to users@pdfbox.apache.org?

2015-02-17 Thread Andreas Lehmkühler

Hi Alan,

Most kinds of attachments are not allowed. Either attach the file(s) to the
related JIRA ticket or provide them via a file-sharing service, public web space, etc.

BR
Andreas Lehmkühler

 Alan Masters amast...@nhbc.co.uk wrote on 17 February 2015 at 11:10:
 
 
 Please could someone help?
 
 I have attached files to better describe my problem, but when they are fed
 back to me and presumably to other members of the forum, the attachments are
 missing.
 
 
 Alan Masters | Principal Analyst / Programmer | IT Department
 Direct tel: 01908 747126 | email: amast...@nhbc.co.uk
 NHBC | NHBC House | Davy Avenue | Knowlhill | Milton Keynes | Bucks | MK5 8FP
 | Tel: 0844 633 1000 | www.nhbc.co.uk
 
 
 


Re: https://issues.apache.org/jira/browse/PDFBOX-2523 still present (or variation of it still present)

2015-02-16 Thread Andreas Lehmkühler

 the last trailer.
             trailerOffset = pdfSource.getOffset();
             // PDFBOX-1739 skip extra xref entries in RegisSTAR documents
             while (isLenient && pdfSource.peek() != 't')
             {
                 if (pdfSource.getOffset() == trailerOffset)
                 {
                     // warn only the first time
                     LOG.warn("Expected trailer object at position " + trailerOffset
                             + ", keep trying");
                 }
                 readLine();
             }
             if (!parseTrailer())
             {
                 throw new IOException("Expected trailer object at position: "
                         + pdfSource.getOffset());
             }
             COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
             // check for a XRef stream, it may contain some object ids of compressed objects
             if (trailer.containsKey(COSName.XREF_STM))  // <== YES - but falue
             {
                 // <== This returns 112085940, which is the value from the trailer
                 int streamOffset = trailer.getInt(COSName.XREF_STM);
                 // check the xref stream reference
                 // <== checks it and returns 113884174 instead
                 fixedOffset = checkXRefOffset(streamOffset);
                 if (fixedOffset > -1 && fixedOffset != streamOffset)
                 {
                     streamOffset = (int)fixedOffset;
                     trailer.setInt(COSName.XREF_STM, streamOffset);
                 }
                 pdfSource.seek(streamOffset);  // <== Seeks to 113884174
                 //readExpectedString(XREF_TABLE, false);
                 skipSpaces();  // <== It's ON xref, so it doesn't skip anything
                 // <== goes in here, first thing it tries to do is readObjectNumber(),
                 //     which can't work because it's 'xref' -- BOOM
                 parseXrefObjStream(prev, false);
             }
             prev = trailer.getInt(COSName.PREV);
             if (prev > -1)
             {
                 // check the xref table reference
                 fixedOffset = checkXRefOffset(prev);
                 if (fixedOffset > -1 && fixedOffset != prev)
                 {
                     prev = fixedOffset;
                     trailer.setLong(COSName.PREV, prev);
                 }
             }
         }
         else
         {
             // parse xref stream
             prev = parseXrefObjStream(prev, true);
             if (prev > -1)
             {
                 // check the xref table reference
                 fixedOffset = checkXRefOffset(prev);
                 if (fixedOffset > -1 && fixedOffset != prev)
                 {
                     prev = fixedOffset;
                     COSDictionary trailer = xrefTrailerResolver.getCurrentTrailer();
                     trailer.setLong(COSName.PREV, prev);
                 }
             }
         }
     }
     //  build valid xrefs out of the xref chain
     xrefTrailerResolver.setStartxref(startXrefOffset);
     COSDictionary trailer = xrefTrailerResolver.getTrailer();
     document.setTrailer(trailer);
     document.setIsXRefStream(XRefType.STREAM == xrefTrailerResolver.getXrefType());
     // check the offsets of all referenced objects
     checkXrefOffsets();
     // copy xref table
     document.addXRefTable(xrefTrailerResolver.getXrefTable());
     return trailer;
 }
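
The failure annotated in the listing above comes down to what actually sits at the corrected offset: a classic table that begins with the keyword xref, or an xref stream that begins with an object header such as "12 0 obj". The following is only a standalone sketch of that distinction. It is not PDFBox code and not the fix that went into PDFBox; the class XrefTargetSketch and all of its methods are hypothetical names made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Standalone illustration; none of these names are PDFBox API. */
public class XrefTargetSketch {

    public enum Kind { XREF_TABLE, XREF_STREAM, UNKNOWN }

    private static boolean isWhitespace(byte b) {
        // the six PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static int skipSpaces(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos])) {
            pos++;
        }
        return pos;
    }

    private static int skipDigits(byte[] data, int pos) {
        while (pos < data.length && data[pos] >= '0' && data[pos] <= '9') {
            pos++;
        }
        return pos;
    }

    private static boolean startsWith(byte[] data, int pos, String keyword) {
        byte[] k = keyword.getBytes(StandardCharsets.ISO_8859_1);
        if (pos + k.length > data.length) {
            return false;
        }
        for (int i = 0; i < k.length; i++) {
            if (data[pos + i] != k[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies what starts at the given offset, after optional whitespace. */
    public static Kind classify(byte[] data, int offset) {
        if (offset < 0) {
            return Kind.UNKNOWN;
        }
        int pos = skipSpaces(data, offset);
        if (startsWith(data, pos, "xref")) {
            return Kind.XREF_TABLE;
        }
        // an xref stream is an indirect object: "<objNr> <genNr> obj"
        int afterObjNr = skipDigits(data, pos);
        int genStart = skipSpaces(data, afterObjNr);
        int afterGen = skipDigits(data, genStart);
        int keywordStart = skipSpaces(data, afterGen);
        boolean wellFormed = afterObjNr > pos && genStart > afterObjNr
                && afterGen > genStart && keywordStart > afterGen;
        return wellFormed && startsWith(data, keywordStart, "obj")
                ? Kind.XREF_STREAM : Kind.UNKNOWN;
    }

    public static Kind classify(String pdfText, int offset) {
        return classify(pdfText.getBytes(StandardCharsets.ISO_8859_1), offset);
    }

    public static void main(String[] args) {
        System.out.println(classify("xref\n0 1\n0000000000 65535 f \n", 0));
        System.out.println(classify("12 0 obj\n<< /Type /XRef >>\n", 0));
    }
}
```

A caller that performed such a check before dispatching would not hand a position that starts with xref to a routine expecting an object number, which is the situation the annotations describe.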


BR
Andreas Lehmkühler


Re: [PDFBOX-2.0] Signature Issue

2015-02-04 Thread Andreas Lehmkühler

Hi,

 Isaias Barroso isaias.barr...@gmail.com wrote on 3 February 2015 at 22:58:
 
 
 Hi Andreas,
 
 Now the request to save the file on close no longer occurs, but the signature
 appears as invalid in Adobe Forms. I'm using the
 pdfbox-2.0.0-20150203.200017-1038.jar and
 pdfbox-examples-2.0.0-20150203.200144-1010.jar SNAPSHOTS.
Thanks for checking. I'll have a look at the remaining issue later.

 Bouncycastle 1.51 is being used. The keystore is the same one used in the PDFBOX
 test resource directory.
 
 I'm sending the result file.
Attachments aren't allowed. Either upload the file to a public place or send
it to me directly.

BR
Andreas Lehmkühler

 Best regards
 
 On Mon, Feb 2, 2015 at 6:13 PM, Isaias Barroso isaias.barr...@gmail.com
 wrote:
 
  Thank you,
 
  After test I'll give a feedback.
 
  BR
 
  On Mon, Feb 2, 2015 at 6:05 PM, Andreas Lehmkuehler andr...@lehmi.de
  wrote:
 
  Hi,
 
  On 02.02.2015 at 20:24, Isaias Barroso wrote:
 
  Hi Andreas,
 
  Does today's SNAPSHOT (pdfbox-2.0.0-20150202.110005-1034) already
  contain the fixed code?
 
  I'm afraid not. You have to wait for the next successful build.
 
  BR
  Andreas Lehmkühler
 
 
   BR
 
  On Mon, Feb 2, 2015 at 5:12 PM, Andreas Lehmkuehler andr...@lehmi.de
  wrote:
 
   Hi,
 
 
  On 29.01.2015 at 16:10, Isaias Barroso wrote:
 
   Hi Ruben,
 
  I think it isn't the same problem, because the file is correctly signed
  using PDFBOX 1.8.8 and BouncyCastle 1.45.
 
   I guess the problem was a missing trailer. I've fixed that in the
  trunk,
  see [1] for further details.
 
  Please, double check if everything is fine now.
 
  BR
  Andreas Lehmkühler
 
  [1] https://issues.apache.org/jira/browse/PDFBOX-2656
 
 
   Best regards
 
 
  On Thu, Jan 29, 2015 at 12:05 PM, Ruben Lagar ruben.la...@gmail.com
  wrote:
 
Hi Isaias,
 
 
  I had a similar problem, and I think it is related to the problem
  described
  in this Jira
 
  https://issues.apache.org/jira/browse/PDFBOX-1822
 
  There is no fix yet, as far as I know.
 
 
  On Thu Jan 29 2015 at 1:39:40 PM, Isaias Barroso (isaias.barr...@gmail.com)
  wrote:
 
Hi Andreas,
 
 
  I got the updated SNAPSHOT (pdfbox-2.0.0-20150129.080600-1013.jar) and
  used the sign_me.pdf and keystore.p12 provided with the test case. The
  result file follows; Adobe Reader now says that the signature is invalid,
  and when I close it the save message appears.
  I've also tried the CreateSignature class from the
  pdfbox-examples-2.0.0-20150129.080737-985.jar SNAPSHOT.
 
  BouncyCastle 1.51 is being used.
 
  Best regards
 
 
  On Thu, Jan 29, 2015 at 9:53 AM, Andreas Lehmkühler 
  andr...@lehmi.de
  wrote:
 
  Hi,
 
  Isaias Barroso isaias.barr...@gmail.com wrote on 28 January 2015 at 12:35:
 
   Hi all,
 
   I'm trying the PDFBOX 2 SNAPSHOT and I have an issue with Signature: the
   file is processed and its size increases, but when I open the file in
   Adobe Reader the signature information isn't shown. When I close it, a
   message that the document was modified appears, so I think the
   process wasn't completed correctly, although no exception is thrown.
 
   To run the tests, I've used a pdfbox-examples snapshot
   (org.apache.pdfbox.examples.signature.CreateSignature)
   https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-examples/2.0.0-SNAPSHOT/
 
  Which exact SNAPSHOT version did you use? There were some recent changes.
 
   Do you have any suggestion to investigate the root cause?
 
  What exactly did you do to sign the PDF? Did you have a look at the provided
  test case [1], which demonstrates all the steps necessary to sign a PDF?
 
   Best regards
 
   --
   Isaías Barroso
   Belo Horizonte - MG
 
  BR
  Andreas Lehmkühler
 
  [1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/test/java/org/apache/pdfbox/examples/pdmodel/TestCreateSignature.java?view=markup
 
 
 
 
 
  --
  Isaías Barroso
  Belo Horizonte - MG
 
  

Re: [PdfBox 2.0] Page rendered as a blank image

2015-02-03 Thread Andreas Lehmkühler

Hi

 Kevin Morin mo...@codelutin.com wrote on 3 February 2015 at 11:57:
 
 
 Hi,
 
 I tried to render a PDF as an image, but one of the pages is rendered
 blank. Here are the traces:
 2015/02/03 11:53:26 ERROR (org.apache.pdfbox.pdfparser.COSParser:1169) - Can't find the object 10 0 (origin offset 724158)
 2015/02/03 11:53:26 ERROR (org.apache.pdfbox.contentstream.PDFStreamEngine:840) - Missing XObject: Im1
 
 Tell me whom I can send the PDF to in private; I cannot send it publicly.
It looks like a parser issue, please send it to me.

 Thanks
 BR
 
 Kevin
 
 

BR
Andreas Lehmkühler


Re: [PDFBOX-2.0] Signature Issue

2015-01-29 Thread Andreas Lehmkühler

Hi,


 Isaias Barroso isaias.barr...@gmail.com wrote on 28 January 2015 at 12:35:
 
 
 Hi all,
 
 I'm trying the PDFBOX 2 SNAPSHOT and I have an issue with Signature: the
 file is processed and its size increases, but when I open the file in
 Adobe Reader the signature information isn't shown. When I close it, a
 message that the document was modified appears, so I think the
 process wasn't completed correctly, although no exception is thrown.
 
 To make the tests, I've used a pdfbox-examples snapshot
 (org.apache.pdfbox.examples.signature.CreateSignature)
 https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-examples/2.0.0-SNAPSHOT/
Which exact SNAPSHOT version did you use? There were some recent changes.

 Do you have any suggestion to investigate the root cause?
What exactly did you do to sign the PDF? Did you have a look at the provided
test case [1], which demonstrates all the steps necessary to sign a PDF?

 Best regards
 
 -- 
 Isaías Barroso
 Belo Horizonte - MG

BR
Andreas Lehmkühler

[1]
http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/test/java/org/apache/pdfbox/examples/pdmodel/TestCreateSignature.java?view=markup


Re: Aw: Re: Type1Glyph2D No glyph for 41 (.notdef) in font Helvetica

2015-01-29 Thread Andreas Lehmkühler

Hi,

 Andreas Lüdtke andi.lued...@gmx.de wrote on 29 January 2015 at 08:51:
 
 
 Hi Tilman,
 
 you will find the pdf file here:
 https://www.dropbox.com/s/4v6tnroz6a8imsp/rg-1234567890BA.pdf?dl=0
 The converted image is here:
 https://www.dropbox.com/s/rqnuou03elxrgb6/rg-1234567890BA1.jpg?dl=0
 
 In this case I used pdfbox-app-2.0.0-20150127.230110-988.jar to generate the
 image but the result is the same when I use my app.
 
 BTW: if the pdf has all fonts embedded, I don't have this problem.
There seems to be an issue with our font mapping if the fonts aren't embedded.
Besides, IMO you should reconsider your font handling, especially as you're
creating the PDFs yourself. It's always a bad idea not to embed the fonts you
use, as the reader then has to substitute the missing fonts somehow, and such a
replacement may lead to imperfect rendering.

BR
Andreas Lehmkühler

 Sent: Wednesday, 28 January 2015 at 17:45
 From: Tilman Hausherr thaush...@t-online.de
 To: users@pdfbox.apache.org
 Subject: Re: Type1Glyph2D No glyph for 41 (.notdef) in font Helvetica
 Please upload a sample file somewhere and post the url
 
 Tilman
 
 On 28.01.2015 at 10:37, Andreas Lüdtke wrote:
  Hi,
 
  I'm using pdfbox 2.0.0 version trunk from yesterday and I get a lot of such
  warning messages when I convert a pdf file to an image. The pdf file has NO
  embedded fonts.
 
  The resulting images are pretty empty beside some images and lines: no
  single character is visible. I read somewhere that current versions of
  pdfbox 2.0.0 should handle these fonts properly, but I can't confirm this. I
  use jdk 1.7.0_72 on windows 8.1 64bit.
 
  How can I make the characters visible in the converted images?
 
  Best regards
 
  Andreas
 

Re: Error on PDDocument.load

2015-01-21 Thread Andreas Lehmkühler

Hi,

Kevin Morin mo...@codelutin.com wrote on 21 January 2015 at 12:14:

I thought I was running java 7 but it's java 8... I tried with java 7
and it works. I do not need it to work with java 8, java 7 is ok for me.
It works for me using Java 8 on both Win7 and Linux. I guess the issue has
to be something else.

BR
Andreas Lehmkühler

Thanks for your help and for all your work.

Kevin

On 21/01/2015 11:54, Maruan Sahyoun wrote:
Hi Kevin

works for me - what's your Java Version?

BR
Maruan

On 21.01.2015 at 11:24, Kevin Morin mo...@codelutin.com wrote:

Hi,

it does not work with PDFToImage either, I still get a blank image. Plus, I
did not set the nonSeq option however it seems to be using the non
sequential parser. And I have the following traces:
janv. 21, 2015 11:20:02 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser checkXrefOffsets
GRAVE: Can't find the object 7 0 (origin offset 359138)
janv. 21, 2015 11:20:03 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
GRAVE: Missing XObject: Im1

Kevin

On 21/01/2015 11:11, Maruan Sahyoun wrote:
Hi Kevin,

you can test with the PDFToImage command [1], available in the
pdfbox-app [2], to see whether the issue happens there. The source for PDFToImage is
available in the tools section of the SVN repo and can be viewed online [3].

BR
Maruan

[1] https://pdfbox.apache.org/1.8/commandline.html#pdfToImage
[2]
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
[3]
http://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java?view=markup

On 21.01.2015 at 11:00, Kevin Morin mo...@codelutin.com wrote:

Hi Andreas,

I am using the latest snapshot available in the Maven repository. I
am running my app on Windows Server 2008 R2 Standard and it does not work
(white page). Could you send me the code or a jar to test on this server, to
check whether the problem comes from my code?

Kevin

On 19/01/2015 19:13, Andreas Lehmkuehler wrote:
Hi,

On 19.01.2015 at 12:45, Kevin Morin wrote:
Actually, the issue is not only these traces. The real issue is that I
have a
blank image when I try to render the document.
I've checked your PDF and everything renders fine. I've tried
SNAPSHOT-891 on linux (running java 1.8, 1.7 and 1.6) and the latest
SNAPSHOT-947 on win7 running java 1.7

Maybe your SNAPSHOT is outdated?

BR
Andreas Lehmkühler

On 19/01/2015 12:39, Kevin Morin wrote:
Hi,

I am using the 2.0 snapshot version to render images of PDFs, but on some
documents I get the following errors when I call PDDocument.load(file):
2015/01/19 12:32:48 ERROR (org.apache.pdfbox.pdfparser.NonSequentialPDFParser:1864) - Can't find the object 7 0 (origin offset 359138)
2015/01/19 12:32:48 ERROR (org.apache.pdfbox.contentstream.PDFStreamEngine:840) - Missing XObject: Im1

I first had it a few days ago (I did not report it, shame on me) but
the
error did not occur when I called the loadLegacy method on PDDocument.
But the loadLegacy method is not available anymore...

The issue happens on Windows (works fine on Debian).

Thanks for your help

Kevin
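
The trace "Can't find the object 7 0 (origin offset 359138)" quoted above means that the bytes at the offset recorded in the xref table do not start with the expected object header. As a rough illustration only, here is a standalone sketch of such a check plus a brute-force fallback search. It is not PDFBox's implementation: the class ObjectOffsetSketch and its methods are hypothetical, and it assumes single spaces between the tokens, whereas real PDFs allow arbitrary whitespace there.

```java
import java.nio.charset.StandardCharsets;

/** Standalone illustration; none of these names are PDFBox API. */
public class ObjectOffsetSketch {

    /** Returns true if the bytes at offset read "<objNr> <genNr> obj". */
    public static boolean objectStartsAt(byte[] data, long offset, long objNr, int genNr) {
        byte[] expected = (objNr + " " + genNr + " obj")
                .getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + expected.length > data.length) {
            return false;
        }
        int start = (int) offset;
        // a preceding digit would mean we are in the middle of a longer number
        if (start > 0 && data[start - 1] >= '0' && data[start - 1] <= '9') {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[start + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    /** Scans the whole buffer for the object header; returns its offset or -1. */
    public static long findObject(byte[] data, long objNr, int genNr) {
        for (int i = 0; i < data.length; i++) {
            if (objectStartsAt(data, i, objNr, genNr)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n7 0 obj\n<<>>\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        long recorded = 3; // deliberately wrong offset, as in a damaged xref table
        if (!objectStartsAt(pdf, recorded, 7, 0)) {
            System.out.println("object 7 0 not at " + recorded
                    + ", found at " + findObject(pdf, 7, 0));
        }
    }
}
```

When neither the recorded offset nor a search yields the object, anything that references it, such as the XObject Im1 in the trace, cannot be resolved, which would explain a blank page.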

Re: unsubscribe [SEC=UNOFFICIAL]

2015-01-12 Thread Andreas Lehmkühler

Hi James,

to unsubscribe you have to write an email to users-unsubscr...@pdfbox.apache.org.
See [1] for further details.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/mailinglists.html


 Weatherly, James james.weathe...@humanservices.gov.au wrote on 12 January
 2015 at 00:58:
 
 
 
 

Re: Content of pdf moved around

2015-01-11 Thread Andreas Lehmkühler

Hi Ray,

to unsubscribe you have to write an email to users-unsubscr...@pdfbox.apache.org.
See [1] for further details.

BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/mailinglists.html

 Ray Morris ray.morris.brisb...@bigpond.com wrote on 10 January 2015 at 22:48:
 
 
 Please unsubscribe ray.morris.brisb...@bigpond.com
 
 I briefly had the ambition to teach myself how to maintain bookmarks and XML 
 metadata for sheet music libraries but gave up that idea because of the 
 complexity of PDF files.
 
 -Original Message- 
 From: Tilman Hausherr
 Sent: Saturday, January 10, 2015 11:24 PM
 To: users@pdfbox.apache.org
 Subject: Re: Content of pdf moved around
 
 Hi,
 
 The PDF didn't go through (never does), but you can try to use
 PDFTextStripper.setSortByPosition().
 
 Tilman
 
 On 10.01.2015 at 14:04, Renaud Billen wrote:
  Hello,
 
  I have a little issue with the extraction of the text of some pdfs, where 
  some words are switching order with others..
 
  With the pdf attached to this mail, if I use « save as text » from Adobe
  Reader, I get:
 
  Référence: LIX-673LIX-6737
 
 
  Nom: The test company
 
 
  Type:
  Ouverture: 24/04/2007
 
  Titulaire: BD
  Resp.: LIX
  Co-Resp.: BB
  Client
 
 
 
 
  But with pdfbox I get :
 
  Référence: LIX-6737
  Nom: The test company
  Titulaire: BD
  Resp.: LIX
  Co-Resp.: BB
  Type:
  Ouverture: 24/04/2007
  Client
 
 
  Could you tell me if something can be done to solve this problem?
 
  Thanks,
  Renaud
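
The switch suggested in the quoted reply, PDFTextStripper.setSortByPosition(), is called with true on the stripper before the text is extracted. What sorting by position means can be illustrated without PDFBox: fragments are emitted top to bottom and left to right instead of in content-stream order. The sketch below is a simplified standalone model, not PDFBox's algorithm; the class SortByPositionSketch, the Fragment type and the coordinates in main are hypothetical and chosen only to mimic the field order reported above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Standalone illustration; none of these names are PDFBox API. */
public class SortByPositionSketch {

    /** A piece of text with its position; y grows downwards in this model. */
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts ordered top to bottom, then left to right. */
    public static List<String> sortedByPosition(List<Fragment> streamOrder) {
        List<Fragment> copy = new ArrayList<>(streamOrder);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> result = new ArrayList<>();
        for (Fragment f : copy) {
            result.add(f.text);
        }
        return result;
    }

    public static void main(String[] args) {
        // content-stream order as reported for PDFBox; coordinates are made up
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("Nom:", 10, 40),
                new Fragment("Titulaire:", 10, 120),
                new Fragment("Type:", 10, 80),
                new Fragment("Ouverture:", 200, 80));
        System.out.println(sortedByPosition(streamOrder));
    }
}
```

A real implementation also has to tolerate small vertical differences between fragments on the same line, which this exact comparison on y does not do.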

Re: What all is in the standalone JAR?

2015-01-07 Thread Andreas Lehmkühler

Hi Thib,


 Thib Guicherd-Callin t...@cs.stanford.edu wrote on 7 January 2015 at 02:03:
 
 
 
 Hi folks,
 
 The standalone JAR (pdfbox-app-1.8.8.jar) looks tantalizingly like the 
 union of all the various components that make up PDFBox, but apparently 
 not. What classes does and doesn't it contain, exactly?
The standalone jar contains everything that is needed to run PDFBox standalone.

 Evidently, it contains all the class files from fontbox-1.8.8.jar, 
 jempbox-1.8.8.jar and pdfbox-1.8.8.jar, but none of the class files from 
 xmpbox-1.8.8.jar or preflight-1.8.8.jar. Is that correct?
There is another standalone jar for the preflight stuff.

 I was able to verify that it also contains all the class files from 
 Commons Logging 1.1.3 (the longtime stable version prior to 1.2). Is 
 that correct?
 
 Then I got lost with ICU4J. Obviously it contains tons of ICU4J class 
 files, but the list of class file names differs quite a lot from recent 
 ICU4J JARs. (I tried from 54.1.1 to 51.2.) The data files under 
 com/ibm/icu/impl/data/icudt38b/ could imply this is from ICU4J 3.8 or 
 3.8.1, and while a much better match, it's still slightly different. 
 What version is included in there? (As a side question, I'm no proponent 
 of trying to stay at the bleeding edge of dependencies, but if it's 
 really an ICU4J version from 2007, are there any plans to upgrade the 
 version of ICU4J included in the standalone JAR?)
ICU4J won't be part of the next major release of PDFBox anymore.

 Basically same question with Bouncy Castle. I went back to 1.46 and the 
 number of difference in class file names goes as I went back in Bouncy 
 Castle history but still quite a lot of differences. What version is in 
 there? (And if it's a really old version, are there any plans to upgrade 
 the version included in the standalone JAR?)
It seems to be 1.44 for PDFBox 1.8.8; the current trunk was updated to 1.51.

BC in particular can't be updated that easily, as there are API changes
from time to time, so we can't simply replace the bundled version when
releasing a bugfix release of PDFBox.

 Are there other bundles of class files in the standalone JAR that I 
 didn't notice? (I was looking at the Dependencies page for clues.)
Our documentation isn't really complete, but we are working on that.

For now you may have a look at the corresponding pom files, where all
dependencies are defined including the version.

 Thanks,
 
 Thib

BR
Andreas Lehmkühler
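
Besides reading the pom files, the question of which bundles a standalone jar contains can be approached by counting class files per package prefix. The following is only a sketch using the JDK's java.util.jar API; the class JarPackageSummary and its methods are hypothetical names, and the jar path is taken from the command line rather than assumed.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Standalone illustration built on the JDK only. */
public class JarPackageSummary {

    /** Turns "org/apache/pdfbox/pdmodel/PDDocument.class" into "org.apache.pdfbox" for depth 3. */
    public static String prefix(String entryName, int depth) {
        String[] parts = entryName.split("/");
        // the last part is the class file name itself, so it never counts as a package
        int n = Math.max(0, Math.min(depth, parts.length - 1));
        return String.join(".", Arrays.copyOf(parts, n));
    }

    /** Counts the .class entries of a jar per package prefix of the given depth. */
    public static Map<String, Integer> countClasses(File jar, int depth) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (JarFile jarFile = new JarFile(jar)) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".class")) {
                    counts.merge(prefix(name, depth), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // usage: java JarPackageSummary path/to/some.jar
        countClasses(new File(args[0]), 3)
                .forEach((pkg, count) -> System.out.println(count + "\t" + pkg));
    }
}
```

Run against the standalone jar, prefixes such as org.apache.pdfbox or com.ibm.icu would show up as separate rows, which gives a quick overview of the bundled libraries but says nothing about their versions.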
