Re: Help with NullPointerException org.apache.io.IOUtils.LOG

2024-03-15 Thread Tilman Hausherr

Searching for the error message I found this in a comment:

https://stackoverflow.com/questions/69151291/java-16-modularisation-illegalaccessexception-java-nio-spring-boot

--add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/jdk.internal.ref=ALL-UNNAMED
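
One way to confirm that the JVM options above actually reached the running JVM is to ask the module system directly. The following is only a minimal sketch: the class name AddOpensCheck and its helper are made up for illustration and are not part of PDFBox. It relies on the standard java.lang.Module.isOpen(String, Module) call and assumes the code runs on the classpath, i.e. in the unnamed module.

```java
import java.nio.ByteBuffer;

public class AddOpensCheck {

    /** Returns true if java.base opens the given package to this class's module. */
    static boolean isOpenToUs(String packageName) {
        Module javaBase = ByteBuffer.class.getModule();   // the java.base module
        Module ours = AddOpensCheck.class.getModule();    // unnamed module when run from the classpath
        return javaBase.isOpen(packageName, ours);
    }

    public static void main(String[] args) {
        // The two packages named in the --add-opens options
        for (String pkg : new String[] {"java.nio", "jdk.internal.ref"}) {
            System.out.println(pkg + " open to our module: " + isOpenToUs(pkg));
        }
    }
}
```

If a package is still reported as not open after the options have been added, the options probably did not reach the JVM that runs the application.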



Tilman

On 15.03.2024 18:48, Matthew Hardy wrote:

Hi Andreas,

I've upgraded to pdfbox 3.0.2, and I'm no longer getting the 
ExceptionInInitializerError when instantiating an empty PDDocument. However, I'm 
now receiving this error message:

ERROR [org.apache.pdfbox.io.IOUtils] (EE-ManagedExecutorService-default-Thread-1) 
Unmapping is not supported.: java.lang.reflect.InaccessibleObjectException: Unable to 
make public jdk.internal.ref.Cleaner java.nio.DirectByteBuffer.cleaner() accessible: 
module java.base does not "opens java.nio" to unnamed module @18f5234c

The PDDocument still instantiates, and I'm able to use it, but I'm concerned 
about this error message.

Matt Hardy
Software Developer
Perform Air International
463 South Hamilton Court
Gilbert, Arizona 85233
Phone: (480) 610-3500
Fax: (480) 610-3501
matt.ha...@performair.com
www.PerformAir.com

-----Original Message-----
From: Andreas Lehmkühler
Sent: Tuesday, March 12, 2024 9:50 AM
To: users@pdfbox.apache.org
Subject: Re: Help with NullPointerException org.apache.io.IOUtils.LOG

Hi Matthew,

this is a known issue with 3.0.1, see [1] for further details.

The upcoming version 3.0.2 includes a fix. Unless something unforeseen happens, 
the new version will be available in about 2 days from now.

Andreas

[1] https://issues.apache.org/jira/browse/PDFBOX-5758


On 12.03.24 at 17:40, Matthew Hardy wrote:

Hello,

We've recently upgraded to pdfbox 3.0.1. When attempting to instantiate an 
empty PDDocument, we receive the following error.

Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.commons.logging.Log.error(Object, java.lang.Throwable)" because 
"org.apache.pdfbox.io.IOUtils.LOG" is null
  at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.unmapper(IOUtils.java:278)
  at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
  at deployment.aeroxchange-edi.ear//org.apache.pdfbox.io.IOUtils.<clinit>(IOUtils.java:64)
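
The trace shows a static field (LOG) still being null while the class's static initializer is running. The sketch below shows how that can happen in general. The classes Broken and Fixed are hypothetical and are not the PDFBox source; the assumption is simply that a static field initializer calls a method that uses a logger declared further down in the same class.

```java
import java.util.logging.Logger;

public class StaticInitOrderDemo {

    // Hypothetical class: HELPER is initialized before LOG, so init() sees LOG == null.
    static class Broken {
        static final Object HELPER = init();
        static final Logger LOG = Logger.getLogger("demo");

        static Object init() {
            LOG.fine("initializing"); // NullPointerException: LOG has not been assigned yet
            return new Object();
        }
    }

    // Same code with the declarations swapped: LOG is assigned before init() runs.
    static class Fixed {
        static final Logger LOG = Logger.getLogger("demo");
        static final Object HELPER = init();

        static Object init() {
            LOG.fine("initializing");
            return new Object();
        }
    }

    /** Runs the action and reports whether class initialization succeeded. */
    static boolean initializes(Runnable touch) {
        try {
            touch.run();
            return true;
        } catch (LinkageError e) {
            // ExceptionInInitializerError on first use, NoClassDefFoundError on later uses
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Broken initializes: " + initializes(() -> Broken.HELPER.hashCode()));
        System.out.println("Fixed initializes:  " + initializes(() -> Fixed.HELPER.hashCode()));
    }
}
```

Static initializers run in textual order, so the failure surfaces as an ExceptionInInitializerError wrapping the NullPointerException.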

This is a Jakarta EE 10 EJB maven project, running on Java 17 in Wildfly 
30.0.1.Final. commons-logging 1.2 has been added as a dependency.

Any help would be greatly appreciated!

Matt Hardy
Software Developer
Perform Air International
463 South Hamilton Court
Gilbert, Arizona 85233
Phone: (480) 610-3500
Fax: (480) 610-3501
matt.ha...@performair.com
www.PerformAir.com



-
To unsubscribe, e-mail:users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail:users-h...@pdfbox.apache.org






Re: AFMParser optimization

2024-03-15 Thread Tilman Hausherr

Hi,

Thank you, done.

Tilman

On 15.03.2024 14:49, Guillaume Maillrd wrote:

Hi,

During a profiling session of my application, I found something that 
could interest you.


To speed up the AFMParser (50% gain),
the "equals" checks in parseCharMetric should be written in this order
(the top 5 by usage):


if (nextCommand.equals(CHARMETRICS_C)) {
...
} else if (nextCommand.equals(CHARMETRICS_WX)) {
...
} else if (nextCommand.equals(CHARMETRICS_N)) {
...
} else if (nextCommand.equals(CHARMETRICS_B)) {
...
} else if (nextCommand.equals(CHARMETRICS_L)) {
...
} ...

On my setup, it removes 80k calls to "equals".
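
The effect of reordering can be illustrated by counting comparisons. This is a rough sketch, not the AFMParser code: the class BranchOrderDemo is invented for the example, the key strings are standard AFM character-metric keys, and DECLARED_ORDER is a hypothetical "before" ordering chosen only to show the difference.

```java
import java.util.List;

public class BranchOrderDemo {

    // Chain order proposed in the message: most frequently used keys first
    static final List<String> FREQUENT_FIRST = List.of("C", "WX", "N", "B", "L", "CH", "W0X", "W1X");

    // Hypothetical alternative order with rarely used keys in between
    static final List<String> DECLARED_ORDER = List.of("C", "CH", "WX", "W0X", "W1X", "N", "B", "L");

    // Keys found on a typical metric line such as "C 32 ; WX 278 ; N space ; B 0 0 0 0 ;"
    static final List<String> TYPICAL_LINE = List.of("C", "WX", "N", "B");

    /** Counts the equals() calls an if/else-if chain in the given order needs for the commands. */
    static long comparisons(List<String> chainOrder, List<String> commands) {
        long count = 0;
        for (String command : commands) {
            for (String candidate : chainOrder) {
                count++;
                if (command.equals(candidate)) {
                    break; // the matching branch was found, the rest of the chain is skipped
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("frequent first: " + comparisons(FREQUENT_FIRST, TYPICAL_LINE));
        System.out.println("declared order: " + comparisons(DECLARED_ORDER, TYPICAL_LINE));
    }
}
```

The saving per line is small, but it is multiplied by every character in every AFM file that gets parsed.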

Regards,

Guillaume







Re: Type 0 font - Text extraction X PDF Debugger

2024-03-15 Thread Tilman Hausherr
Yes, identity does work for that file. However, using that logic fails to 
provide correct results for other files with an unusable /ToUnicode 
stream.


Yes, there can be larger blocks.

My suspicion is that the tools that use "identity" for your file will 
fail for some other files, unless we discover yet another tweak of that 
workaround algorithm that works with all of them.


Tilman

On 15.03.2024 14:28, Luiz Marcelo Modesto wrote:

Thank you Tilman!

I'll try to read ISO 32000-2:2020 again to look for some kind of precedence
rules regarding the way of decoding string codes to Unicode chars.

My impression is that there are some choices but I don't remember if there
is something assertive or not. Maybe it could be just an implementation
choice.

I'll try to debug the text extraction tool to verify why using the
predefined Identity CMap works.

If I've looked at the correct CMap file
(fontbox/target/classes/org/apache/fontbox/cmap/Identity-H), it also has a
lot of begincidrange/endcidrange blocks. It doesn't have any
beginbfchar/endbfchar block.

All the blocks have their length limited to 256 codes, but it seems PDFBox
can support larger blocks. But, maybe the set "<0100>  256" could be
a problem...

PS: Using "true" was just a quick and dirty way to run a fast test, as
the beginbfchar/endbfchar block suggested an identity mapping to me.




On Fri, 15 Mar 2024 at 01:35, Tilman Hausherr wrote:


You are correct that it's the "fb" parts that are missing. (And some of
the other tools you tried also mention this)

Just adding true makes text extraction incorrect for several files:
433525-p1.pdf, O226ORR4SMIKRGPWC6PXUYAYMSBB6FVP-p3.pdf,
PDFBOX-5540.pdf and R4EXG25W532JHDQLJAM4HF6O532TLR7D-p1.pdf.

Adding "&& !cmap.hasCIDMappings()" after "hasUnicodeMappings()" brings
no regressions, but your text is still not extracted properly.

Maybe it is possible to include yet another rule for your file, but
there's likely more to do and there is the risk that other files no
longer extract properly.

Tilman

On 15.03.2024 00:08, Luiz Marcelo Modesto wrote:

It seems that PDFBOX-5540 resolves a special case based on some dictionary
properties and chooses a predefined CMap (Identity CMap).

Reading the PDFont.java code, I think the warning "Invalid ToUnicode CMap
in font AvenirNextLTPro-Cn" comes from the fact that the CMap stream
doesn't contain 1 or more blocks of beginbfchar/endbfchar.

The two CMap's HashMaps (charToUnicodeOneByte and charToUnicodeTwoBytes)
are really empty.

But the font CMap stream contains this block:

2 begincidrange
<0001> <00FF> 1
<0100>  256
endcidrange

I'm sorry if I misunderstood, but this is a valid CMap too (it seems to be a
kind of Identity mapping too, except for the 0x00...), isn't it?

It's only shorter than the one I could have if I write several blocks of
beginbfchar/endbfchar.
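
For readers unfamiliar with the notation, here is a minimal sketch of how one begincidrange line is usually read, assuming the standard CMap semantics where each code in [low, high] maps to firstCid + (code - low). The class CidRange is hypothetical and is not the fontbox API. Only the first range from the block quoted above is used, because the upper bound of the second range is not shown in the message.

```java
public class CidRangeSketch {

    /** Hypothetical holder for one "<low> <high> firstCid" line of a begincidrange block. */
    static final class CidRange {
        final int low;
        final int high;
        final int firstCid;

        CidRange(int low, int high, int firstCid) {
            this.low = low;
            this.high = high;
            this.firstCid = firstCid;
        }

        /** Returns the CID for a character code, or -1 if the code is outside this range. */
        int toCid(int code) {
            return (code >= low && code <= high) ? firstCid + (code - low) : -1;
        }
    }

    public static void main(String[] args) {
        // "<0001> <00FF> 1" from the block quoted above
        CidRange first = new CidRange(0x0001, 0x00FF, 1);
        for (int code : new int[] {0x0001, 0x0042, 0x00FF, 0x0100}) {
            System.out.printf("code 0x%04X -> CID %d%n", code, first.toCid(code));
        }
    }
}
```

Note that such a range yields CIDs rather than Unicode values, which would be consistent with the two charToUnicode maps being empty.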

If I make this "dumb" modification (adding "true" to the conditions) just
for a rapid test

if (cmapName.contains("Identity") //
        || ordering.contains("Identity") //
        || COSName.IDENTITY_H.equals(encoding) //
        || COSName.IDENTITY_V.equals(encoding) || true)
{
    COSDictionary encodingDict = dict.getCOSDictionary(COSName.ENCODING);
    if (true || encodingDict == null || !encodingDict.containsKey(COSName.DIFFERENCES))
    {
        // assume that if encoding is identity, then the reverse is also true
        cmap = CMapManager.getPredefinedCMap(COSName.IDENTITY_H.getName());
        LOG.warn("Using predefined identity CMap instead");
    }
}

I've got "BCD" string like all the others

The encoding parameter is ignored when writing to the console.
mar 14, 2024 7:30:27 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Invalid ToUnicode CMap in font AvenirNextLTPro-Cn
mar 14, 2024 7:31:00 PM org.apache.pdfbox.pdmodel.font.PDFont
loadUnicodeCmap
ADVERTÊNCIA: Using predefined identity CMap instead
Página 4 de 4
Informações:  BCD

Maybe the text extraction tool should be using the begincidrange/endcidrange
information...

What do you think?

PS.: I've read some pieces from ISO 32000-2:2020 but it is quite long.
Maybe I'm missing something... I'm sorry if this is the case...

On Thu, 14 Mar 2024 at 10:30, Luiz Marcelo Modesto <
lmodesto.w...@gmail.com> wrote:


Ok!

I'll read PDFBOX-5540 and related issues.

Thank you very much!


On Thu, 14 Mar 2024 at 10:08, Tilman Hausherr wrote:

Hi,

The problem is in the ToUnicode stream, there's a log message "Invalid
ToUnicode CMap in font AvenirNextLTPro-Cn". It has no unicode mappings.
PDFBox is trying a fallback solution which turns out to be wrong. This
is related to PDFBOX-5540 and earlier related issues.

Tilman



On 14.03.2024 13:28, Luiz Marcelo Modesto wrote:

Hi Tilman!

   Thank you very much for your attention!

   You can find the file "p4_alt.pdf" in this folder:
<https://drive.google.com/drive/folders/1AjiwYdDEHVEn4h7e53PosIf_QAk6BDoN?usp=sharing>.
The "Extra infos.pdf" file shows some output from PDF Debugger and others.

   I'm sorry, I sent the 




Re: Bugfix for FileSystemFontProvider

2024-03-15 Thread Guillaume Maillrd

Hi,

Thanks, sorry for this duplicate.
I hope 2.0.31 will be released soon.

Regards,

Guillaume

On 15/03/2024 at 13:51, Tilman Hausherr wrote:

Hi,

Yeah, "never happens" is a red flag. That part has been changed to use 
CRC32:
https://svn.apache.org/viewvc/pdfbox/branches/2.0/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java?revision=1916176&view=markup#l923



https://issues.apache.org/jira/browse/PDFBOX-5727

Tilman

On 15.03.2024 13:45, Guillaume Maillrd wrote:

Hi,

In version 2.0.30, a typo in computeHash in FileSystemFontProvider 
makes every hash come out as "".

This breaks the cache logic, resulting in a very slow loadDiskCache.

Please replace "SHA512" with "SHA-512" or backport the v3 code to use 
CRC32.

The "// never happens" comment looks funny.
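
Both options mentioned above can be sketched with the JDK alone. This is an illustration, not the PDFBox code: the class FontHashSketch and its two helpers are hypothetical. The first helper computes a CRC32 checksum with java.util.zip.CRC32; the second shows a digest lookup that fails loudly on an unrecognised algorithm name instead of silently producing an empty hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class FontHashSketch {

    /** Hypothetical helper: CRC32 checksum of the given bytes as a lowercase hex string. */
    static String crc32Hex(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return Long.toHexString(crc.getValue());
    }

    /** Hypothetical helper: looks up a digest and refuses to continue if the name is unknown. */
    static MessageDigest digest(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            // Surfacing the problem avoids a cache that is keyed on empty hashes
            throw new IllegalStateException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println("CRC32:   " + crc32Hex(sample));
        System.out.println("SHA-256: " + digest("SHA-256").digest(sample).length + " bytes");
    }
}
```

Whether a given spelling of an algorithm name is accepted depends on the installed security providers, which is one reason a swallowed exception in this spot is risky.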

Best regards,

Guillaume








