Jenkins build is still unstable: PDFBox » PDFBox-1.8.x #18

2021-03-12 Thread Apache Jenkins Server
See 



-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Jenkins build is still unstable: PDFBox » PDFBox-1.8.x » Apache JempBox #18

2021-03-12 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300750#comment-17300750
 ] 

Tilman Hausherr commented on PDFBOX-5127:
-

Hmm, this timezone thing needs some more work.

> Multithreading issue in JempBox's DateConverter
> ---
>
> Key: PDFBOX-5127
> URL: https://issues.apache.org/jira/browse/PDFBOX-5127
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> [~tilman] recently found an exception thrown from here
> ([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186])
> in one run of tika-eval but not in another.
>
> This is a multithreading issue caused by the code at
> [https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43]:
> SimpleDateFormat is not thread-safe. I'm surprised we haven't seen this
> earlier, but so it goes.
>  
> Many, many thanks to Tilman for finding this!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300748#comment-17300748
 ] 

Tilman Hausherr commented on PDFBOX-5127:
-

[~tallison] you're right. I realize the multithreading problem only shows up
because both of these files have date formats that are not supported. That's
the reason it hasn't hit any users in the wild.




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300747#comment-17300747
 ] 

ASF subversion and git services commented on PDFBOX-5127:
-

Commit 1887565 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887565 ]

PDFBOX-5127: set timezone due to failing build test on the ci server
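
The commit itself is not reproduced in this thread. As a general illustration of why pinning the time zone makes a date-parsing test deterministic, here is a minimal sketch. The class name, method name and pattern are hypothetical and are not taken from r1887565.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration only; not the code from the commit.
class FixedTimeZoneParseSketch {

    // Parses a timestamp such as "1970-01-02T00:00:00" as UTC and returns
    // the epoch milliseconds.
    static long parseAsUtcMillis(String text) {
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ENGLISH);
        // Without this call the formatter uses TimeZone.getDefault(), which
        // can differ between a developer machine and a CI server.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("unparseable date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // One day after the epoch, independent of the machine's time zone.
        System.out.println(parseAsUtcMillis("1970-01-02T00:00:00")); // prints 86400000
    }
}
```

With the zone fixed on the formatter, a test can compare against a constant epoch value no matter where the build runs.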




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300746#comment-17300746
 ] 

ASF subversion and git services commented on PDFBOX-5127:
-

Commit 1887564 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887564 ]

PDFBOX-5127: create SimpleDateFormat object every time because it isn't 
thread-safe




Jenkins build became unstable: PDFBox » PDFBox-1.8.x » Apache JempBox #17

2021-03-12 Thread Apache Jenkins Server
See 






Jenkins build became unstable: PDFBox » PDFBox-1.8.x #17

2021-03-12 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300722#comment-17300722
 ] 

Tilman Hausherr commented on PDFBOX-5127:
-

I added a minimal test because when I replaced POTENTIAL_FORMATS with an empty 
array the build passed without a single failure.




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300720#comment-17300720
 ] 

ASF subversion and git services commented on PDFBOX-5127:
-

Commit 1887563 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887563 ]

PDFBOX-5127: add minimal test to test happy path, and to make sure that the 
NumberFormatException is hit too




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300589#comment-17300589
 ] 

Tim Allison commented on PDFBOX-5127:
-

My personal pref would be to generate SimpleDateFormat objects as needed.  The 
good news either way (maybe?) is that this is in an exception handling bit, and 
I don't think I've seen it before so it should be pretty rare???




[jira] [Comment Edited] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300579#comment-17300579
 ] 

Tilman Hausherr edited comment on PDFBOX-5127 at 3/12/21, 7:50 PM:
---

Thanks for the explanation!
So either we synchronize access to the {{NumberFormatException}} segment (which 
will make it slower), or we generate SimpleDateFormat objects when needed 
(which will make it slower).


was (Author: tilman):
Thanks for the explanation!
So either we synchronize access to this segment (which will make it slower), or 
we generate SimpleDateFormat objects when needed (which will make it slower).




[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300579#comment-17300579
 ] 

Tilman Hausherr commented on PDFBOX-5127:
-

Thanks for the explanation!
So either we synchronize access to this segment (which will make it slower), or 
we generate SimpleDateFormat objects when needed (which will make it slower).
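
The two options above can be sketched roughly as follows. This is a minimal illustration, not the JempBox code: the class name, method names and pattern list are hypothetical stand-ins for DateConverter and its POTENTIAL_FORMATS.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical sketch; names and patterns are illustrative assumptions.
class ThreadSafeDateParsingSketch {

    // Illustrative patterns only; the real list in DateConverter differs.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd", "dd MMM yyyy"
    };

    // Option 1: one shared formatter, with every use guarded by a lock.
    private static final SimpleDateFormat SHARED =
            new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);

    static Date parseSynchronized(String text) {
        synchronized (SHARED) {
            try {
                return SHARED.parse(text);
            } catch (ParseException e) {
                return null;
            }
        }
    }

    // Option 2: a fresh formatter per call, so no state is shared
    // between threads and no lock is needed.
    static Date parseWithFreshFormatters(String text) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false);
            try {
                return format.parse(text);
            } catch (ParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        return null; // no pattern matched
    }

    public static void main(String[] args) {
        System.out.println(parseSynchronized("2021-03-12"));
        System.out.println(parseWithFreshFormatters("12 Mar 2021"));
    }
}
```

Option 1 serializes all callers on one lock. Option 2 pays for object creation on every call, but avoids contention entirely.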




[jira] [Resolved] (PDFBOX-5129) 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()

2021-03-12 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5129.
-
Fix Version/s: 1.8.17
 Assignee: Tilman Hausherr
   Resolution: Fixed

> 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()
> --
>
> Key: PDFBOX-5129
> URL: https://issues.apache.org/jira/browse/PDFBOX-5129
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.16
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 1.8.17
>
>
> {noformat}
> java.lang.ExceptionInInitializerError: null
>   at com.ibm.icu.util.VersionInfo.getInstance(VersionInfo.java:191)
>   at com.ibm.icu.impl.ICUDebug.getInstanceLenient(ICUDebug.java:65)
>   at com.ibm.icu.impl.ICUDebug.(ICUDebug.java:69)
>   at com.ibm.icu.impl.NormalizerDataReader.(NormalizerDataReader.java:300)
>   at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:288)
>   at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:35)
>   at com.ibm.icu.text.Normalizer.compose(Normalizer.java:873)
>   at com.ibm.icu.text.Normalizer$NFKCMode.normalize(Normalizer.java:469)
>   at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:)
>   at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:1213)
>   at org.apache.pdfbox.util.ICU4JImpl.normalizePres(ICU4JImpl.java:112)
>   at org.apache.pdfbox.util.TextNormalize.normalizePres(TextNormalize.java:140)
> {noformat}
> I'll try with higher versions of icu4j.






[jira] [Commented] (PDFBOX-5129) 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300563#comment-17300563
 ] 

ASF subversion and git services commented on PDFBOX-5129:
-

Commit 1887548 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887548 ]

PDFBOX-5129: update icu4j to highest version that works with jdk6




[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300552#comment-17300552
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887547 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1887547 ]

PDFBOX-4892: update owasp plugin

> Improve code quality (4)
> 
>
> Key: PDFBOX-4892
> URL: https://issues.apache.org/jira/browse/PDFBOX-4892
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.20
>Reporter: Tilman Hausherr
>Priority: Minor
>
> This is a long-term issue for the task to improve code quality, by using the 
> [SonarQube report|https://sonarcloud.io/project/issues?id=pdfbox-reactor], 
> hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4071, which was getting too long.






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300551#comment-17300551
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887546 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1887546 ]

PDFBOX-4892: update owasp plugin




Jenkins build is back to stable : PDFBox » PDFBox-1.8.x #15

2021-03-12 Thread Apache Jenkins Server
See 






Jenkins build is back to stable : PDFBox » PDFBox-1.8.x » Apache PDFBox #15

2021-03-12 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-5129) 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300425#comment-17300425
 ] 

ASF subversion and git services commented on PDFBOX-5129:
-

Commit 1887541 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887541 ]

PDFBOX-5129: update icu4j and put version in parent pom.xml




[jira] [Updated] (PDFBOX-5129) 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()

2021-03-12 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5129:

Description: 
{noformat}
java.lang.ExceptionInInitializerError: null
at com.ibm.icu.util.VersionInfo.getInstance(VersionInfo.java:191)
at com.ibm.icu.impl.ICUDebug.getInstanceLenient(ICUDebug.java:65)
at com.ibm.icu.impl.ICUDebug.(ICUDebug.java:69)
at com.ibm.icu.impl.NormalizerDataReader.(NormalizerDataReader.java:300)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:288)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:35)
at com.ibm.icu.text.Normalizer.compose(Normalizer.java:873)
at com.ibm.icu.text.Normalizer$NFKCMode.normalize(Normalizer.java:469)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:1213)
at org.apache.pdfbox.util.ICU4JImpl.normalizePres(ICU4JImpl.java:112)
at org.apache.pdfbox.util.TextNormalize.normalizePres(TextNormalize.java:140)
{noformat}

I'll try with higher versions of icu4j.

  was:
{noformat}
java.lang.ExceptionInInitializerError: null
at com.ibm.icu.util.VersionInfo.getInstance(VersionInfo.java:191)
at com.ibm.icu.impl.ICUDebug.getInstanceLenient(ICUDebug.java:65)
at com.ibm.icu.impl.ICUDebug.(ICUDebug.java:69)
at com.ibm.icu.impl.NormalizerDataReader.(NormalizerDataReader.java:300)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:288)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:35)
at com.ibm.icu.text.Normalizer.compose(Normalizer.java:873)
at com.ibm.icu.text.Normalizer$NFKCMode.normalize(Normalizer.java:469)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:1213)
at org.apache.pdfbox.util.ICU4JImpl.normalizePres(ICU4JImpl.java:112)
at org.apache.pdfbox.util.TextNormalize.normalizePres(TextNormalize.java:140)
{noformat}






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-03-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300420#comment-17300420
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1887540 from Tilman Hausherr in branch 'pdfbox/branches/1.8'
[ https://svn.apache.org/r1887540 ]

PDFBOX-4892: update junit




[jira] [Created] (PDFBOX-5129) 1.8 build test fails in com.ibm.icu.util.VersionInfo.getInstance()

2021-03-12 Thread Tilman Hausherr (Jira)
Tilman Hausherr created PDFBOX-5129:
---

 Summary: 1.8 build test fails in 
com.ibm.icu.util.VersionInfo.getInstance()
 Key: PDFBOX-5129
 URL: https://issues.apache.org/jira/browse/PDFBOX-5129
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.16
Reporter: Tilman Hausherr


{noformat}
java.lang.ExceptionInInitializerError: null
at com.ibm.icu.util.VersionInfo.getInstance(VersionInfo.java:191)
at com.ibm.icu.impl.ICUDebug.getInstanceLenient(ICUDebug.java:65)
at com.ibm.icu.impl.ICUDebug.(ICUDebug.java:69)
at com.ibm.icu.impl.NormalizerDataReader.(NormalizerDataReader.java:300)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:288)
at com.ibm.icu.impl.NormalizerImpl.(NormalizerImpl.java:35)
at com.ibm.icu.text.Normalizer.compose(Normalizer.java:873)
at com.ibm.icu.text.Normalizer$NFKCMode.normalize(Normalizer.java:469)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:)
at com.ibm.icu.text.Normalizer.normalize(Normalizer.java:1213)
at org.apache.pdfbox.util.ICU4JImpl.normalizePres(ICU4JImpl.java:112)
at org.apache.pdfbox.util.TextNormalize.normalizePres(TextNormalize.java:140)
{noformat}







[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread beat weisskopf (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300418#comment-17300418
 ] 

beat weisskopf commented on PDFBOX-5128:


Maybe related: "ZUGFeRD" (for e-invoices) also uses a custom XMP schema.
https://www.mustangproject.org/ is already based on PDFBox, so there might be
some samples to be found there.

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
>
> XMP parsing currently only supports known XMP schemas, as has been discussed.
> That shall be extended to support arbitrary but valid XMP.






[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300366#comment-17300366
 ] 

Maruan Sahyoun commented on PDFBOX-5128:


Yes, please




[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300365#comment-17300365
 ] 

Tim Allison commented on PDFBOX-5128:
-

I’ll scrape XMP out of our regression corpus. Should I retain the packet
envelope?




Re: 2.0.22 vs 2.0.23

2021-03-12 Thread sahy...@fileaffairs.de
On Friday, 12.03.2021 at 08:15 -0500, Tim Allison wrote:
> > would it make sense to add that support? If yes could we get samples of
> > various schema to support that development? Could look into that if we
> > think that's worth the effort
> 
> I think I can find some XMPs if they'd be of any use! :D

That would be great - maybe together with expected extraction results -
so I can start with proper unit tests. If you could add to

https://issues.apache.org/jira/browse/PDFBOX-5128

that would be great.

BR


-- 
Maruan Sahyoun






[jira] [Created] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Maruan Sahyoun (Jira)
Maruan Sahyoun created PDFBOX-5128:
--

 Summary: Support parsing non standardized XMP 
 Key: PDFBOX-5128
 URL: https://issues.apache.org/jira/browse/PDFBOX-5128
 Project: PDFBox
  Issue Type: Task
  Components: XmpBox
Reporter: Maruan Sahyoun
Assignee: Maruan Sahyoun


XmpBox currently only supports parsing known XMP schemas, as has been discussed. 
That shall be extended to support arbitrary but valid XMP.
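
To make "arbitrary but valid XMP" concrete, here is a minimal sketch that reads simple properties from any namespace with the JDK's DOM parser. It is not XmpBox code and says nothing about how XmpBox will implement this; the class name GenericXmpReader and the method readSimpleProperties are hypothetical. It only handles properties written as child elements of rdf:Description; properties written as attributes are skipped, and structured values (rdf:Bag, rdf:Seq, rdf:Alt) are flattened to their text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of XmpBox or JempBox.
final class GenericXmpReader {
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Returns "{namespaceURI}localName" -> text for every element child of
    // every rdf:Description, without consulting a list of known schemas.
    static Map<String, String> readSimpleProperties(String xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // XMP never needs a DOCTYPE, so refuse one to avoid entity attacks.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> properties = new LinkedHashMap<>();
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            for (Node child = description.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    String key = "{" + child.getNamespaceURI() + "}" + child.getLocalName();
                    properties.put(key, child.getTextContent().trim());
                }
            }
        }
        return properties;
    }
}
```

Keying on the namespace URI rather than the prefix means two documents that bind different prefixes to the same schema produce the same keys.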






Re: 2.0.22 vs 2.0.23

2021-03-12 Thread Tim Allison
> would it make sense to add that support? If yes could we get samples of
> various schema to support that development? Could look into that if we
> think that's worth the effort

I think I can find some XMPs if they'd be of any use! :D




Re: 2.0.22 vs 2.0.23

2021-03-12 Thread Tim Allison
Many, many thanks to Tilman for running the regression tests!

The 2 new exceptions are caused by PDFBOX-5127.  I'm baffled that we
haven't seen these before, but they do require some rare
circumstances.

The 1 new Tika exception is a zero-byte file exception.  This is my
fault because I changed the files between Tilman's runs.

As for XMPBox, Tilman is right that when I tried to use it many years
ago, it did not have the flexibility needed for PDFs in the wild.
See: 
https://lucene.472066.n3.nabble.com/DISCUSS-options-for-XMP-parsing-td4262520.html

2016 me: "I found that it fails on roughly 40% of XMPs I pulled out of
PDFs from govdocs1/commoncrawl"

Cheers,

 Tim

On Thu, Mar 11, 2021 at 1:34 PM Tilman Hausherr  wrote:
>
> Am 11.03.2021 um 09:00 schrieb sahy...@fileaffairs.de:
> >> The three new exceptions weren't in earlier reports.
> >>
> >> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there is
> >> a non standard schema.
> > would it make sense to add that support? If yes could we get samples of
> > various schema to support that development? Could look into that if we
> > think that's worth the effort
>
>
> Here's an example:
>
> https://issues.apache.org/jira/browse/PDFBOX-3440
>
>
> Tilman
>
>
>
> >
> > Maruan
> >
> >
> >> Tilman
> >>
> >>
> >>
>
>
>




[jira] [Created] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5127:
---

 Summary: Multithreading issue in JempBox's DateConverter
 Key: PDFBOX-5127
 URL: https://issues.apache.org/jira/browse/PDFBOX-5127
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison


[~tilman] recently found an exception thrown from here 
([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186])
 in one run of tika-eval but not in another. 

 

This is a multithreading issue caused by 
[https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43]:
 SimpleDateFormat is not thread-safe.  I'm surprised we haven't seen this 
earlier, but so it goes.

 

Many, many thanks to Tilman for finding this!
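
For reference, one common way to avoid sharing a SimpleDateFormat across threads is to give each thread its own instance. The following is a minimal sketch of that pattern only; it is not the DateConverter code and not necessarily the fix that will be applied. The class name ThreadSafeDateParser and the date pattern are assumptions chosen for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration; not the JempBox DateConverter implementation.
final class ThreadSafeDateParser {
    // SimpleDateFormat keeps mutable parsing state, so a single shared
    // instance can corrupt results or throw when used concurrently.
    // ThreadLocal hands every thread its own private instance instead.
    private static final ThreadLocal<SimpleDateFormat> ISO_FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format =
                        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                format.setLenient(false);
                return format;
            });

    // Parses a timestamp such as "2021-03-12T08:15:00", interpreted as UTC.
    static Date parse(String text) throws ParseException {
        return ISO_FORMAT.get().parse(text);
    }
}
```

Creating a new SimpleDateFormat per call, or synchronizing on the shared instance, would also work; the ThreadLocal variant avoids both the repeated allocation and the lock.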






[GitHub] [pdfbox] valerybokov commented on pull request #107: potential memory leaks and small performance improvements

2021-03-12 Thread GitBox


valerybokov commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-797343664


   It looks like the variable byteRange can be null inside 
PDSignature.getByteRange; either that should be handled or it needs a comment.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[GitHub] [pdfbox] valerybokov commented on pull request #107: potential memory leaks and small performance improvements

2021-03-12 Thread GitBox


valerybokov commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-797325489


   The implementation of COSArray.getInt(...) and similar methods is a bit 
strange. If the index is beyond the end of the array, defaultValue is returned 
instead of an IndexOutOfBoundsException being thrown. Maybe defaultValue should 
be returned if obj is not a COSNumber?
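
The two contracts being compared can be sketched side by side. This is a stand-in written for illustration, not the COSArray source: the class LenientArray and the method getIntStrict are hypothetical, and plain java.lang.Number takes the place of COSNumber.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not PDFBox's COSArray.
final class LenientArray {
    private final List<Object> items;

    LenientArray(Object... items) {
        this.items = Arrays.asList(items);
    }

    // Lenient contract as described in the comment above: an index outside
    // the array yields defaultValue rather than an exception.
    int getInt(int index, int defaultValue) {
        if (index < 0 || index >= items.size()) {
            return defaultValue;
        }
        Object obj = items.get(index);
        return obj instanceof Number ? ((Number) obj).intValue() : defaultValue;
    }

    // Stricter alternative: a bad index is treated as a caller bug and
    // throws, while defaultValue covers an entry that is not a number.
    int getIntStrict(int index, int defaultValue) {
        Object obj = items.get(index); // throws IndexOutOfBoundsException
        return obj instanceof Number ? ((Number) obj).intValue() : defaultValue;
    }
}
```

The lenient form tolerates malformed input silently; the strict form surfaces indexing mistakes early but pushes bounds checks onto every caller that reads untrusted data.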






Jenkins build is still unstable: PDFBox » PDFBox-1.8.x #14

2021-03-12 Thread Apache Jenkins Server
See 






Jenkins build is still unstable: PDFBox » PDFBox-1.8.x » Apache PDFBox #14

2021-03-12 Thread Apache Jenkins Server
See 


