[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766567#comment-17766567
 ] 

Tim Allison commented on PDFBOX-5682:
-

Wow. Thank you!

> Long/permanent hang in PDFBox 3.x
> -
>
> Key: PDFBOX-5682
> URL: https://issues.apache.org/jira/browse/PDFBOX-5682
> Project: PDFBox
> Issue Type: Bug
> Reporter: Tim Allison
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 3.0.1 PDFBox, 4.0.0
>
>
> I found two files in the regression tests where we're now getting timeouts at 
> 3 minutes where we weren't before.  Unfortunately, PDFBox's export:text works 
> on both, so it is probably another structural feature, perhaps a problem in 
> Tika?
> This file halts after printing out the header for Table 19 on page 46: 
> https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf
> Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
> invalid or missing type null", but it does finish quickly.
> This file halts after extracting {{"854,793,592"}}: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY
> Pure PDFBox's export:text processes this without problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228
 ] 

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--

This is the part from that document that is, erm, eye-opening:

{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the
document catalog as the place to make the association. However, the concept of
Associated Files goes well beyond association only with the file as a whole, and also
allows for defining relations between embedded files and certain pages, annotations,
form fields, graphics objects, structure elements in the tagging structure, DParts or any
other PDF object.
{noformat}

And, yes, the document goes on to say, PDF writers should do the traditional 
thing, but...







[jira] [Comment Edited] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225
 ] 

Tim Allison edited comment on PDFBOX-5682 at 9/12/23 2:41 PM:
--

Thank you, [~lehmi].  In Tika, we initially copied PDFBox's 
ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached 
files/file specs/associated files on pretty much anything 
(https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an 
attachment not in the name tree and not in an annotation on a page, but after 
making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x. 

Perhaps we allow users to turn off the "scan every object for an embedded file" 
on the Tika side?






[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764228#comment-17764228
 ] 

Tim Allison commented on PDFBOX-5682:
-

This is the part from that document that is, erm, eye-opening:

{noformat}
4.2 AF entry not in the catalog
4.2.1 General
Most existing applications that take advantage of Associated Files use the AF entry in the
document catalog as the place to make the association. However, the concept of
Associated Files goes well beyond association only with the file as a whole, and also
allows for defining relations between embedded files and certain pages, annotations,
form fields, graphics objects, structure elements in the tagging structure, DParts or any
other PDF object.
{noformat}




[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764225#comment-17764225
 ] 

Tim Allison commented on PDFBOX-5682:
-

Thank you, [~lehmi].  In Tika, we initially copied PDFBox's 
ExtractEmbeddedFiles example, but we found that PDF writers can stuff attached 
files/file specs/associated files on pretty much anything 
(https://www.pdfa.org/wp-content/uploads/2018/10/PDF20_AN002-AF.pdf).

From what we can tell with publicly available corpora, it is rare to have an 
attachment not in the name tree and not in an annotation on a page, but after 
making the change in TIKA-4012, we did find a few new attachments.

This may be a "won't fix" in 3.x. 




[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763903#comment-17763903
 ] 

Tim Allison commented on PDFBOX-5682:
-

Both files spend quite a bit of time in "parseObjectDynamically" when I call 
this:

PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);





[jira] [Commented] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763904#comment-17763904
 ] 

Tim Allison commented on PDFBOX-5682:
-

It looks like that causes a full parse of the file?
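
To illustrate the suspicion, here is a toy model in plain Java (not PDFBox code) of why a by-type query can amount to a full parse: if objects are only materialized on demand, checking each one's type forces every object to be loaded. `LazyObjectTable`, `register`, and `idsOfType` are hypothetical names used only for this sketch, and the string "types" stand in for real parsed objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy stand-in for a table of lazily parsed objects (hypothetical, not PDFBox).
final class LazyObjectTable {
    private final Map<Integer, Supplier<String>> loaders = new LinkedHashMap<>();
    private final Map<Integer, String> loaded = new LinkedHashMap<>();
    private int loadCount = 0;

    // Registers an object id together with the (possibly expensive) work needed to parse it.
    void register(int id, Supplier<String> loader) {
        loaders.put(id, loader);
    }

    // Parses the object on first access and caches the result.
    private String load(int id) {
        return loaded.computeIfAbsent(id, k -> {
            loadCount++;
            return loaders.get(k).get();
        });
    }

    // The type is only known after loading, so every registered object gets loaded.
    List<Integer> idsOfType(String type) {
        List<Integer> hits = new ArrayList<>();
        for (Integer id : loaders.keySet()) {
            if (type.equals(load(id))) {
                hits.add(id);
            }
        }
        return hits;
    }

    int loadCount() {
        return loadCount;
    }
}
```

In this model a single `idsOfType` call drives `loadCount` up to the number of registered objects, no matter how few of them match.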




[jira] [Updated] (PDFBOX-5682) Long/permanent hang in PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5682:

Summary: Long/permanent hang in PDFBox 3.x  (was: Long/permanent hang i n PDFBox 3.x)




[jira] [Created] (PDFBOX-5682) Long/permanent hang i n PDFBox 3.x

2023-09-11 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5682:
---

 Summary: Long/permanent hang i n PDFBox 3.x
 Key: PDFBOX-5682
 URL: https://issues.apache.org/jira/browse/PDFBOX-5682
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison


I found two files in the regression tests where we're now getting timeouts at 3 
minutes where we weren't before.  Unfortunately, PDFBox's export:text works on 
both, so it is probably another structural feature, perhaps a problem in Tika?

This file halts after printing out the header for Table 19 on page 46: 
https://corpora.tika.apache.org/base/docs/govdocs1/078/078656.pdf

Pure PDFBox's export:text complains multiple times: "Page skipped due to an 
invalid or missing type null", but it does finish quickly.

This file halts after extracting {{"854,793,592"}}: 
https://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/G7/G7BO7PNCCREVF2BCY5YSYOPYDLMBYASY

Pure PDFBox's export:text processes this without problem.
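
For anyone reproducing this outside the regression harness, here is a minimal stdlib-only sketch of bounding a possibly hanging call with a time limit. `TimeLimitedRun` and `runWithLimit` are hypothetical names; this is not how the Tika regression tests are implemented, only one way to get a comparable timeout.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a task on a worker thread and gives up after a limit.
final class TimeLimitedRun {

    // Returns the task's result, or Optional.empty() if the limit is exceeded.
    static <T> Optional<T> runWithLimit(Callable<T> task, long limitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = pool.submit(task);
            try {
                return Optional.ofNullable(future.get(limitMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                // Ask the worker to stop; this only sets the interrupt flag.
                future.cancel(true);
                return Optional.empty();
            } catch (InterruptedException e) {
                future.cancel(true);
                Thread.currentThread().interrupt();
                return Optional.empty();
            } catch (ExecutionException e) {
                throw new IllegalStateException("task failed", e.getCause());
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that `cancel(true)` and `shutdownNow()` only interrupt the worker. A parse loop that never checks the interrupt flag would keep running in the background, so a truly stuck parse may need a separate process to be killed reliably.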






[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763759#comment-17763759
 ] 

Tim Allison commented on PDFBOX-5681:
-

When I run the demo code in PDFBox trunk with logging on, I see this in the log 
before the new exception.  Further, when running debug in the PDFBox project, I 
can confirm that the xrefTable is somehow being modified during the iteration 
of the objects.

{noformat}
11.09.2023 10:49:07 ERROR cos.COSObject:126 - Can't dereference COSObject{5, 0}
java.io.IOException: Wrong type of referenced length object COSObject{6, 0}: 
COSDictionary
at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:845) 
~[classes/:?]
at 
org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:875) 
~[classes/:?]
at 
org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:710) 
~[classes/:?]
at 
org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:631)
 ~[classes/:?]
at 
org.apache.pdfbox.pdfparser.COSParser.dereferenceCOSObject(COSParser.java:586) 
~[classes/:?]
at org.apache.pdfbox.cos.COSObject.getObject(COSObject.java:121) 
~[classes/:?]
at 
org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:257) 
~[classes/:?]
at 
org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240) 
~[classes/:?]
at 
org.apache.pdfbox.TestConcurrentModification.oneOff(TestConcurrentModification.java:18)
 ~[test-classes/:?]
...
{noformat}
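
One common way to tolerate a table that grows while it is being walked is to iterate over a snapshot of the keys. The sketch below is plain Java on a toy map, not the PDFBox code and not necessarily the fix that PDFBox adopted; `SnapshotIterationSketch` and `visitAll` are hypothetical names, and the `putIfAbsent` call merely stands in for a dereference that registers a newly discovered object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration (hypothetical names): iterate over a copy of the key set.
final class SnapshotIterationSketch {

    // Visits every key present when the call starts, even though the
    // per-key work adds further keys to the same map.
    static List<Integer> visitAll(Map<Integer, String> table) {
        List<Integer> visited = new ArrayList<>();
        // Copy the keys first so later insertions cannot invalidate the iterator.
        for (Integer key : new ArrayList<>(table.keySet())) {
            visited.add(key);
            // Stand-in for a dereference that discovers a new object.
            table.putIfAbsent(key + 1000, "discovered while loading " + key);
        }
        return visited;
    }
}
```

The trade-off is that entries added during the walk are not visited in that pass, so a caller that needs them would have to loop again until the table stops growing.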

> ConcurrentModificationException in getObjectsByType() in 3.x
> 
>
> Key: PDFBOX-5681
> URL: https://issues.apache.org/jira/browse/PDFBOX-5681
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.0 PDFBox
> Reporter: Tim Allison
> Priority: Minor
> Attachments: PDFBOX-3714-2.pdf
>
>
> [~tilman]'s regression testing turned up this exception when we integrate 
> PDFBox 3.0.0 into Tika:
> {noformat}
> java.util.ConcurrentModificationException
>   at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
>   at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
>   at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
>   at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
> {noformat}
> I can replicate this exception consistently on the attached file.
> With this code:
> {noformat}
> Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
> PDDocument document = Loader.loadPDF(path.toFile());
> List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
> {noformat}






[jira] [Commented] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763754#comment-17763754
 ] 

Tim Allison commented on PDFBOX-5681:
-

I initially thought this was a threading issue, but it isn't.  The exception 
can be thrown if any modification is made to the underlying collection while 
the iterator is iterating, even if in the same thread.

My guess is that the computeIfAbsent call in {{getObjectFromPool}} is somehow 
changing the xRefTable keyset that is being iterated over???

There may be another iteration + modification on a different collection during 
the parse.  The triggering object {{5 0 R}} requires parsing numerous objects 
from an xrefstream.
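
As a minimal, single-threaded illustration of the point above (plain Java on a toy map, not PDFBox code; `SingleThreadCmeDemo` is a hypothetical name), a `computeIfAbsent` that inserts a new key while the key set is being iterated is enough to trigger the exception:

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Toy demo (hypothetical names): no second thread is involved anywhere.
final class SingleThreadCmeDemo {

    // Returns true if inserting a key during key-set iteration throws.
    static boolean mutateWhileIterating() {
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "a");
        table.put(2, "b");
        table.put(3, "c");
        try {
            for (Integer key : table.keySet()) {
                // Structural modification: a brand-new key is added mid-iteration.
                table.computeIfAbsent(key + 100, k -> "added");
            }
        } catch (ConcurrentModificationException e) {
            // Thrown by the iterator's next() call after the insertion.
            return true;
        }
        return false;
    }
}
```

`HashMap` iterators are fail-fast on a best-effort basis, which is why the failure shows up on the iterator's next step rather than at the insertion itself.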







[jira] [Created] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5681:
---

 Summary: ConcurrentModificationException in getObjectsByType() in 3.x
 Key: PDFBOX-5681
 URL: https://issues.apache.org/jira/browse/PDFBOX-5681
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison
 Attachments: PDFBOX-3714-2.pdf

[~tilman]'s regression testing turned up this exception when we integrate 
PDFBox 3.0.0 into Tika:

{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}

I can replicate this exception consistently on this file: 

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}






[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5681:

Affects Version/s: 3.0.0 PDFBox




[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5681:

Description: 
[~tilman]'s regression testing turned up this exception when we integrate 
PDFBox 3.0.0 into Tika:

{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}

I can replicate this exception consistently on the attached file.

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}

  was:
[~tilman]'s regression testing turned up this exception when we integrate 
PDFBox 3.0.0 into Tika:

{noformat}
java.util.ConcurrentModificationException
    at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1620)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:254)
    at org.apache.pdfbox.cos.COSDocument.getObjectsByType(COSDocument.java:240)
{noformat}

I can replicate this exception consistently on this file: 

With this code:
{noformat}
Path path = Paths.get("/.../PDFBOX-3714-2.pdf");
PDDocument document = Loader.loadPDF(path.toFile());
List<COSObject> objs = document.getDocument().getObjectsByType(COSName.FILESPEC);
{noformat}





[jira] [Updated] (PDFBOX-5681) ConcurrentModificationException in getObjectsByType() in 3.x

2023-09-11 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5681:

Issue Type: Bug  (was: Task)




[jira] [Updated] (PDFBOX-5595) Slight regression on corrupt bug tracker file

2023-05-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5595:

Description: 
I'm not sure this is a regression, and apologies if you already dealt with this 
before the release of 2.0.28.  Also, as a warning, this file is corrupt.

 

We used to get more text out of this file in 2.0.27 than we do now in 2.0.28: 
[https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]

 

This file was derived from the evince bug tracker, which now eventually links 
to this issue:

[https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]

 

This image from the poppler issue shows what we get with PDFBox 2.0.28 on the 
left, and 2.0.27 on the right.

 

If the decision is "the file is corrupt -> not going to fix", I completely 
understand.

!https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!

  was:
I'm not sure this is a regression, and apologies if you already dealt with this 
before the release of 2.0.28.  Also, as a warning, this file is corrupt.

 

We used to get more text out of this file in 2.0.27 than we do now in 2.0.28: 
[https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]

 

This file was derived from the evince bug tracker, which now eventually links 
to this issue:

[https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]

 

This image shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the 
right.

 

If the decision is "the file is corrupt -> not going to fix", I completely 
understand.

!https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!





[jira] [Created] (PDFBOX-5595) Slight regression on corrupt bug tracker file

2023-05-05 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5595:
---

 Summary: Slight regression on corrupt bug tracker file
 Key: PDFBOX-5595
 URL: https://issues.apache.org/jira/browse/PDFBOX-5595
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison


I'm not sure this is a regression, and apologies if you already dealt with this 
before the release of 2.0.28.  Also, as a warning, this file is corrupt.

 

We used to get more text out of this file in 2.0.27 than we do now in 2.0.28: 
[https://corpora.tika.apache.org/base/docs/bug_trackers/evince/evince-395-0.zip-0.pdf]

 

This file derived from the evince bug tracker, which now eventually links to 
this issue:

[https://gitlab.freedesktop.org/poppler/poppler/-/issues/323]

 

This image shows what we get with PDFBox 2.0.28 on the left, and 2.0.27 on the 
right.

 

If the decision is "the file is corrupt -> not going to fix", I completely 
understand.

!https://gitlab.gnome.org/GNOME/evince/uploads/0bc2302dbafc0bbc2110f0d42951428e/evince.JPG!






[jira] [Updated] (PDFBOX-5550) reduce number of open files

2022-12-05 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5550:

Summary: reduce number of open files  (was: redcuce number of open files)

> reduce number of open files
> ---
>
> Key: PDFBOX-5550
> URL: https://issues.apache.org/jira/browse/PDFBOX-5550
> Project: PDFBox
>  Issue Type: Improvement
>  Components: IO
>Affects Versions: 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
>
> {{org.apache.pdfbox.io.RandomAccessReadBufferedFile}} creates a new instance 
> of {{org.apache.pdfbox.io.RandomAccessReadBufferedFile}}, which opens a new 
> file handle on the underlying file, every time a new view is created. The 
> view of a COSStream most likely isn't closed until the entire PDF is closed. 
> In the end, there are as many open files as COSStreams created until the PDF 
> is closed.
>  
>  
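To illustrate the problem described above, here is a minimal sketch of one common way to avoid one file handle per view: every view shares a single channel and uses positional reads. The names SharedFile and View are hypothetical and are not the PDFBox classes; this is not a claim about how PDFBox implemented the fix.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; these are not the PDFBox classes.
public final class SharedFile implements AutoCloseable {
    private final FileChannel channel;

    public SharedFile(Path path) throws IOException {
        // The only place where a file handle is opened.
        this.channel = FileChannel.open(path, StandardOpenOption.READ);
    }

    /** Creates a view over [offset, offset + length) without opening another handle. */
    public View view(long offset, int length) {
        return new View(offset, length);
    }

    @Override
    public void close() throws IOException {
        // Closing the shared channel invalidates all views at once.
        channel.close();
    }

    public final class View {
        private final long offset;
        private final int length;

        View(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        /** Reads the view's bytes with positional reads on the shared channel. */
        public byte[] readAll() throws IOException {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n < 0) {
                    break; // the view extends past the end of the file
                }
                pos += n;
            }
            byte[] out = new byte[buf.position()];
            buf.flip();
            buf.get(out);
            return out;
        }
    }
}
```

Positional reads do not move the channel's own position, so any number of views can coexist while the process holds exactly one open file.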






[jira] [Commented] (PDFBOX-5540) export:text creates jibberish / malformed output

2022-11-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635337#comment-17635337
 ] 

Tim Allison commented on PDFBOX-5540:
-

Should I kick that off now?

> export:text creates jibberish / malformed output
> 
>
> Key: PDFBOX-5540
> URL: https://issues.apache.org/jira/browse/PDFBOX-5540
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.16, 2.0.27, 3.0.0 PDFBox
> Environment: Same on Windows, Linux and macOS
>Reporter: Alfons
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: regression
> Fix For: 2.0.28, 3.0.0 PDFBox
>
> Attachments: PDFBOX-5540.pdf.txt, test.pdf, test.txt
>
>
> I am using PDFBox as part of Tika and having issues with some PDFs that 
> produce unreadable output. Copying text from Adobe / macOS Preview / browsers 
> works as expected.
> I have also tried "re-encoding" the PDF by editing and saving it with 
> Acrobat, thinking the original PDF creator could be at fault, and running 
> PDFBox with different encodings, but the output remained mostly unchanged.
> I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
> {code:java}
> root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf          
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Invalid ToUnicode CMap in font 
> Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNUNG: Using predefined identity CMap instead {code}






[jira] [Commented] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602789#comment-17602789
 ] 

Tim Allison commented on PDFBOX-5501:
-

Thank you!

> Jempbox is slow on xmp with large event histories
> -
>
> Key: PDFBOX-5501
> URL: https://issues.apache.org/jira/browse/PDFBOX-5501
> Project: PDFBox
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Minor
> Attachments: big.xmp.gz
>
>
> In looking at the timeouts in a recent run against 8 million PDFs, I found 
> one file where the processing time was caused by extremely slow parsing of 
> the media management schema.
> If I do enough subclassing and put a hard limit inside 
> getEventSequenceList(), the processing time is fairly quick.
> I realize that Jempbox is not going to be supported going forward and 
> understand if this is a "do not fix".






[jira] [Resolved] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved PDFBOX-5501.
-
Resolution: Not A Problem

 Y. I just also confirmed that this is fixed in 1.8.17-SNAPSHOT.  
Sorry about that.  Thank you.  Any plans for the 1.8.17 release? 

> Jempbox is slow on xmp with large event histories
> -
>
> Key: PDFBOX-5501
> URL: https://issues.apache.org/jira/browse/PDFBOX-5501
> Project: PDFBox
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Minor
> Attachments: big.xmp.gz
>
>
> In looking at the timeouts in a recent run against 8 million PDFs, I found 
> one file where the processing time was caused by extremely slow parsing of 
> the media management schema.
> If I do enough subclassing and put a hard limit inside 
> getEventSequenceList(), the processing time is fairly quick.
> I realize that Jempbox is not going to be supported going forward and 
> understand if this is a "do not fix".






[jira] [Created] (PDFBOX-5501) Jempbox is slow on xmp with large event histories

2022-09-08 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5501:
---

 Summary: Jempbox is slow on xmp with large event histories
 Key: PDFBOX-5501
 URL: https://issues.apache.org/jira/browse/PDFBOX-5501
 Project: PDFBox
  Issue Type: Wish
Reporter: Tim Allison
 Attachments: big.xmp.gz

In looking at the timeouts in a recent run against 8 million PDFs, I found one 
file where the processing time was caused by extremely slow parsing of the 
media management schema.

If I do enough subclassing and put a hard limit inside getEventSequenceList(), 
the processing time is fairly quick.

I realize that Jempbox is not going to be supported going forward and 
understand if this is a "do not fix".
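As a rough sketch of what a hard limit of this kind looks like in general: the helper below is generic, not Jempbox code, and the names BoundedList and firstN are hypothetical. It only shows the idea of capping how many entries are collected from a potentially huge sequence.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Generic illustration; not part of Jempbox.
public class BoundedList {
    /** Copies at most maxItems elements from source and ignores the rest. */
    public static <T> List<T> firstN(Iterable<T> source, int maxItems) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = source.iterator();
        // Stop as soon as the cap is reached, however long the source is.
        while (it.hasNext() && out.size() < maxItems) {
            out.add(it.next());
        }
        return out;
    }
}
```

The trade-off is that entries beyond the cap are silently dropped, so a caller may want to record that truncation happened.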






[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578904#comment-17578904
 ] 

Tim Allison commented on PDFBOX-5490:
-

Y.  Completely understand.  I don't want to impede 3.0.0.   Thank you!  
  

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> When a PDF can be loaded successfully, it would be helpful to include 
> information about what had to be fixed in order to load it.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-11 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578510#comment-17578510
 ] 

Tim Allison commented on PDFBOX-5490:
-

My initial request would be to know whether or not the xref table had to be 
rebuilt, largely because I'm somewhat interested in that at the moment. 

Any info from the pre-DOM stage about what had to be guessed or assumed would 
also help, e.g. alleged object stream length != actual object stream length.

Other places where PDFBox currently logs warnings (missing font, missing 
Unicode mappings, etc.) after the DOM has been built would also be useful.

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> When a PDF can be loaded successfully, it would be helpful to include 
> information about what had to be fixed in order to load it.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578129#comment-17578129
 ] 

Tim Allison commented on PDFBOX-5490:
-

Oh, that looks great.

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> When a PDF can be loaded successfully, it would be helpful to include 
> information about what had to be fixed in order to load it.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578055#comment-17578055
 ] 

Tim Allison commented on PDFBOX-5490:
-

A Listener would be great.  Any mechanism that allows programmatic retrieval, 
per file, of the problems encountered during the parse would work.
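
To make the request concrete, here is a minimal sketch of what such a listener could look like. All names (ParseProblemListener, CollectingListener, the problem type strings) are hypothetical and are not PDFBox API; the assumption is that a parser would call the listener whenever it has to repair something.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListenerSketch {
    // Hypothetical callback; not part of the PDFBox API.
    public interface ParseProblemListener {
        void onProblem(String type, String detail);
    }

    // One instance per document, so problems from different files parsed
    // on different threads never end up mixed together.
    public static final class CollectingListener implements ParseProblemListener {
        private final List<String> problems = new ArrayList<>();

        @Override
        public synchronized void onProblem(String type, String detail) {
            problems.add(type + ": " + detail);
        }

        public synchronized List<String> getProblems() {
            // Return a snapshot so callers cannot modify the internal list.
            return Collections.unmodifiableList(new ArrayList<>(problems));
        }
    }

    public static void main(String[] args) {
        CollectingListener listener = new CollectingListener();
        // A parser would make calls like these while repairing a file.
        listener.onProblem("XREF_REBUILT", "xref table had to be rebuilt");
        listener.onProblem("STREAM_LENGTH", "declared length != actual length");
        for (String problem : listener.getProblems()) {
            System.out.println(problem);
        }
    }
}
```

Because the caller owns the listener, the problems stay attached to the file being parsed instead of being interleaved in a shared log.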

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> When a PDF can be loaded successfully, it would be helpful to include 
> information about what had to be fixed in order to load it.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Updated] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5490:

Component/s: Parsing

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> When a PDF can be loaded successfully, it would be helpful to include 
> information about what had to be fixed in order to load it.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Created] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5490:
---

 Summary: Add reconstruction information to the PDDocument
 Key: PDFBOX-5490
 URL: https://issues.apache.org/jira/browse/PDFBOX-5490
 Project: PDFBox
  Issue Type: Wish
Reporter: Tim Allison


When the xref has to be rebuilt or there are other anomalies in the parsing of 
the PDDocument, the results are currently logged.  In a multithreaded 
environment it is not easy to reconstruct which documents had which problems.

When a PDF can be loaded successfully, it would be helpful to include 
information about what had to be fixed in order to load it.  
Certainly, rebuilding the xref table comes to mind, but any other info would 
also be useful.

This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5431:

Description: 
I noticed a new NPE in one of our test files on Tika when I recently built 
PDFBox's trunk.  I've attached the file.

If I don't set strict parsing to false, the parse works.


{noformat}
DomXmpParser xmpParser = new DomXmpParser();
xmpParser.setStrictParsing(false);
Path p = Paths.get(".../metadata.xml");
try (InputStream is = Files.newInputStream(p)) {
    XMPMetadata metadata = xmpParser.parse(is);
    for (XMPSchema schema : metadata.getAllSchemas()) {
        for (AbstractField f : schema.getAllProperties()) {
            System.out.println(f);
        }
    }
}
{noformat}

Stack
{noformat}
java.lang.NullPointerException
    at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
    at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
    at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
    at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
    at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
{noformat}

  was:
I noticed a new NPE in one of our test files on Tika when I recently built 
PDFBox's trunk.  I've attached the file.

If I don't set strict parsing to false, the parse works.


{noformat}
DomXmpParser xmpParser = new DomXmpParser();
xmpParser.setStrictParsing(false);
Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml");
try (InputStream is = Files.newInputStream(p)) {
    XMPMetadata metadata = xmpParser.parse(is);
    for (XMPSchema schema : metadata.getAllSchemas()) {
        for (AbstractField f : schema.getAllProperties()) {
            System.out.println(f);
        }
    }
}
{noformat}

Stack
{noformat}
java.lang.NullPointerException
    at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
    at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
    at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
    at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
    at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
{noformat}


> New NPE in xmpbox parser in trunk
> -
>
> Key: PDFBOX-5431
> URL: https://issues.apache.org/jira/browse/PDFBOX-5431
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Affects Versions: 3.0.0 PDFBox
>Reporter: Tim Allison
>Priority: Major
> Attachments: metadata.xml
>
>
> I noticed a new NPE in one of our test files on Tika when I recently built 
> PDFBox's trunk.  I've attached the file.
> If I don't set strict parsing to false, the parse works.
> {noformat}
> DomXmpParser xmpParser = new DomXmpParser();
> xmpParser.setStrictParsing(false);
> Path p = Paths.get(".../metadata.xml");
> try (InputStream is = Files.newInputStream(p)) {
>     XMPMetadata metadata = xmpParser.parse(is);
>     for (XMPSchema schema : metadata.getAllSchemas()) {
>         for (AbstractField f : schema.getAllProperties()) {
>             System.out.println(f);
>         }
>     }
> }
> {noformat}
> Stack
> {noformat}
> java.lang.NullPointerException
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
>     at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
>     at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
>     at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
>     at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
>     at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
>     at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
> {noformat}



--

[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5431:

Component/s: XmpBox

> New NPE in xmpbox parser in trunk
> -
>
> Key: PDFBOX-5431
> URL: https://issues.apache.org/jira/browse/PDFBOX-5431
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Affects Versions: 3.0.0 PDFBox
>Reporter: Tim Allison
>Priority: Major
> Attachments: metadata.xml
>
>
> I noticed a new NPE in one of our test files on Tika when I recently built 
> PDFBox's trunk.  I've attached the file.
> If I don't set strict parsing to false, the parse works.
> {noformat}
> DomXmpParser xmpParser = new DomXmpParser();
> xmpParser.setStrictParsing(false);
> Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml");
> try (InputStream is = Files.newInputStream(p)) {
>     XMPMetadata metadata = xmpParser.parse(is);
>     for (XMPSchema schema : metadata.getAllSchemas()) {
>         for (AbstractField f : schema.getAllProperties()) {
>             System.out.println(f);
>         }
>     }
> }
> {noformat}
> Stack
> {noformat}
> java.lang.NullPointerException
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
>     at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
>     at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
>     at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
>     at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
>     at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
>     at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
> {noformat}






[jira] [Updated] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5431:

Affects Version/s: 3.0.0 PDFBox

> New NPE in xmpbox parser in trunk
> -
>
> Key: PDFBOX-5431
> URL: https://issues.apache.org/jira/browse/PDFBOX-5431
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 3.0.0 PDFBox
>Reporter: Tim Allison
>Priority: Major
> Attachments: metadata.xml
>
>
> I noticed a new NPE in one of our test files on Tika when I recently built 
> PDFBox's trunk.  I've attached the file.
> If I don't set strict parsing to false, the parse works.
> {noformat}
> DomXmpParser xmpParser = new DomXmpParser();
> xmpParser.setStrictParsing(false);
> Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml");
> try (InputStream is = Files.newInputStream(p)) {
>     XMPMetadata metadata = xmpParser.parse(is);
>     for (XMPSchema schema : metadata.getAllSchemas()) {
>         for (AbstractField f : schema.getAllProperties()) {
>             System.out.println(f);
>         }
>     }
> }
> {noformat}
> Stack
> {noformat}
> java.lang.NullPointerException
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
>     at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
>     at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
>     at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
>     at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
>     at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
>     at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
>     at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
> {noformat}






[jira] [Created] (PDFBOX-5431) New NPE in xmpbox parser in trunk

2022-05-10 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5431:
---

 Summary: New NPE in xmpbox parser in trunk
 Key: PDFBOX-5431
 URL: https://issues.apache.org/jira/browse/PDFBOX-5431
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison
 Attachments: metadata.xml

I noticed a new NPE in one of our test files on Tika when I recently built 
PDFBox's trunk.  I've attached the file.

If I don't set strict parsing to false, the parse works.


{noformat}
DomXmpParser xmpParser = new DomXmpParser();
xmpParser.setStrictParsing(false);
Path p = Paths.get("/home/tallison/Desktop/tmp/META-INF/metadata.xml");
try (InputStream is = Files.newInputStream(p)) {
    XMPMetadata metadata = xmpParser.parse(is);
    for (XMPSchema schema : metadata.getAllSchemas()) {
        for (AbstractField f : schema.getAllProperties()) {
            System.out.println(f);
        }
    }
}
{noformat}

Stack
{noformat}
java.lang.NullPointerException
    at org.apache.xmpbox.xml.DomXmpParser.parseLiDescription(DomXmpParser.java:608)
    at org.apache.xmpbox.xml.DomXmpParser.parseLiElement(DomXmpParser.java:529)
    at org.apache.xmpbox.xml.DomXmpParser.manageArray(DomXmpParser.java:487)
    at org.apache.xmpbox.xml.DomXmpParser.createProperty(DomXmpParser.java:352)
    at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:319)
    at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:248)
    at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:201)
    at org.apache.tika.parser.indesign.IDMLParserTest.testXMP(IDMLParserTest.java:81)
{noformat}






[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17522531#comment-17522531
 ] 

Tim Allison commented on PDFBOX-5415:
-

An answer on the Tika side. Yes, parsing is dangerous and you’ll need to 
isolate at the process level; thread level isolation is not enough. See what we 
offer in Tika for robustness: 
https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=148647830#content/view/148647830
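
For readers who want the gist of process-level isolation without Tika, here is a minimal sketch: run the parser in a child JVM (or any other command) and kill it if it exceeds a timeout. The class name IsolatedRun and the 3-minute timeout are assumptions for illustration, and this is not how Tika itself implements it. It needs Java 9+ for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class IsolatedRun {
    /**
     * Runs a command in a child process and kills it if it exceeds the timeout.
     * Returns the child's exit code, or -1 if the child had to be killed.
     */
    public static int runWithTimeout(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard child output so a full pipe buffer cannot block the child.
        pb.redirectErrorStream(true);
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process p = pb.start();
        if (!p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            // An infinite loop in the child cannot be interrupted from a
            // thread, but the whole process can always be killed.
            p.destroyForcibly();
            p.waitFor();
            return -1;
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: IsolatedRun <command> [args...]");
            return;
        }
        int exit = runWithTimeout(Arrays.asList(args), 180_000L);
        System.out.println(exit == -1 ? "timed out, child killed" : "exit code " + exit);
    }
}
```

A thread stuck in an infinite loop or an OutOfMemoryError can take the whole JVM down with it, which is why the child process is the unit of isolation here.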

> Infinite loop in ExtractText in 2.x branch on a specific pdf
> 
>
> Key: PDFBOX-5415
> URL: https://issues.apache.org/jira/browse/PDFBOX-5415
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26
>Reporter: Tim Allison
>Priority: Major
> Attachments: PDFBOX-5415-TIKA-3718-p10.pdf
>
>
> [~DavidAvant] reported an infinite loop in Tika and provided an example file. 
>  I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's 
> ExtractText.
> File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf
> Adobe and a slightly out of date pdftotext also have problems with this file.






[jira] [Commented] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17521382#comment-17521382
 ] 

Tim Allison commented on PDFBOX-5415:
-

Michael Demey's diagnosis: 
https://twitter.com/MyMilkedEek/status/1513990823511273472?s=20

> Infinite loop in ExtractText in 2.x branch on a specific pdf
> 
>
> Key: PDFBOX-5415
> URL: https://issues.apache.org/jira/browse/PDFBOX-5415
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26
>Reporter: Tim Allison
>Priority: Major
> Attachments: PDFBOX-5415-TIKA-3718-p10.pdf
>
>
> [~DavidAvant] reported an infinite loop in Tika and provided an example file. 
>  I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's 
> ExtractText.
> File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf
> Adobe and a slightly out of date pdftotext also have problems with this file.






[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5415:

Affects Version/s: 2.0.26

> Infinite loop in ExtractText in 2.x branch on a specific pdf
> 
>
> Key: PDFBOX-5415
> URL: https://issues.apache.org/jira/browse/PDFBOX-5415
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.26
>Reporter: Tim Allison
>Priority: Major
>
> [~DavidAvant] reported an infinite loop in Tika and provided an example file. 
>  I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's 
> ExtractText.
> File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf
> Adobe and a slightly out of date pdftotext also have problems with this file.






[jira] [Updated] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5415:

Component/s: Parsing

> Infinite loop in ExtractText in 2.x branch on a specific pdf
> 
>
> Key: PDFBOX-5415
> URL: https://issues.apache.org/jira/browse/PDFBOX-5415
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26
>Reporter: Tim Allison
>Priority: Major
>
> [~DavidAvant] reported an infinite loop in Tika and provided an example file. 
>  I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's 
> ExtractText.
> File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf
> Adobe and a slightly out of date pdftotext also have problems with this file.






[jira] [Created] (PDFBOX-5415) Infinite loop in ExtractText in 2.x branch on a specific pdf

2022-04-12 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5415:
---

 Summary: Infinite loop in ExtractText in 2.x branch on a specific 
pdf
 Key: PDFBOX-5415
 URL: https://issues.apache.org/jira/browse/PDFBOX-5415
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison


[~DavidAvant] reported an infinite loop in Tika and provided an example file.  
I can reproduce this with the latest PDFBox app 2.0.26-SNAPSHOT's ExtractText.

File: https://issues.apache.org/jira/secure/attachment/13042292/map.pdf

Adobe and a slightly out of date pdftotext also have problems with this file.
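Until the loop itself is fixed, a caller can at least bound how long it waits on a file like this. The following is a minimal sketch, assuming the extraction call is wrapped in a Callable; TimeLimitedTask and runWithTimeout are hypothetical names, not part of PDFBox or Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class TimeLimitedTask {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns the task's result, or throws TimeoutException if it is not done in time.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; a loop that never
            // checks the interrupt flag will keep spinning in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

This only stops the caller from waiting. A genuinely infinite loop keeps burning a core, so isolating the parse in a separate process is the more robust option.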






[jira] [Resolved] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-04-07 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved PDFBOX-5396.
-
Fix Version/s: 2.0.26
   Resolution: Fixed

> Add maven enforcer rule to ensure that JAVA_HOME is set
> ---
>
> Key: PDFBOX-5396
> URL: https://issues.apache.org/jira/browse/PDFBOX-5396
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.25
>Reporter: Tim Allison
>Priority: Trivial
> Fix For: 2.0.26
>
>
> I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
> module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
> the project build fails in fontbox without any meaningful indication as to 
> why, even with the -X option set in maven.
> {noformat}
> (default-compile) on project fontbox: Compilation failure -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
> (default-compile) on project fontbox: Compilation failure
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute
> {noformat}
> Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
> yes, I realize that it is set on most developers' systems. :D
> One solution would be to add this rule to the maven-enforcer-plugin 
> configuration in the parent pom:
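> {code:java}
> <requireEnvironmentVariable>
>   <variableName>JAVA_HOME</variableName>
>   <message>The JAVA_HOME environment variable must be set!</message>
> </requireEnvironmentVariable>
> {code}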
> If this is ok, I'll add this rule in 2.x and see if I get the same behavior 
> in trunk.
> Side note: This was probably the cause of: 
> https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
> other issues.






[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512474#comment-17512474
 ] 

Tim Allison commented on PDFBOX-5401:
-

bq. Hi, I didn't test these samples on PDFBOX 2.0

Sorry, my comment above was a finding, not a question.

> A carefully crafted pdf can trigger an infinite loop while parsing
> --
>
> Key: PDFBOX-5401
> URL: https://issues.apache.org/jira/browse/PDFBOX-5401
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 3.0.0 PDFBox
> Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic)
>Reporter: Xiaohan Zhang
>Priority: Major
> Attachments: verified.zip
>
>
> Hi, I found a crafted pdf that can trigger an infinite loop while parsing 
> using PDFBOX. I have tested on the latest commit of PDFBOX on Github.
>  
> This bug can be triggered by the following code.
> ```
> File ff = new File("path/to/the/sample");
> PDDocument document = Loader.loadPDF(ff);
> ```
>  
> I found that the root cause of this infinite loop resides in the while-loop 
> at line 321 of [COSParser.java|#L321]. When parsing the provided PDF files, 
> the variable $prev is never changed during this loop.
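The failure mode described above (a loop that follows a "previous" offset which never changes) can be guarded against by remembering the offsets already visited. This is a minimal sketch of that general technique, not the code in PDFBox's parser and not necessarily how it was fixed there. PrevChainWalker and the offset map are hypothetical stand-ins for the real xref data.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is not PDFBox's parser code.
final class PrevChainWalker {

    /**
     * Follows a chain of "previous section" offsets starting at startOffset.
     * prevByOffset maps a section offset to the offset it points back to;
     * a missing entry or a non-positive value ends the chain.
     * Throws IllegalStateException if the same offset is reached twice.
     */
    static List<Long> walk(long startOffset, Map<Long, Long> prevByOffset) {
        List<Long> visitedInOrder = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        long prev = startOffset;
        while (prev > 0) {
            // Without this check, a section pointing at itself (or at an
            // earlier section) would keep the loop running forever.
            if (!seen.add(prev)) {
                throw new IllegalStateException("cyclic chain at offset " + prev);
            }
            visitedInOrder.add(prev);
            Long next = prevByOffset.get(prev);
            prev = (next == null) ? 0 : next;
        }
        return visitedInOrder;
    }
}
```

Whether to throw, log and stop, or fall back to a brute-force scan when a cycle is found is a separate design decision. The sketch throws only to keep the example short.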






[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
 ] 

Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 4:38 PM:
---

I confirmed this behavior with the last 2.0.26-SNAPSHOT I used for regression 
tests (from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not 
cause problems for me).


was (Author: talli...@mitre.org):
Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests 
(from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not cause 
problems for me).

> A carefully crafted pdf can trigger an infinite loop while parsing
> --
>
> Key: PDFBOX-5401
> URL: https://issues.apache.org/jira/browse/PDFBOX-5401
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 3.0.0 PDFBox
> Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic)
>Reporter: Xiaohan Zhang
>Priority: Major
> Attachments: verified.zip
>
>
> Hi, I found a crafted pdf that can trigger an infinite loop while parsing 
> using PDFBOX. I have tested on the latest commit of PDFBOX on Github.
>  
> This bug can be triggered by the following code.
> ```
> File ff = new File("path/to/the/sample");
> PDDocument document = Loader.loadPDF(ff);
> ```
>  
> I found that the root cause of this infinite loop resides in the while-loop 
> at line 321 of [COSParser.java|#L321]. When parsing the provided PDF files, 
> the variable $prev is never changed during this loop.






[jira] [Comment Edited] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
 ] 

Tim Allison edited comment on PDFBOX-5401 at 3/25/22, 2:07 PM:
---

Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests 
(from earlier this week?) with 3 of the 4 files ({{bda2803...}} does not cause 
problems for me).


was (Author: talli...@mitre.org):
Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests 
with 3 of the 4 files ({{bda2803...}} does not cause problems for me).

> A carefully crafted pdf can trigger an infinite loop while parsing
> --
>
> Key: PDFBOX-5401
> URL: https://issues.apache.org/jira/browse/PDFBOX-5401
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 3.0.0 PDFBox
> Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic)
>Reporter: Xiaohan Zhang
>Priority: Major
> Attachments: verified.zip
>
>
> Hi, I found a crafted pdf that can trigger an infinite loop while parsing 
> using PDFBOX. I have tested on the latest commit of PDFBOX on Github.
>  
> This bug can be triggered by the following code.
> ```
> File ff = new File("path/to/the/sample");
> PDDocument document = Loader.loadPDF(ff);
> ```
>  
> I found that the root cause of this infinite loop resides in the while-loop 
> at line 321 of [COSParser.java|#L321]. When parsing the provided PDF files, 
> the variable $prev is never changed during this loop.






[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512397#comment-17512397
 ] 

Tim Allison commented on PDFBOX-5401:
-

Can confirm behavior with the last 2.0.26-SNAPSHOT I used for regression tests 
with 3 of the 4 files ({{bda2803...}} does not cause problems for me).

> A carefully crafted pdf can trigger an infinite loop while parsing
> --
>
> Key: PDFBOX-5401
> URL: https://issues.apache.org/jira/browse/PDFBOX-5401
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 3.0.0 PDFBox
> Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic)
>Reporter: Xiaohan Zhang
>Priority: Major
> Attachments: verified.zip
>
>
> Hi, I found a crafted pdf that can trigger an infinite loop while parsing 
> using PDFBOX. I have tested on the latest commit of PDFBOX on Github.
>  
> This bug can be triggered by the following code.
> ```
> File ff = new File("path/to/the/sample");
> PDDocument document = Loader.loadPDF(ff);
> ```
>  
> I found that the root cause of this infinite loop resides in the while-loop 
> at line 321 of [COSParser.java|#L321]. When parsing the provided PDF files, 
> the variable $prev is never changed during this loop.






[jira] [Commented] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509892#comment-17509892
 ] 

Tim Allison commented on PDFBOX-5396:
-

This is not a problem in trunk.

> Add maven enforcer rule to ensure that JAVA_HOME is set
> ---
>
> Key: PDFBOX-5396
> URL: https://issues.apache.org/jira/browse/PDFBOX-5396
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.25
>Reporter: Tim Allison
>Priority: Trivial
>
> I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
> module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
> the project build fails in fontbox without any meaningful indication as to 
> why, even with the -X option set in maven.
> {noformat}
> (default-compile) on project fontbox: Compilation failure -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
> (default-compile) on project fontbox: Compilation failure
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute
> {noformat}
> Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
> yes, I realize that it is set on most developers' systems. :D
> One solution would be to add this rule to the maven-enforcer-plugin 
> configuration in the parent pom:
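> {code:java}
> <requireEnvironmentVariable>
>   <variableName>JAVA_HOME</variableName>
>   <message>The JAVA_HOME environment variable must be set!</message>
> </requireEnvironmentVariable>
> {code}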
> If this is ok, I'll add this rule in 2.x and see if I get the same behavior 
> in trunk.
> Side note: This was probably the cause of: 
> https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
> other issues.






[jira] [Updated] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5396:

Description: 
I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
the project build fails in fontbox without any meaningful indication as to why, 
even with the -X option set in maven.

{noformat}
(default-compile) on project fontbox: Compilation failure -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
(default-compile) on project fontbox: Compilation failure
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
{noformat}

Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
yes, I realize that it is set on most developers' systems. :D

One solution would be to add this rule to the maven-enforcer-plugin 
configuration in the parent pom:
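{code:java}
<requireEnvironmentVariable>
  <variableName>JAVA_HOME</variableName>
  <message>The JAVA_HOME environment variable must be set!</message>
</requireEnvironmentVariable>
{code}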

If this is ok, I'll add this rule in 2.x and see if I get the same behavior in 
trunk.

Side note: This was probably the cause of: 
https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
other issues.

  was:
I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
the project build fails in fontbox without any meaningful indication as to why, 
even with the -X option set in maven.

{noformat}
(default-compile) on project fontbox: Compilation failure -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
(default-compile) on project fontbox: Compilation failure
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
{noformat}

Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
yes, I realize that it is set on most developers' systems. :D

One solution would be to add this rule to the maven-enforcer-plugin 
configuration in the parent pom:
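{code:java}
<requireEnvironmentVariable>
  <variableName>JAVA_HOME</variableName>
  <message>The JAVA_HOME environment variable must be set!</message>
</requireEnvironmentVariable>
{code}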

If this is ok, I'll add this rule in 2.x and see if I get the same behavior in 
trunk.

Side note: This was probably the cause of: 
https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
other issues.


> Add maven enforcer rule to ensure that JAVA_HOME is set
> ---
>
> Key: PDFBOX-5396
> URL: https://issues.apache.org/jira/browse/PDFBOX-5396
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.25
>Reporter: Tim Allison
>Priority: Trivial
>
> I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
> module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
> the project build fails in fontbox without any meaningful indication as to 
> why, even with the -X option set in maven.
> {noformat}
> (default-compile) on project fontbox: Compilation failure -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
> (default-compile) on project fontbox: Compilation failure
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute
> {noformat}
> Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
> yes, I realize that it is set on most developers' systems. :D
> One solution would be to add this rule to the maven-enforcer-plugin 
> configuration in the parent pom:
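> {code:java}
> <requireEnvironmentVariable>
>   <variableName>JAVA_HOME</variableName>
>   <message>The JAVA_HOME environment variable must be set!</message>
> </requireEnvironmentVariable>
> {code}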
> If this is ok, I'll add this rule in 2.x and see if I get the same behavior 
> in trunk.
> Side note: This was probably the cause of: 
> https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
> other issues.






[jira] [Created] (PDFBOX-5396) Add maven enforcer rule to ensure that JAVA_HOME is set

2022-03-21 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5396:
---

 Summary: Add maven enforcer rule to ensure that JAVA_HOME is set
 Key: PDFBOX-5396
 URL: https://issues.apache.org/jira/browse/PDFBOX-5396
 Project: PDFBox
  Issue Type: Task
Affects Versions: 2.0.25
Reporter: Tim Allison


I recently stubbed my toe on this one again.  At least in the 2.x branch, the 
module fontbox requires that the JAVA_HOME variable be set.  If it isn't set, 
the project build fails in fontbox without any meaningful indication as to why, 
even with the -X option set in maven.

{noformat}
(default-compile) on project fontbox: Compilation failure -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
execute goal org.apache.maven.plugins:maven-compiler-plugin:3.6.0:compile
(default-compile) on project fontbox: Compilation failure
at org.apache.maven.lifecycle.internal.MojoExecutor.execute
{noformat}

Also, on our website, there's no mention that JAVA_HOME should be set.  And, 
yes, I realize that it is set on most developers' systems. :D

One solution would be to add this rule to the maven-enforcer-plugin 
configuration in the parent pom:
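{code:java}
<requireEnvironmentVariable>
  <variableName>JAVA_HOME</variableName>
  <message>The JAVA_HOME environment variable must be set!</message>
</requireEnvironmentVariable>
{code}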

If this is ok, I'll add this rule in 2.x and see if I get the same behavior in 
trunk.

Side note: This was probably the cause of: 
https://www.mail-archive.com/users@pdfbox.apache.org/msg11423.html and a few 
other issues.
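For anyone who wants the same fail-fast behavior outside of Maven (in a build script or test harness, say), the intent of the proposed rule can be sketched in plain Java. EnvCheck is a hypothetical name, and treating a blank value as unset is an assumption of this sketch, not a statement about how the enforcer plugin behaves.

```java
import java.util.Map;

// Hypothetical helper mirroring the intent of the proposed enforcer rule.
final class EnvCheck {

    /**
     * Throws IllegalStateException with the given message if the named
     * variable is missing from env. Assumption: a blank value counts as unset.
     */
    static void requireEnvironmentVariable(Map<String, String> env, String name, String message) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException(message);
        }
    }
}
```

Typical usage would be EnvCheck.requireEnvironmentVariable(System.getenv(), "JAVA_HOME", "The JAVA_HOME environment variable must be set!"), which turns the opaque compilation failure into an explicit message.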






[jira] [Created] (PDFBOX-5358) Add support for UTF-8 in strings

2022-01-06 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5358:
---

 Summary: Add support for UTF-8 in strings
 Key: PDFBOX-5358
 URL: https://issues.apache.org/jira/browse/PDFBOX-5358
 Project: PDFBox
  Issue Type: Improvement
Reporter: Tim Allison
 Attachments: Screen Shot 2022-01-06 at 9.18.09 AM.png

Peter Wyatt recently published an article on UTF-8 strings in PDF 2.0: 
[https://www.pdfa.org/understanding-utf-8-in-pdf-2-0/]

The article includes a link to a test file he created: 
[https://github.com/pdf-association/pdf20examples/blob/master/pdf20-utf8-test.pdf]
 

Our debugger shows that we may need to add support for this (see attached).  
This was with PDFBox 2.0.25.  I didn't have a chance to test with 3.x or the 
2.x snapshot.

I don't think we're necessarily covering all the changes yet in PDF 2.0, but I 
thought I'd open this issue for at least discussion.
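To make the discussion concrete: in PDF 2.0 a text string that starts with the bytes EF BB BF is UTF-8, alongside the existing FE FF marker for UTF-16BE. The following is a rough sketch of that detection, not PDFBox code. PdfTextStringSketch is a hypothetical name, and ISO-8859-1 is used only as a stand-in for PDFDocEncoding, which differs for some byte values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of byte-order-mark detection for PDF text strings.
final class PdfTextStringSketch {

    /** Decodes the raw bytes of a PDF text string based on its leading marker. */
    static String decode(byte[] bytes) {
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
            // PDF 2.0: UTF-8 text string, marker is skipped
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        if (startsWith(bytes, 0xFE, 0xFF)) {
            // UTF-16BE text string, marker is skipped
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding for this sketch only.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Returns true if bytes begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The order of the checks does not matter here because the two markers cannot both match the same input.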






[jira] [Commented] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326042#comment-17326042
 ] 

Tim Allison commented on PDFBOX-5164:
-

Thank you, [~tilman]!

> Create portable collection PDF
> --
>
> Key: PDFBOX-5164
> URL: https://issues.apache.org/jira/browse/PDFBOX-5164
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.18
> Environment: java
>Reporter: zhouxiaolong
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: CreatePortableCollection.java, MakePackage.java, 
> PortableCollection.pdf, collection.pdf, image-2021-04-15-16-02-42-451.png, 
> screenshot-1.png, tika-output.json, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325972#comment-17325972
 ] 

Tim Allison commented on PDFBOX-5164:
-

Sorry to hijack this, but I wanted to confirm with [~zxltmj]...is this the 
output that you'd expect?  This is the recursive parser wrapper from Tika, 
which uses PDFBox.  I just want to confirm that we don't have to do anything 
else to handle portable collections.

> Create portable collection PDF
> --
>
> Key: PDFBOX-5164
> URL: https://issues.apache.org/jira/browse/PDFBOX-5164
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.18
> Environment: java
>Reporter: zhouxiaolong
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: CreatePortableCollection.java, MakePackage.java, 
> PortableCollection.pdf, collection.pdf, image-2021-04-15-16-02-42-451.png, 
> screenshot-1.png, tika-output.json, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!






[jira] [Updated] (PDFBOX-5164) Create portable collection PDF

2021-04-20 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5164:

Attachment: tika-output.json

> Create portable collection PDF
> --
>
> Key: PDFBOX-5164
> URL: https://issues.apache.org/jira/browse/PDFBOX-5164
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.18
> Environment: java
>Reporter: zhouxiaolong
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: CreatePortableCollection.java, MakePackage.java, 
> PortableCollection.pdf, collection.pdf, image-2021-04-15-16-02-42-451.png, 
> screenshot-1.png, tika-output.json, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!






[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324082#comment-17324082
 ] 

Tim Allison commented on PDFBOX-5166:
-

Ha, @bitsgalore has an example of subtype=Screen.  Yay! 

https://twitter.com/_tallison/status/1383164998629924870?s=20

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
>Reporter: Tim Allison
>Priority: Minor
>  Labels: Annotations
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324048#comment-17324048
 ] 

Tim Allison commented on PDFBOX-5166:
-

Are those also streams in subtype=RichMedia or do we need to look for other 
subtypes?

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
>Reporter: Tim Allison
>Priority: Minor
>  Labels: Annotations
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002
 ] 

Tim Allison edited comment on PDFBOX-5166 at 4/16/21, 6:07 PM:
---

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that it would be better 
to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?


was (Author: talli...@mitre.org):
Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that'd it would be 
better to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002
 ] 

Tim Allison commented on PDFBOX-5166:
-

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that it would be 
better to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5166:

Issue Type: New Feature  (was: Task)

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Comment Edited] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831
 ] 

Tim Allison edited comment on PDFBOX-5165 at 4/16/21, 1:52 PM:
---

Thank you for the quick fix!

Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.


was (Author: talli...@mitre.org):
Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.

> Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in 
> JempBox
> ---
>
> Key: PDFBOX-5165
> URL: https://issues.apache.org/jira/browse/PDFBOX-5165
> Project: PDFBox
>  Issue Type: Task
>  Components: JempBox
>Affects Versions: 1.8.16
>Reporter: Tim Allison
>Assignee: Tilman Hausherr
>Priority: Trivial
>  Labels: optimization
> Fix For: 1.8.17
>
>







[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831
 ] 

Tim Allison commented on PDFBOX-5165:
-

Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.

> Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in 
> JempBox
> ---
>
> Key: PDFBOX-5165
> URL: https://issues.apache.org/jira/browse/PDFBOX-5165
> Project: PDFBox
>  Issue Type: Task
>  Components: JempBox
>Affects Versions: 1.8.16
>Reporter: Tim Allison
>Assignee: Tilman Hausherr
>Priority: Trivial
>  Labels: optimization
> Fix For: 1.8.17
>
>







[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5166:

Priority: Minor  (was: Major)

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323809#comment-17323809
 ] 

Tim Allison commented on PDFBOX-5166:
-

Completely unsurprisingly, [~tilman] has already shown how to extract these 
files on SO: 
https://stackoverflow.com/questions/45460027/what-is-the-best-way-to-extract-embedded-flash-file-from-a-pdf-using-the-pdfbox

If this is a "not going to fix", no problem!  I'm happy to put that code into 
Tika for now, and if a RichMedia annotation gets implemented in PDFBox, I can 
update our code accordingly.

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Created] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5166:
---

 Summary: Implement RichMedia annotation
 Key: PDFBOX-5166
 URL: https://issues.apache.org/jira/browse/PDFBOX-5166
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison
 Attachments: testFlashInPDF.pdf

See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
currently extracting the embedded file.

In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
COSDictionary, I can see the subtype is "RichMedia".  If someone has the time, 
it'd be great to implement this so that we can extract more attachments in 
Tika...  Obv, others may find use too. :D

Many thanks to Tyler Thorsted for the test file and many thanks to 
@terminalboredom and @beet_keeper.







[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322323#comment-17322323
 ] 

Tim Allison commented on PDFBOX-5165:
-

I realize that JempBox is outdated, but we're still using it in Tika.  I found 
a PDF with a large event list in the media management schema.  Calling 
getHistory() on it takes a couple of minutes. :(

Is there any simple fix available?

The XMP is in this file: 
http://corpora.tika.apache.org/base/docs/commoncrawl3_refetched/MR/MRWP762LL3DMIFWGZPVPZXJNFGUAGHML
 

I tried to zip the extracted xmp and attach it here with no luck.

> Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in 
> JempBox
> ---
>
> Key: PDFBOX-5165
> URL: https://issues.apache.org/jira/browse/PDFBOX-5165
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>







[jira] [Created] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement

2021-04-15 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5165:
---

 Summary: Exceedingly slow processing of XMPSchemaMediaManagement
 Key: PDFBOX-5165
 URL: https://issues.apache.org/jira/browse/PDFBOX-5165
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison









[jira] [Updated] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-15 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5165:

Summary: Exceedingly slow processing of XMPSchemaMediaManagement's 
getHistory in JempBox  (was: Exceedingly slow processing of 
XMPSchemaMediaManagement)

> Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in 
> JempBox
> ---
>
> Key: PDFBOX-5165
> URL: https://issues.apache.org/jira/browse/PDFBOX-5165
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>







[jira] [Comment Edited] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317514#comment-17317514
 ] 

Tim Allison edited comment on PDFBOX-5158 at 4/9/21, 1:36 PM:
--

Which in turn led me to find a bug in Tika's integration with 3.x: 
https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec

:D

Oh, dear, this bug is active in the wild in 1.26... TIKA-3350 :(

Three cheers for collaboration across projects!


was (Author: talli...@mitre.org):
Which in turn led me to find a bug in Tika's integration with 3.x: 
https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec

:D

> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> 
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Tim Allison
>Assignee: Tilman Hausherr
>Priority: Critical
> Fix For: 3.0.0 PDFBox
>
>
> I found a bunch of files that had a "read too many EOFs", which is a safety 
> check we now do in TikaInputStream to identify parsers that read an EOF > 
> 1000 times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}






[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317514#comment-17317514
 ] 

Tim Allison commented on PDFBOX-5158:
-

Which in turn led me to find a bug in Tika's integration with 3.x: 
https://github.com/apache/tika/commit/f336c599a5536c7d3a7c0a0c94c71c8b695832ec

:D

> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> 
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I found a bunch of files that had a "read too many EOFs", which is a safety 
> check we now do in TikaInputStream to identify parsers that read an EOF > 
> 1000 times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}






[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317509#comment-17317509
 ] 

Tim Allison commented on PDFBOX-5158:
-

Yes, I get your stack trace when loading from a File, but I get an infinite 
loop when loading from an InputStream.

{noformat}
// Loading from an InputStream instead of a File loops forever on this truncated PDF
Path path = Paths.get("OELHPKYAQPDNDWC535NE23Z6FKYRMN7W.pdf");
try (InputStream is = Files.newInputStream(path)) {
    Loader.loadPDF(is);
}
{noformat}
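
One way to probe for this kind of hang without blocking a test run forever is 
to run the load on a worker thread and wait with a deadline.  The following is 
a minimal sketch using only the JDK; HangProbe and finishesWithin are 
hypothetical names, not part of PDFBox or Tika, and the task passed in stands 
in for the Loader.loadPDF(is) call above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a parse-like task and reports whether it terminated in time. */
final class HangProbe {

    /** Returns true if the task completed (normally or with an exception) within the limit. */
    static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // the task failed, but it did terminate
        } catch (TimeoutException e) {
            future.cancel(true); // only helps if the task reacts to interruption
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

For the reproduction above, the task would wrap the try block, and the limit 
would be whatever the regression run tolerates.  Note that cancel(true) only 
interrupts the worker; a loop that never checks for interruption keeps 
spinning on its daemon thread until the JVM exits.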


> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> 
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I found a bunch of files that had a "read too many EOFs", which is a safety 
> check we now do in TikaInputStream to identify parsers that read an EOF > 
> 1000 times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}






[jira] [Commented] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317499#comment-17317499
 ] 

Tim Allison commented on PDFBOX-5158:
-

Hmmm...will try to replicate with pure PDFBox.  Thank you!

> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> 
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I found a bunch of files that had a "read too many EOFs", which is a safety 
> check we now do in TikaInputStream to identify parsers that read an EOF > 
> 1000 times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}






[jira] [Updated] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-08 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5158:

Description: 
I found a bunch of files that had a "read too many EOFs", which is a safety 
check we now do in TikaInputStream to identify parsers that read an EOF > 1000 
times, which may be a sign of an infinite loop.

When I turn off this safety check in TikaInputStream, I get an infinite loop.

This is one of the triggering files: 
https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W

It's a truncated file from Common Crawl.

The stacktrace when this is thrown is:
{noformat}
afterRead:809, TikaInputStream (org.apache.tika.io)
read:82, ProxyInputStream (org.apache.commons.io.input)
<init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
loadPDF:454, Loader (org.apache.pdfbox)
loadPDF:430, Loader (org.apache.pdfbox)
getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
parse:148, PDFParser (org.apache.tika.parser.pdf)
parse:288, CompositeParser (org.apache.tika.parser)
parse:288, CompositeParser (org.apache.tika.parser)
parse:150, AutoDetectParser (org.apache.tika.parser)
parse:157, RecursiveParserWrapper (org.apache.tika.parser)
getRecursiveMetadata:379, TikaTest (org.apache.tika)
getRecursiveMetadata:369, TikaTest (org.apache.tika)
getRecursiveMetadata:357, TikaTest (org.apache.tika)
getRecursiveMetadata:351, TikaTest (org.apache.tika)


{noformat}


  was:
I found a bunch of files that had a "read too many EOFs", which is a safety 
check we now do in TikaInputStream to identify parsers that read an EOF > 1000 
times, which may be a sign of an infinite loop.

When I turn off this safety check in TikaInputStream, I get an infinite loop.

This is one of the triggering files: 
https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W

It's a truncated file from Common Crawl.

The stacktrace when this is thrown is:
{noformat}
afterRead:809, TikaInputStream (org.apache.tika.io)
read:82, ProxyInputStream (org.apache.commons.io.input)
<init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
loadPDF:454, Loader (org.apache.pdfbox)
loadPDF:430, Loader (org.apache.pdfbox)
getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
parse:148, PDFParser (org.apache.tika.parser.pdf)
parse:288, CompositeParser (org.apache.tika.parser)
parse:288, CompositeParser (org.apache.tika.parser)
parse:150, AutoDetectParser (org.apache.tika.parser)
parse:157, RecursiveParserWrapper (org.apache.tika.parser)
getRecursiveMetadata:379, TikaTest (org.apache.tika)
getRecursiveMetadata:369, TikaTest (org.apache.tika)
getRecursiveMetadata:357, TikaTest (org.apache.tika)
getRecursiveMetadata:351, TikaTest (org.apache.tika)


{noformat}

The stack 


> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> 
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> I found a bunch of files that had a "read too many EOFs", which is a safety 
> check we now do in TikaInputStream to identify parsers that read an EOF > 
> 1000 times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files: 
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}






[jira] [Created] (PDFBOX-5158) Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT

2021-04-08 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5158:
---

 Summary: Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
 Key: PDFBOX-5158
 URL: https://issues.apache.org/jira/browse/PDFBOX-5158
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison


I found a bunch of files that had a "read too many EOFs", which is a safety 
check we now do in TikaInputStream to identify parsers that read an EOF > 1000 
times, which may be a sign of an infinite loop.

When I turn off this safety check in TikaInputStream, I get an infinite loop.
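
For readers unfamiliar with that safety check, the idea can be sketched 
roughly as follows.  This is an illustration under assumptions, not Tika's 
actual TikaInputStream code: the class name EofCountingInputStream and its 
configurable limit are made up here; only the idea of counting EOF reads and 
the afterRead name (visible in the stack trace below) come from this report.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative guard: fails once EOF (-1) has been handed to the caller too many times. */
final class EofCountingInputStream extends FilterInputStream {
    private final int maxEofReads;
    private int eofReads;

    EofCountingInputStream(InputStream in, int maxEofReads) {
        super(in);
        this.maxEofReads = maxEofReads;
    }

    @Override
    public int read() throws IOException {
        return afterRead(super.read());
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        // FilterInputStream.read(byte[]) delegates here, so it is covered as well
        return afterRead(super.read(b, off, len));
    }

    /** Counts every -1 returned and aborts once the limit is exceeded. */
    private int afterRead(int n) throws IOException {
        if (n == -1 && ++eofReads > maxEofReads) {
            throw new IOException("Read too many -1 (EOFs); possible infinite loop in the caller");
        }
        return n;
    }

    /** Number of EOFs seen so far, exposed for diagnostics. */
    int eofReads() {
        return eofReads;
    }
}
```

A well-behaved parser sees -1 once and stops, so a limit as generous as 1000 
should only trip for a caller that keeps polling an exhausted stream.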

This is one of the triggering files: 
https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W

It's a truncated file from Common Crawl.

The stacktrace when this is thrown is:
{noformat}
afterRead:809, TikaInputStream (org.apache.tika.io)
read:82, ProxyInputStream (org.apache.commons.io.input)
<init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
loadPDF:454, Loader (org.apache.pdfbox)
loadPDF:430, Loader (org.apache.pdfbox)
getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
parse:148, PDFParser (org.apache.tika.parser.pdf)
parse:288, CompositeParser (org.apache.tika.parser)
parse:288, CompositeParser (org.apache.tika.parser)
parse:150, AutoDetectParser (org.apache.tika.parser)
parse:157, RecursiveParserWrapper (org.apache.tika.parser)
getRecursiveMetadata:379, TikaTest (org.apache.tika)
getRecursiveMetadata:369, TikaTest (org.apache.tika)
getRecursiveMetadata:357, TikaTest (org.apache.tika)
getRecursiveMetadata:351, TikaTest (org.apache.tika)


{noformat}

The stack 






[jira] [Created] (PDFBOX-5153) New flatefilter exception on Tika unit test files with 3.0.0-RC1

2021-04-06 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5153:
---

 Summary: New flatefilter exception on Tika unit test files with 
3.0.0-RC1
 Key: PDFBOX-5153
 URL: https://issues.apache.org/jira/browse/PDFBOX-5153
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison


On TIKA-3347, we're integrating PDFBox 3.0.0-RC1.  We're getting new flate 
filter exceptions on a set of files that I _think_ I created with PDFBox a 
while ago.

Looks like we're also getting xref exceptions.

I would not be surprised in the least to learn that I did something wrong in 
the creation of these files and that they are corrupt!

I can replicate this issue with {{java -jar pdfbox-app-3.0.0-RC1.jar 
export:text}}

{noformat}
SEVERE: FlateFilter: stop reading corrupt stream due to a DataFormatException
Error extracting text for document [IOException]: 
java.util.zip.DataFormatException: invalid block type
{noformat}
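
As background on where that message comes from: java.util.zip.Inflater throws 
DataFormatException when the deflate data is malformed.  Below is a minimal 
JDK-only sketch, assuming a hypothetical helper class FlateProbe (this is not 
PDFBox's FlateFilter), that gives up on a corrupt or truncated stream instead 
of failing hard or spinning.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper showing how corrupt flate data surfaces in the JDK. */
final class FlateProbe {

    /** Inflates zlib-wrapped data; returns null if the stream is corrupt or truncated. */
    static byte[] inflateOrNull(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // truncated stream: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // malformed data, e.g. an invalid block type
        } finally {
            inflater.end();
        }
    }

    /** Compresses data so the sketch can demonstrate a valid round trip. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Whether the test files here are genuinely corrupt or whether 3.0.0-RC1 reads 
them differently is exactly the open question; the sketch only shows the JDK 
behaviour behind the log line.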

One of the files: 
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_no_extract_yes_accessibility_owner_user.pdf
 






[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303522#comment-17303522
 ] 

Tim Allison commented on PDFBOX-5128:
-

The process hasn't finished, but I'm dumping the files here:

[https://corpora.tika.apache.org/base/xmps/]

I'm roughly binning them by the file type of the container file, including: 
[https://corpora.tika.apache.org/base/xmps/pdf/] 

 

Let me know if I can do any processing on these or if I botched the extraction.

 

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Comment Edited] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303391#comment-17303391
 ] 

Tim Allison edited comment on PDFBOX-5128 at 3/17/21, 1:01 PM:
---

Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that 
Oracle Outside In (at least back in 2011) didn't include a closing packet – 
PDFBOX-1192

!image-2021-03-17-09-00-57-653.png!


was (Author: talli...@mitre.org):
Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that 
Oracle Outsid !image-2021-03-17-09-00-57-653.png! e In (at least back in 2011) 
didn't include a closing packet – PDFBOX-1192

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Updated] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5128:

Attachment: image-2021-03-17-09-00-57-653.png

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-17 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303391#comment-17303391
 ] 

Tim Allison commented on PDFBOX-5128:
-

Side note...I'm looking at the EOFs for my xmp byte scanner, and I notice that 
Oracle Outside In (at least back in 2011) didn't include a closing packet – 
PDFBOX-1192

!image-2021-03-17-09-00-57-653.png!
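
The byte scanner mentioned here is not attached.  As a rough sketch of the 
idea, assuming the XMP is stored in an 8-bit encoding and that a packet with 
no trailer should fall back to the closing </x:xmpmeta> tag or, failing that, 
the end of the data.  XmpPacketScanner and the fallback policy are assumptions 
for illustration, not the code that was actually run:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical scanner that cuts the first XMP packet out of arbitrary bytes. */
final class XmpPacketScanner {
    private static final byte[] BEGIN = ascii("<?xpacket begin");
    private static final byte[] END = ascii("<?xpacket end");
    private static final byte[] PI_CLOSE = ascii("?>");
    private static final byte[] META_CLOSE = ascii("</x:xmpmeta>");

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns the first packet's bytes, or null if no packet header is present. */
    static byte[] firstPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) {
            return null;
        }
        int end = indexOf(data, END, start + BEGIN.length);
        if (end >= 0) {
            int close = indexOf(data, PI_CLOSE, end + END.length);
            if (close >= 0) {
                return Arrays.copyOfRange(data, start, close + PI_CLOSE.length);
            }
        }
        // No usable trailer: fall back to the closing xmpmeta tag, then to EOF.
        int meta = indexOf(data, META_CLOSE, start);
        int stop = meta >= 0 ? meta + META_CLOSE.length : data.length;
        return Arrays.copyOfRange(data, start, stop);
    }

    /** Naive byte-pattern search; returns -1 when the needle is absent. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

With a scanner like this, a file lacking the closing packet still yields a 
usable XMP block rather than running to the end of the container file.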

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip, image-2021-03-17-09-00-57-653.png
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302946#comment-17302946
 ] 

Tim Allison commented on PDFBOX-5128:
-

[~msahyoun] ... does the attached look about right?  If so, I'll run against 
our full corpus and mirror the directory structure.

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Updated] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5128:

Attachment: PDFBOX.zip

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
> Attachments: PDFBOX.zip
>
>
> XMP currently only supports parsing known XMP schema as has been discussed. 
> That shall be extended to support arbitrary but valid  XMP.






[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302784#comment-17302784
 ] 

Tim Allison commented on PDFBOX-5133:
-

+1 that's how I got the rest of the build to work on Ubuntu.  Thank you!

> Failing testFlattenPDFBox2469Filled on Ubuntu 
> --
>
> Key: PDFBOX-5133
> URL: https://issues.apache.org/jira/browse/PDFBOX-5133
> Project: PDFBox
>  Issue Type: Task
>  Components: AcroForm
>Affects Versions: 2.0.22
>Reporter: Tim Allison
>Assignee: Tilman Hausherr
>Priority: Trivial
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: in-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf, 
> out-testPDF_acroForm.pdf-7.png, out-testPDF_acroForm.pdf-7.png-diff.png
>
>
> I tried to build the 2.0.23 candidate, but I got a test failure on the above 
> test.  This isn't worth respinning another candidate, but how can I help fix 
> this?
>  
> {noformat}
> Distributor ID:   Ubuntu
> Description:  Ubuntu 20.04.2 LTS
> Release:  20.04
> Codename: focal
> {noformat}
>  
> {noformat}
> openjdk version "1.8.0_282"
> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)
> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)
>  {noformat}
>  
> {noformat}
> Files differ: 
> /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png
>   
> /home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png
>  {noformat}






[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302700#comment-17302700
 ] 

Tim Allison commented on PDFBOX-5133:
-

[~msahyoun] the build failed on Ubuntu.  I had no problems with OpenJDK 11 on 
my Mac:

openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: out-testPDF_acroForm.pdf-7.png-diff.png
out-testPDF_acroForm.pdf-7.png







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: out-testPDF_acroForm.pdf







[jira] [Commented] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302596#comment-17302596
 ] 

Tim Allison commented on PDFBOX-5133:
-

I _think_ I attached the right files to help with diagnosis.  Please let me 
know if there's anything else I can do.  Thank you!







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: in-testPDF_acroForm.pdf-7.png







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: (was: image-2021-03-16-10-57-14-639.png)







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: (was: image-2021-03-16-10-57-14-489.png)







[jira] [Updated] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5133:

Attachment: (was: testPDF_acroForm.pdf-7.png)







[jira] [Created] (PDFBOX-5133) Failing testFlattenPDFBox2469Filled on Ubuntu

2021-03-16 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5133:
---

 Summary: Failing testFlattenPDFBox2469Filled on Ubuntu 
 Key: PDFBOX-5133
 URL: https://issues.apache.org/jira/browse/PDFBOX-5133
 Project: PDFBox
  Issue Type: Task
  Components: AcroForm
Reporter: Tim Allison


I tried to build the 2.0.23 candidate, but I got a test failure on the above 
test.  This isn't worth respinning another candidate, but how can I help fix 
this?

 
{noformat}
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
{noformat}
 
{noformat}
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)
 {noformat}
 
{noformat}
Files differ: 
/home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/in/testPDF_acroForm.pdf-7.png
  
/home/tallison/tools/pdfbox/pdfbox-2.0.23/pdfbox/target/test-output/flatten/out/testPDF_acroForm.pdf-7.png
 {noformat}
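
To help narrow down a "Files differ" failure like the one above, the two rendered PNGs could be compared pixel by pixel. The following is only a rough sketch and not the comparison the PDFBox test suite itself performs; the class name ImageDiff and the method countDifferingPixels are hypothetical names invented for illustration.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of PDFBox or its tests.
final class ImageDiff {
    private ImageDiff() {}

    // Counts pixels whose ARGB values differ between the two images.
    // Returns -1 when the images do not have the same dimensions.
    static long countDifferingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return -1;
        }
        long differing = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }
}
```

The two images could be loaded with javax.imageio.ImageIO.read(File) from the "in" and "out" paths listed above. A small count would suggest a platform-dependent rendering difference (fonts or anti-aliasing, for example) rather than a structural problem, though that is a guess until the diff image is inspected.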






[jira] [Commented] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300589#comment-17300589
 ] 

Tim Allison commented on PDFBOX-5127:
-

My personal preference would be to generate SimpleDateFormat objects as needed.  The 
good news either way (maybe?) is that this is in an exception-handling path, and 
I don't think I've seen it before, so it should be pretty rare?
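
As a rough illustration of the "generate as needed" option, here is a minimal sketch. It is not JempBox's DateConverter: the class name PerCallDateParser, the two patterns, and the UTC default are all assumptions made for the example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not the actual JempBox code.
final class PerCallDateParser {
    // Only immutable pattern strings are shared between threads.
    // The patterns themselves are assumptions chosen for the example.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd"
    };

    private PerCallDateParser() {}

    static Date parse(String text) throws ParseException {
        ParseException last = null;
        for (String pattern : PATTERNS) {
            // A fresh SimpleDateFormat per call keeps its mutable internal
            // Calendar confined to the calling thread.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            format.setLenient(false);
            try {
                return format.parse(text);
            } catch (ParseException e) {
                last = e;
            }
        }
        throw last != null ? last : new ParseException("Unparseable date: " + text, 0);
    }
}
```

The cost is one small allocation per pattern per call, which seems acceptable if this path is as rare as suspected. A ThreadLocal holding the formatters would be the other common choice.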

> Multithreading issue in JempBox's DateConverter
> ---
>
> Key: PDFBOX-5127
> URL: https://issues.apache.org/jira/browse/PDFBOX-5127
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> [~tilman] recently found an exception thrown from here 
> ([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186])
>  in one run of tika-eval but not in another. 
>  
> This is a multithreading issue caused by 
> [https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43]:
>  SimpleDateFormat is not threadsafe.  I'm surprised we haven't seen this 
> earlier, but so it goes.
>  
> Many, many thanks to Tilman for finding this!






[jira] [Commented] (PDFBOX-5128) Support parsing non standardized XMP

2021-03-12 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300365#comment-17300365
 ] 

Tim Allison commented on PDFBOX-5128:
-

I'll scrape the XMP out of our regression corpus. Should I retain the packet 
envelope?
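
For reference, a scrape along these lines could look roughly like the sketch below. It is only an illustration under several assumptions: the class name XmpPacketScraper and its method are hypothetical, it looks for the standard `<?xpacket begin ... ?>` and `<?xpacket end ... ?>` wrapper in raw bytes, and it only finds packets stored uncompressed in an ASCII-compatible encoding such as UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or XmpBox.
final class XmpPacketScraper {
    private static final String BEGIN = "<?xpacket begin";
    private static final String END = "<?xpacket end";

    private XmpPacketScraper() {}

    // Returns the first XMP packet found in data, or null if none is found.
    // With keepEnvelope set, the xpacket processing instructions are retained.
    static byte[] firstPacket(byte[] data, boolean keepEnvelope) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes
        // can be used directly as byte offsets into data.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int begin = text.indexOf(BEGIN);
        if (begin < 0) {
            return null;
        }
        int beginClose = text.indexOf("?>", begin);
        if (beginClose < 0) {
            return null;
        }
        int end = text.indexOf(END, beginClose);
        if (end < 0) {
            return null;
        }
        int endClose = text.indexOf("?>", end);
        if (endClose < 0) {
            return null;
        }
        int from = keepEnvelope ? begin : beginClose + 2;
        int to = keepEnvelope ? endClose + 2 : end;
        return Arrays.copyOfRange(data, from, to);
    }
}
```

Keeping the envelope preserves the original padding and the read/write marker in the end instruction, which may matter if the parser is expected to handle packets exactly as they appear in the wild.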

> Support parsing non standardized XMP 
> -
>
> Key: PDFBOX-5128
> URL: https://issues.apache.org/jira/browse/PDFBOX-5128
> Project: PDFBox
>  Issue Type: Task
>  Components: XmpBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
>
> XmpBox currently supports parsing only known XMP schemas, as has been discussed. 
> This shall be extended to support arbitrary but valid XMP.






[jira] [Created] (PDFBOX-5127) Multithreading issue in JempBox's DateConverter

2021-03-12 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5127:
---

 Summary: Multithreading issue in JempBox's DateConverter
 Key: PDFBOX-5127
 URL: https://issues.apache.org/jira/browse/PDFBOX-5127
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison


[~tilman] recently found an exception thrown from here 
([https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L186])
 in one run of tika-eval but not in another. 

 

This is a multithreading issue caused by 
[https://github.com/apache/pdfbox/blob/1.8/jempbox/src/main/java/org/apache/jempbox/impl/DateConverter.java#L43]:
 SimpleDateFormat is not threadsafe.  I'm surprised we haven't seen this 
earlier, but so it goes.

 

Many, many thanks to Tilman for finding this!






[jira] [Commented] (PDFBOX-3953) StackOverflowError in org.apache.pdfbox.pdmodel.PDPageTree.getKids

2020-11-04 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226417#comment-17226417
 ] 

Tim Allison commented on PDFBOX-3953:
-

Related?

> StackOverflowError in org.apache.pdfbox.pdmodel.PDPageTree.getKids
> --
>
> Key: PDFBOX-3953
> URL: https://issues.apache.org/jira/browse/PDFBOX-3953
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.7
>Reporter: Jorge Spinsanti
>Priority: Major
>
> I got an StackOverflowError in 
> org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:135)
> {code}
> java.lang.StackOverflowError
>   at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:135)
>   at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
>   at 
> org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169)
> ...
> {code}






[jira] [Created] (PDFBOX-5009) Corrupt PDF can lead to a StackOverflow

2020-11-04 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5009:
---

 Summary: Corrupt PDF can lead to a StackOverflow
 Key: PDFBOX-5009
 URL: https://issues.apache.org/jira/browse/PDFBOX-5009
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison


See TIKA-3224.  I confirmed this with 2.0.21 by calling the app's ExtractText 
on the file posted on the Tika issue.

cc [~dadoonet]






[jira] [Commented] (PDFBOX-4623) COSParser: Infinite recursion

2020-02-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037202#comment-17037202
 ] 

Tim Allison commented on PDFBOX-4623:
-

Adding a page tree infinite loop.
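
For illustration only, one general way to make a tree walk robust against a loop like this is to traverse iteratively and remember which nodes have been seen. The sketch below is not PDFBox's PDPageTree code and not a proposed patch; TreeNode and CycleSafeWalker are hypothetical stand-ins for the page tree structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a page tree node; not a PDFBox class.
final class TreeNode {
    final String name;
    final List<TreeNode> kids = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }
}

// Hypothetical walker for illustration.
final class CycleSafeWalker {
    private CycleSafeWalker() {}

    // Returns the names of leaf nodes in document order. A node that has
    // already been visited is skipped, so a loop in the tree cannot cause
    // unbounded recursion or an endless walk.
    static List<String> leaves(TreeNode root) {
        List<String> result = new ArrayList<>();
        // Identity-based set: nodes are compared by reference, not equals().
        Set<TreeNode> seen =
                Collections.newSetFromMap(new IdentityHashMap<TreeNode, Boolean>());
        // An explicit stack replaces recursion, so depth does not grow the call stack.
        Deque<TreeNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            TreeNode node = stack.pop();
            if (!seen.add(node)) {
                continue; // already visited: part of a loop or a shared kid
            }
            if (node.kids.isEmpty()) {
                result.add(node.name);
            }
            // Push kids in reverse so the first kid is processed first.
            for (int i = node.kids.size() - 1; i >= 0; i--) {
                stack.push(node.kids.get(i));
            }
        }
        return result;
    }
}
```

Whether a real parser should silently skip the repeated node, log a warning, or reject the file outright is a separate design decision that this sketch does not settle.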

> COSParser: Infinite recursion
> -
>
> Key: PDFBOX-4623
> URL: https://issues.apache.org/jira/browse/PDFBOX-4623
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.16
> Environment: java version "12" 2019-03-19
> Java(TM) SE Runtime Environment (build 12+33)
> Java HotSpot(TM) 64-Bit Server VM (build 12+33, mixed mode, sharing) 
> MacOS Mojave
>Reporter: Alex Rebert
>Priority: Minor
> Attachments: infinite-recursion.pdf, loop_in_page_tree.pdf
>
>
> Parsing an invalid PDF can lead to an infinite recursion in COSParser, which 
> results in a StackOverflowError.
> *Steps to repro*
>  # Download malformed PDF (attached)
>  # {{Run: java -jar pdfbox-app-2.0.16.jar ExtractText infinite-recursion.pdf}}
> *Stacktrace*
> {noformat}
> Exception in thread "main" java.lang.StackOverflowError [1005/1916]
>  at java.base/sun.nio.cs.UTF_8.updatePositions(UTF_8.java:79)
>  at java.base/sun.nio.cs.UTF_8$Decoder.xflow(UTF_8.java:210)
>  at java.base/sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(UTF_8.java:321)
>  at java.base/sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:414)
>  at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:578)
>  at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:801)
>  at org.apache.pdfbox.pdfparser.BaseParser.isValidUTF8(BaseParser.java:787)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:768)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:887)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:912)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  ...
> {noformat}
> The file was generated by fuzzing and is (probably) not a valid PDF file.
>  






[jira] [Comment Edited] (PDFBOX-4623) COSParser: Infinite recursion

2020-02-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037202#comment-17037202
 ] 

Tim Allison edited comment on PDFBOX-4623 at 2/14/20 6:51 PM:
--

Adding a page tree stackoverflow.


was (Author: talli...@mitre.org):
Adding a page tree infinite loop.







[jira] [Updated] (PDFBOX-4623) COSParser: Infinite recursion

2020-02-14 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-4623:

Attachment: loop_in_page_tree.pdf

> COSParser: Infinite recursion
> -
>
> Key: PDFBOX-4623
> URL: https://issues.apache.org/jira/browse/PDFBOX-4623
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.16
> Environment: java version "12" 2019-03-19
> Java(TM) SE Runtime Environment (build 12+33)
> Java HotSpot(TM) 64-Bit Server VM (build 12+33, mixed mode, sharing) 
> MacOS Mojave
>Reporter: Alex Rebert
>Priority: Minor
> Attachments: infinite-recursion.pdf, loop_in_page_tree.pdf
>
>
> Parsing an invalid PDF can lead to an infinite recursion in COSParser, which 
> results in a StackOverflowError.
> *Steps to repro*
>  # Download malformed PDF (attached)
>  # {{Run: java -jar pdfbox-app-2.0.16.jar ExtractText infinite-recursion.pdf}}
> *Stacktrace*
> {noformat}
> Exception in thread "main" java.lang.StackOverflowError [1005/1916]
>  at java.base/sun.nio.cs.UTF_8.updatePositions(UTF_8.java:79)
>  at java.base/sun.nio.cs.UTF_8$Decoder.xflow(UTF_8.java:210)
>  at java.base/sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(UTF_8.java:321)
>  at java.base/sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:414)
>  at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:578)
>  at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:801)
>  at org.apache.pdfbox.pdfparser.BaseParser.isValidUTF8(BaseParser.java:787)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseCOSName(BaseParser.java:768)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:887)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
>  at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
>  at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:912)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:920)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:881)
>  at 
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:801)
>  at org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1055)
>  at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:1114)
>  ...
> {noformat}
> The file was generated by fuzzing and is (probably) not a valid PDF file.
>  
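The repeating frames in the trace (parseCOSStream, getLength, parseObjectDynamically, parseFileObject, then parseCOSStream again) are the pattern one would expect when resolving a stream's length leads back to an object that is still being parsed. A minimal sketch of a re-entrancy guard for that kind of lookup follows. It is not PDFBox's code or its actual fix: the class name, the two maps, and the -1 return value are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of resolving an indirect length reference. */
final class LengthResolver {
    /** Object number -> object number its length refers to (illustrative model only). */
    private final Map<Integer, Integer> indirectLength;
    /** Object number -> plain numeric value. */
    private final Map<Integer, Long> directValue;
    /** Objects whose resolution is currently in progress. */
    private final Set<Integer> inProgress = new HashSet<>();

    LengthResolver(Map<Integer, Integer> indirectLength, Map<Integer, Long> directValue) {
        this.indirectLength = indirectLength;
        this.directValue = directValue;
    }

    /** Returns the resolved length, or -1 if the reference chain loops or dangles. */
    long resolve(int objNum) {
        Long direct = directValue.get(objNum);
        if (direct != null) {
            return direct;
        }
        Integer next = indirectLength.get(objNum);
        if (next == null) {
            return -1; // dangling reference
        }
        if (!inProgress.add(objNum)) {
            return -1; // already resolving this object: a cycle
        }
        try {
            return resolve(next);
        } finally {
            inProgress.remove(objNum);
        }
    }
}
```

With a guard like this, a self-referencing or mutually referencing length yields an error value instead of unbounded recursion; a real parser would more likely log a warning and fall back to scanning for the end of the stream.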






[jira] [Commented] (PDFBOX-4768) Unable to extract text from PDF

2020-02-07 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032556#comment-17032556
 ] 

Tim Allison commented on PDFBOX-4768:
-

To complement Tilman's points: qpdf complains about this file:

{noformat}
WARNING: kst-31430-3-b3_unextractable.pdf: file is damaged
WARNING: kst-31430-3-b3_unextractable.pdf (offset 638658): xref not found
WARNING: kst-31430-3-b3_unextractable.pdf: Attempting to reconstruct 
cross-reference table
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 214900): 
expected endstream
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): 
attempting to recover stream length
WARNING: kst-31430-3-b3_unextractable.pdf (object 123 0, offset 211564): 
recovered stream length: 13564
qpdf: operation succeeded with warnings; resulting file may have some problems
{noformat}

Tika's exception is:

{noformat}
Caused by: java.io.IOException: Unknown dir object c=')' cInt=41 peek=')' 
peekInt=41 at offset 8689
at 
org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:966)
at 
org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:636)
at 
org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:513)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
at 
org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
at 
org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
at 
org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
at 
org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
at 
org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:124)
{noformat}

 

> Unable to extract text from PDF
> ---
>
> Key: PDFBOX-4768
> URL: https://issues.apache.org/jira/browse/PDFBOX-4768
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.18
>Reporter: Jan Vlug
>Priority: Major
> Attachments: kst-31430-3-b3_unextractable.pdf
>
>
> I have a PDF document (see attachment) that can be viewed in Evince, but Tika 
> text extraction does not work. I think that this is due to a crash in PDFBox.
> I'm also a bit puzzled by the message: "You do not have permission to extract 
> text".
> Here is the output of the ExtractText command:
> {{java -jar pdfbox-app-2.0.19-20200206.060243-86.jar ExtractText 
> kst-31430-3-b3_unextractable.pdf tekst_jan.txt}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser 
> validateStreamLength}}
> {{WARNING: The end of the stream doesn't point to the correct offset, using 
> workaround to read the stream, stream start position: 211564, length: 3336, 
> expected end position: 214900}}
> {{Feb 07, 2020 11:03:15 AM org.apache.pdfbox.pdfparser.COSParser 
> parseCOSStream}}
> {{WARNING: stream ends with 'endobj' instead of 'endstream' at offset 225134}}
> {{Exception in thread "main" java.io.IOException: You do not have permission 
> to extract text}}
> {{ at 
> org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:223)}}
> {{ at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)}}
> {{ at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)}}
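On the puzzling permission message: as far as I recall, the ExtractText tool raises it when the document's current access permissions do not allow content extraction, which in an encrypted PDF is governed by a bit in the /P permissions integer. Whether that is what actually happened with this damaged file is not established here. The sketch below only illustrates the bit test; the class and method names are hypothetical and this is not PDFBox's implementation.

```java
/** Sketch of how a PDF /P permissions integer encodes the "extract content" right. */
final class PermissionBits {
    /**
     * Bit 5, counting from 1 at the low-order end, in the PDF specification's
     * user access permissions table: copy or otherwise extract text and graphics.
     */
    static final int EXTRACT_MASK = 1 << 4;

    /** Returns true if the given /P value grants content extraction. */
    static boolean canExtract(int p) {
        return (p & EXTRACT_MASK) != 0;
    }
}
```

A /P value with every bit set grants extraction, and clearing only that one bit withdraws it, so a single flipped or misread bit is enough to produce the message.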






[jira] [Commented] (PDFBOX-4737) Text extraction is gibberish

2020-01-15 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016209#comment-17016209
 ] 

Tim Allison commented on PDFBOX-4737:
-

The following reinforces points already made, I think.

>On the other hand of course a proper implementation of a strict mode will 
>require quite a lot of work

+1

> and a half-hearted implementation is worthless.

Indications of specific types of wonkiness – e.g. missing fonts, missing 
unicode mappings, missing/invalid xref, many other features – would be useful 
to some downstream processors, and if we did a "group by" on "producer/creator 
tool" for a given corpus like CommonCrawl, we might be able to shame software 
companies and projects into fixing specific issues.  We could add these 
incrementally... and I see some benefit from even partial information (missing 
unicode mappings).
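The "group by" idea above can be sketched in a few lines, assuming each processed file yields a record of its producer string and the problem that was observed. The Finding type, the field names, and any producer names used with it are hypothetical; nothing here reads real PDF metadata.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical tally of observed problems per producer/creator tool. */
final class ProducerReport {
    /** One observation: which tool produced the file and which problem was seen. */
    static final class Finding {
        final String producer;
        final String issue;

        Finding(String producer, String issue) {
            this.producer = producer;
            this.issue = issue;
        }
    }

    /** Counts findings per producer and, within each producer, per issue type. */
    static Map<String, Map<String, Integer>> countByProducer(List<Finding> findings) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Finding f : findings) {
            counts.computeIfAbsent(f.producer, k -> new TreeMap<>())
                  .merge(f.issue, 1, Integer::sum);
        }
        return counts;
    }
}
```

Sorted maps keep the report stable across runs, which makes it easier to compare one corpus snapshot with the next.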

As I and others have pointed out, though, text can always be hosed, and there is no 
perfect "junk detector".  You can try to use tika-eval's out-of-vocabulary 
statistic as an indicator that the text is not "languagey", but it will 
incorrectly categorize parts lists, ISBNs and duck phyla as "bad."  More advanced 
machine learning (e.g. neural nets) may do a better job, but it will still be 
wrong some of the time.
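To make the out-of-vocabulary idea concrete, here is a rough sketch of such a statistic: the share of alphabetic tokens that do not appear in a known word list. It is not tika-eval's implementation; the tokenization rule, the class name, and the vocabulary passed in are assumptions.

```java
import java.util.Locale;
import java.util.Set;

/** Rough out-of-vocabulary rate; an illustration, not tika-eval's code. */
final class OovSketch {
    /** Returns the fraction of alphabetic tokens missing from the vocabulary, or 0.0 if there are no tokens. */
    static double oovRate(String text, Set<String> vocabulary) {
        int total = 0;
        int unknown = 0;
        // Split on anything that is not a letter; empty pieces are skipped.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (!vocabulary.contains(token)) {
                unknown++;
            }
        }
        return total == 0 ? 0.0 : (double) unknown / total;
    }
}
```

A high rate suggests broken extraction, but a legitimate document full of part codes or Latin taxon names would score just as badly, which is the false-positive problem described above.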

 

There's a reason Google is running OCR on at least some PDFs. :P

 

So, from an OS community perspective, I see two avenues of work:
 # improving reporting of "nonstandard" features of the PDF – or helping 
developers understand what types of "nonstandard" features can currently be 
detected with PDFBox
 # working together to improve a junk detector... a la Tika's

> Text extraction is gibberish
> 
>
> Key: PDFBOX-4737
> URL: https://issues.apache.org/jira/browse/PDFBOX-4737
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.18
>Reporter: Jorge Spinsanti
>Priority: Major
> Attachments: noUnicodeMapping.pdf, obfuscateTest_Duplicate_2_3.pdf
>
>
> As discussed in https://issues.apache.org/jira/browse/PDFBOX-4549, 
> there are many PDFs where the text extraction is gibberish.
> Perhaps you could add two modes (strict/lax) to text extraction, so that 
> gibberish output can be suppressed when it is not useful. I have attached a file to analyze the problem.
> [^noUnicodeMapping.pdf]






[jira] [Commented] (PDFBOX-4549) No Unicode mapping

2020-01-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010747#comment-17010747
 ] 

Tim Allison commented on PDFBOX-4549:
-

And then there's this gem on content masking attacks: 
[https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/markwood].
Many thanks to Peter Wyatt for bringing Markwood et al.'s work to my 
attention.

> No Unicode mapping
> --
>
> Key: PDFBOX-4549
> URL: https://issues.apache.org/jira/browse/PDFBOX-4549
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.15
>Reporter: Sergey Makarov
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.16, 3.0.0 PDFBox
>
> Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello, if I try to get text from the attached PDF, the result is empty output and 
> many warnings. The font is attached as well.
>  Acrobat Reader opens the file successfully; I can select and copy text, and save it as text.
> My code:
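> {code:java}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.io.MemoryUsageSetting;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> private static void parseOne(String path) throws IOException {
>     File file = new File(path);
>     PDFTextStripper tStripper = new PDFTextStripper();
>     // Mixed main-memory/temp-file buffering, with temp files in the given directory.
>     MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0, 5)
>             .setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>     // try-with-resources closes the document even if extraction throws.
>     try (PDDocument document = PDDocument.load(file, memUsageSetting)) {
>         if (!document.isEncrypted()) {
>             String pdfFileInText = tStripper.getText(document);
>             System.out.print(pdfFileInText);
>         }
>     }
> }
> {code}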
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
>  






[jira] [Commented] (PDFBOX-4549) No Unicode mapping

2020-01-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010741#comment-17010741
 ] 

Tim Allison commented on PDFBOX-4549:
-

These are good points, [~mkl].  See e.g. 
[http://www.vintasoft.com/forums/viewtopic.php?t=2320] for willful/intentional 
obfuscation of text.

Note that Google is running OCR on at least some PDFs.  See slides 50-51: 
[https://github.com/tballison/share/blob/master/slides/activate19/Activate2019_tika_tallison_20190911.pptx]

And even OCR can be gamed: [https://arxiv.org/abs/1802.05385]

 

:(







[jira] [Commented] (PDFBOX-4549) No Unicode mapping

2020-01-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009133#comment-17009133
 ] 

Tim Allison commented on PDFBOX-4549:
-

Perhaps tika-eval's out of vocabulary statistic?  Or implement your own from: 
[https://dl.acm.org/doi/10.1145/1600193.1600237]







[jira] [Commented] (PDFBOX-4715) Need to add release version for maven-compiler-plugin

2019-12-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1763#comment-1763
 ] 

Tim Allison commented on PDFBOX-4715:
-

{noformat}
[ERROR] error: release version 6 not supported
{noformat}
I'm not seeing that in mine with Maven 3.6.3.  That's the useful info I was 
hoping for!
> Need to add release version for maven-compiler-plugin
> -
>
> Key: PDFBOX-4715
> URL: https://issues.apache.org/jira/browse/PDFBOX-4715
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Blocker
> Fix For: 2.0.18
>
>
> If I build PDFBox with > Java 8, but then try to run it via Tika with Java 8, 
> I get:
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
> Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
> BatchProcess: at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> BatchProcess: at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> BatchProcess: at 
> org.apache.tika.batch.BatchProcess.mainLoop(BatchProcess.java:206)
> BatchProcess: at 
> org.apache.tika.batch.BatchProcess.call(BatchProcess.java:166)
> BatchProcess: at org.apache.tika.batch.BatchProcess.call(BatchProcess.java:52)
> BatchProcess: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> BatchProcess: at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> BatchProcess: at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> BatchProcess: at java.lang.Thread.run(Thread.java:748)
> BatchProcess:Caused by: java.lang.NoClassDefFoundError: Could not initialize 
> class org.apache.pdfbox.pdmodel.font.PDType1Font
> BatchProcess: at 
> org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
> BatchProcess: at 
> org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> BatchProcess: at 
> org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
> BatchProcess: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:875)
> BatchProcess: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:509)
> BatchProcess: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:483)
> BatchProcess: at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
> BatchProcess: at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> BatchProcess: at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> BatchProcess: at 
> org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:153)
> BatchProcess: at 
> org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:867)
> {noformat}
> and 
> {noformat}
> java.lang.NoSuchMethodError: 
> java.nio.ByteBuffer.position(I)Ljava/nio/ByteBuffer;
> BatchProcess: at 
> org.apache.fontbox.type1.Type1Lexer.readToken(Type1Lexer.java:184)
> BatchProcess: at 
> org.apache.fontbox.type1.Type1Lexer.init(Type1Lexer.java:64)
> BatchProcess: at 
> org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:86)
> BatchProcess: at 
> org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
> BatchProcess: at 
> org.apache.fontbox.type1.Type1Font.createWithPFB(Type1Font.java:56)
> BatchProcess: at 
> org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getType1Font(FileSystemFontProvider.java:259)
> BatchProcess: at 
> org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:131)
> {noformat}
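The second trace is the usual symptom of compiling on JDK 9 or later without a release setting and then running on Java 8: since Java 9, ByteBuffer declares its own position(int) that returns ByteBuffer, so the compiled call refers to a method descriptor that Java 8 does not have. Setting the compiler's release option, as the issue title asks, addresses this at build level. A source-level workaround is sketched below; the helper class is hypothetical and is not claimed to be what PDFBox or FontBox did.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

/** Hypothetical helper that positions a ByteBuffer in a way that links on Java 8 and later. */
final class BufferCompat {
    /** Sets the buffer's position and returns the same buffer. */
    static ByteBuffer position(ByteBuffer buffer, int newPosition) {
        // The cast makes the compiler bind to Buffer.position(int),
        // which exists with the same descriptor on Java 8 and on newer releases.
        ((Buffer) buffer).position(newPosition);
        return buffer;
    }
}
```

The cast only changes which method descriptor ends up in the class file; at run time the same buffer object is repositioned either way.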



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4715) Need to add release version for maven-compiler-plugin

2019-12-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1757#comment-1757
 ] 

Tim Allison commented on PDFBOX-4715:
-

Added requireJavaVersion in 2.x branch.  






