[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324082#comment-17324082
 ] 

Tim Allison commented on PDFBOX-5166:
-

Ha @bitsgalore has an example of subtype=Screen.  Yay! 

https://twitter.com/_tallison/status/1383164998629924870?s=20

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
>Reporter: Tim Allison
>Priority: Minor
>  Labels: Annotations
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Re: PDFBox 3.0.0-SNAPSHOT reports

2021-04-16 Thread Tim Allison
Hi All,
 I reran 2.0.23 with our added handling for flash files against the
3.0.0-SNAPSHOT that I ran yesterday.  The diffs look almost the same
as the reports I created yesterday, so I think those are accurate:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.23-richmedia.tgz

There are a handful of files that "lose" attachments going into
3.0.0-SNAPSHOT because I haven't added the richmedia handling in our
3.0.0 branch.

 Best,

   Tim

On Thu, Apr 15, 2021 at 7:15 PM Tim Allison  wrote:
>
> Diffs look suspiciously small...I may have to rerun the analyses.
>
> On Thu, Apr 15, 2021 at 7:08 PM Tim Allison  wrote:
> >
> > Latest here: 
> > https://corpora.tika.apache.org/base/reports/pdfbox-3.0.0-20210415_reports.tgz
> >
> > I haven't had a chance to look yet.  Will dig in tomorrow.




[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324048#comment-17324048
 ] 

Tim Allison commented on PDFBOX-5166:
-

Are those also streams in subtype=RichMedia or do we need to look for other 
subtypes?




[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5166:

Labels: Annotations  (was: )




[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5166:

Component/s: PDModel




[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324005#comment-17324005
 ] 

Maruan Sahyoun edited comment on PDFBOX-5166 at 4/16/21, 6:16 PM:
--

Yes, there are: multimedia content such as sound or video, and there is 3D 
content. And there are collections. At the end of the day most boil down to 
being streams, but I'm not sure whether you detect and extract them. 


was (Author: msahyoun):
Yes there is -  multimedia content such as sound or video and there is 3D 
content. 




[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324005#comment-17324005
 ] 

Maruan Sahyoun commented on PDFBOX-5166:


Yes, there are: multimedia content such as sound or video, and there is 3D 
content. 




[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002
 ] 

Tim Allison edited comment on PDFBOX-5166 at 4/16/21, 6:07 PM:
---

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that it would be better 
to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?


was (Author: talli...@mitre.org):
Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that'd it would be 
better to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?




[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002
 ] 

Tim Allison commented on PDFBOX-5166:
-

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to 
add or modify.  I'm ok with Tilman's example code for now, but I worry that 
we'll likely come across some required special handling that'd it would be 
better to have in PDFBox.  

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that 
might conceal embedded files?
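
For reference, a few annotation subtypes defined in the PDF specification (plus RichMedia from the Adobe extensions) can point at media or file payloads. The lookup below is only a sketch: the class and method names are hypothetical, and it lists the dictionary entry where one would start looking, not what PDFBox or Tika currently extract.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup: annotation subtype -> entry that leads to a payload. */
final class EmbeddedContentSubtypes {

    private static final Map<String, String> PAYLOAD_ENTRY = new LinkedHashMap<>();

    static {
        PAYLOAD_ENTRY.put("FileAttachment", "FS");         // file specification
        PAYLOAD_ENTRY.put("Sound", "Sound");               // sound stream
        PAYLOAD_ENTRY.put("Movie", "Movie");               // movie dictionary
        PAYLOAD_ENTRY.put("Screen", "A");                  // action, e.g. a rendition action
        PAYLOAD_ENTRY.put("3D", "3DD");                    // 3D stream or reference
        PAYLOAD_ENTRY.put("RichMedia", "RichMediaContent"); // holds the Assets name tree
    }

    /** Returns the entry to inspect for the given subtype, or empty if none is listed. */
    static Optional<String> payloadEntry(String subtype) {
        return Optional.ofNullable(PAYLOAD_ENTRY.get(subtype));
    }
}
```

A subtype missing from this table is not proof that it carries nothing; the table only covers the cases mentioned in this thread and the ones known from the specification.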




[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323999#comment-17323999
 ] 

Maruan Sahyoun commented on PDFBOX-5166:


Would it be enough for your purpose to implement the bits needed to extract 
the Assets? 
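
As a rough illustration of what extracting the Assets involves: in the Adobe extensions to ISO 32000 that define RichMedia, the annotation's RichMediaContent dictionary holds an Assets name tree whose leaves map asset names to file specifications, and the embedded bytes sit in the file specification's EF/F stream. The sketch below walks that structure in plain Java, with nested Maps and Lists standing in for PDF dictionaries and arrays. The class and method names are hypothetical and this is not PDFBox API; real code would go through COSDictionary and the file specification classes instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: Maps and Lists stand in for PDF dictionaries and arrays. */
final class RichMediaAssetSketch {

    /** Returns asset name -> embedded bytes, or an empty map if nothing is found. */
    static Map<String, byte[]> extractAssets(Map<String, Object> annotation) {
        Map<String, byte[]> assets = new LinkedHashMap<>();
        if (!"RichMedia".equals(annotation.get("Subtype"))) {
            return assets; // not a RichMedia annotation
        }
        Object content = annotation.get("RichMediaContent");
        if (content instanceof Map) {
            Object tree = ((Map<?, ?>) content).get("Assets");
            if (tree instanceof Map) {
                walkNameTree((Map<?, ?>) tree, assets);
            }
        }
        return assets;
    }

    /** Visits a name tree node: leaf pairs in Names, child nodes in Kids. */
    private static void walkNameTree(Map<?, ?> node, Map<String, byte[]> out) {
        Object names = node.get("Names");
        if (names instanceof List) {
            List<?> pairs = (List<?>) names;
            // Names is a flat array of alternating name / file specification entries.
            for (int i = 0; i + 1 < pairs.size(); i += 2) {
                Object name = pairs.get(i);
                Object fileSpec = pairs.get(i + 1);
                if (name instanceof String && fileSpec instanceof Map) {
                    Object ef = ((Map<?, ?>) fileSpec).get("EF");
                    if (ef instanceof Map) {
                        Object stream = ((Map<?, ?>) ef).get("F");
                        if (stream instanceof byte[]) {
                            out.put((String) name, (byte[]) stream);
                        }
                    }
                }
            }
        }
        Object kids = node.get("Kids");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                if (kid instanceof Map) {
                    walkNameTree((Map<?, ?>) kid, out);
                }
            }
        }
    }
}
```

The sketch does not guard against cyclic Kids references, which a parser handling untrusted files would need to do.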




[jira] [Comment Edited] (PDFBOX-5164) Create portable collection PDF

2021-04-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323937#comment-17323937
 ] 

Tilman Hausherr edited comment on PDFBOX-5164 at 4/16/21, 5:20 PM:
---

Here's some code; there are no collection classes in PDFBox yet. Get the 
EmbeddedFiles.java example from the source code and then add this near the end. 

This covers a single file, but you can add several. For that, create more 
"ciDict" dictionaries and, of course, more PDComplexFileSpecification objects.
{code}
// Needs these imports at the top of EmbeddedFiles.java:
//   import org.apache.pdfbox.cos.COSDictionary;
//   import org.apache.pdfbox.cos.COSName;
// "doc" (the PDDocument) and "fs" (the PDComplexFileSpecification) are assumed
// to be the variables already defined in that example.

// Collection dictionary with its schema and sort dictionaries
COSDictionary collectionDic = new COSDictionary();
COSDictionary schemaDict = new COSDictionary();
schemaDict.setItem(COSName.TYPE, COSName.COLLECTION_SCHEMA);
COSDictionary sortDic = new COSDictionary();
sortDic.setItem(COSName.TYPE, COSName.COLLECTION_SORT);
sortDic.setBoolean(COSName.A, true); // sort ascending (A is a boolean in the PDF spec)
// "it identifies a field described in the parent collection dictionary"
sortDic.setItem(COSName.S, COSName.getPDFName("fieldtwo"));
collectionDic.setItem(COSName.TYPE, COSName.COLLECTION);
collectionDic.setItem(COSName.SCHEMA, schemaDict);
collectionDic.setItem(COSName.SORT, sortDic);
collectionDic.setItem(COSName.VIEW, COSName.D); // Details mode

// One field dictionary per column
COSDictionary fieldDict1 = new COSDictionary();
fieldDict1.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict1.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict1.setString(COSName.N, "field header one"); // header text
fieldDict1.setInt(COSName.O, 1); // order on the screen
COSDictionary fieldDict2 = new COSDictionary();
fieldDict2.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict2.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict2.setString(COSName.N, "field header two");
fieldDict2.setInt(COSName.O, 2);
COSDictionary fieldDict3 = new COSDictionary();
fieldDict3.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict3.setItem(COSName.SUBTYPE, COSName.N); // type: number field
fieldDict3.setString(COSName.N, "field header three");
fieldDict3.setInt(COSName.O, 3);

// Register the fields in the schema
schemaDict.setItem("fieldone", fieldDict1); // field name (this is a key)
schemaDict.setItem("fieldtwo", fieldDict2);
schemaDict.setItem("fieldthree", fieldDict3);

// Attach the collection to the document catalog
doc.getDocumentCatalog().getCOSObject().setItem(COSName.COLLECTION, collectionDic);
doc.getDocumentCatalog().setVersion("1.7");

// Collection item for the embedded file: one value per schema field
COSDictionary ciDict1 = new COSDictionary();
ciDict1.setItem(COSName.TYPE, COSName.COLLECTION_ITEM);
// use the field names from earlier
ciDict1.setString("fieldone", "Very interesting file");
ciDict1.setString("fieldtwo", fs.getFile());
ciDict1.setInt("fieldthree", 333);
fs.getCOSObject().setItem(COSName.CI, ciDict1);
{code}

Use the latest snapshot in 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.24-SNAPSHOT/
to get the new constants.

Result file:  [^collection.pdf] 
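
Since the sort key and every collection item key have to match a field name in the schema, a small consistency check can catch typos before the PDF is written. The following is a plain-Java sketch with hypothetical class and method names; it models field names as strings and does not use PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: checks item keys and the sort key against the schema. */
final class CollectionSchemaCheck {

    /** Returns a list of human-readable problems; empty means consistent. */
    static List<String> findProblems(Set<String> schemaFields,
                                     String sortField,
                                     List<Map<String, Object>> items) {
        List<String> problems = new ArrayList<>();
        // The sort dictionary's S entry must name a field from the schema.
        if (!schemaFields.contains(sortField)) {
            problems.add("sort field '" + sortField + "' is not in the schema");
        }
        // Each collection item may only use keys that the schema defines.
        for (int i = 0; i < items.size(); i++) {
            for (String key : items.get(i).keySet()) {
                if (!schemaFields.contains(key)) {
                    problems.add("item " + i + " uses unknown field '" + key + "'");
                }
            }
        }
        return problems;
    }
}
```

With the names used above, the schema fields would be "fieldone", "fieldtwo" and "fieldthree", the sort field "fieldtwo", and each "ciDict" would contribute one map of values.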



[jira] [Updated] (PDFBOX-5164) Create portable collection PDF

2021-04-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5164:

Summary: Create portable collection PDF  (was: out version is already 
2.0.23 .  I want customize the colums myself  in the red box use java.)

> Create portable collection PDF
> --
>
> Key: PDFBOX-5164
> URL: https://issues.apache.org/jira/browse/PDFBOX-5164
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.18
> Environment: java
>Reporter: zhouxiaolong
>Priority: Major
> Fix For: 2.0.24, 4.0.0
>
> Attachments: MakePackage.java, collection.pdf, 
> image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!






[jira] [Comment Edited] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323937#comment-17323937
 ] 

Tilman Hausherr edited comment on PDFBOX-5164 at 4/16/21, 4:42 PM:
---

Here's some code... there are not yet any collection classes in PDFBox. Get the 
EmbeddedFiles.java example from the source code and then add this near the end. 

This is about a single file but you can add several ones. For that, create more 
"ciDict" dictionaries and of course more PDComplexFileSpecification objects.
{code}
COSDictionary collectionDic = new COSDictionary();
COSDictionary schemaDict = new COSDictionary();
schemaDict.setItem(COSName.TYPE, COSName.COLLECTION_SCHEMA);
COSDictionary sortDic = new COSDictionary();
sortDic.setItem(COSName.TYPE, COSName.COLLECTION_SORT);
sortDic.setBoolean(COSName.A, true); // sort ascending (A is a boolean in the PDF spec)
// "it identifies a field described in the parent collection dictionary"
sortDic.setItem(COSName.S, COSName.getPDFName("fieldtwo"));
collectionDic.setItem(COSName.TYPE, COSName.COLLECTION);
collectionDic.setItem(COSName.SCHEMA, schemaDict);
collectionDic.setItem(COSName.SORT, sortDic);
collectionDic.setItem(COSName.VIEW, COSName.D); // Details mode
COSDictionary fieldDict1 = new COSDictionary();
fieldDict1.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict1.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict1.setString(COSName.N, "field header one"); // header text
fieldDict1.setInt(COSName.O, 1); // order on the screen
COSDictionary fieldDict2 = new COSDictionary();
fieldDict2.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict2.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict2.setString(COSName.N, "field header two");
fieldDict2.setInt(COSName.O, 2);
COSDictionary fieldDict3 = new COSDictionary();
fieldDict3.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict3.setItem(COSName.SUBTYPE, COSName.N); // type: number field
fieldDict3.setString(COSName.N, "field header three");
fieldDict3.setInt(COSName.O, 3);
schemaDict.setItem("fieldone", fieldDict1); // field name (this is a key)
schemaDict.setItem("fieldtwo", fieldDict2);
schemaDict.setItem("fieldthree", fieldDict3);
doc.getDocumentCatalog().getCOSObject().setItem(COSName.COLLECTION, 
collectionDic);
doc.getDocumentCatalog().setVersion("1.7");


COSDictionary ciDict1 = new COSDictionary();
ciDict1.setItem(COSName.TYPE, COSName.COLLECTION_ITEM);
// use the field names from earlier
ciDict1.setString("fieldone", "Very interesting file");
ciDict1.setString("fieldtwo", fs.getFile());
ciDict1.setInt("fieldthree", 333);
fs.getCOSObject().setItem(COSName.CI, ciDict1);
{code}

Use the latest snapshot in 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.24-SNAPSHOT/
(it's building right now) to get the new constants.

Result file:  [^collection.pdf] 



[jira] [Comment Edited] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323937#comment-17323937
 ] 

Tilman Hausherr edited comment on PDFBOX-5164 at 4/16/21, 4:41 PM:
---

Here's some code... there are not yet any collection classes in PDFBox. Get the 
EmbeddedFiles.java example from the source code and then add this near the end. 

This is about a single file but you can add several ones. For that, create more 
"ciDict" dictionaries and of course more PDComplexFileSpecification objects.
{code}
COSDictionary collectionDic = new COSDictionary();
COSDictionary schemaDict = new COSDictionary();
schemaDict.setItem(COSName.TYPE, COSName.COLLECTION_SCHEMA);
COSDictionary sortDic = new COSDictionary();
sortDic.setItem(COSName.TYPE, COSName.COLLECTION_SORT);
sortDic.setBoolean(COSName.A, true); // sort ascending (A is a boolean in the PDF spec)
// "it identifies a field described in the parent collection dictionary"
sortDic.setItem(COSName.S, COSName.getPDFName("fieldtwo"));
collectionDic.setItem(COSName.TYPE, COSName.COLLECTION);
collectionDic.setItem(COSName.SCHEMA, schemaDict);
collectionDic.setItem(COSName.SORT, sortDic);
collectionDic.setItem(COSName.VIEW, COSName.D); // Details mode
COSDictionary fieldDict1 = new COSDictionary();
fieldDict1.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict1.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict1.setString(COSName.N, "field header one"); // header text
fieldDict1.setInt(COSName.O, 1); // order on the screen
COSDictionary fieldDict2 = new COSDictionary();
fieldDict2.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict2.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict2.setString(COSName.N, "field header two");
fieldDict2.setInt(COSName.O, 2);
COSDictionary fieldDict3 = new COSDictionary();
fieldDict3.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict3.setItem(COSName.SUBTYPE, COSName.N); // type: number field
fieldDict3.setString(COSName.N, "field header three");
fieldDict3.setInt(COSName.O, 3);
schemaDict.setItem("fieldone", fieldDict1); // field name (this is a key)
schemaDict.setItem("fieldtwo", fieldDict2);
schemaDict.setItem("fieldthree", fieldDict3);
doc.getDocumentCatalog().getCOSObject().setItem(COSName.COLLECTION, 
collectionDic);
doc.getDocumentCatalog().setVersion("1.7");


COSDictionary ciDict1 = new COSDictionary();
ciDict1.setItem(COSName.TYPE, COSName.COLLECTION_ITEM);
// use the field names from earlier
ciDict1.setString("fieldone", "Very interesting file");
ciDict1.setString("fieldtwo", fs.getFile());
ciDict1.setInt("fieldthree", 333);
fs.getCOSObject().setItem(COSName.CI, ciDict1);
{code}

Use the latest snapshot in 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.24-SNAPSHOT/
(it's building right now) to get the new constants



[jira] [Commented] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323937#comment-17323937
 ] 

Tilman Hausherr commented on PDFBOX-5164:
-

This is about a single file but you can add several ones. For that, create more 
"ciDict" dictionaries and of course more PDComplexFileSpecification objects.
{code}
COSDictionary collectionDic = new COSDictionary();
COSDictionary schemaDict = new COSDictionary();
schemaDict.setItem(COSName.TYPE, COSName.COLLECTION_SCHEMA);
COSDictionary sortDic = new COSDictionary();
sortDic.setItem(COSName.TYPE, COSName.COLLECTION_SORT);
sortDic.setBoolean(COSName.A, true); // sort ascending (A is a boolean in the PDF spec)
// "it identifies a field described in the parent collection dictionary"
sortDic.setItem(COSName.S, COSName.getPDFName("fieldtwo"));
collectionDic.setItem(COSName.TYPE, COSName.COLLECTION);
collectionDic.setItem(COSName.SCHEMA, schemaDict);
collectionDic.setItem(COSName.SORT, sortDic);
collectionDic.setItem(COSName.VIEW, COSName.D); // Details mode
COSDictionary fieldDict1 = new COSDictionary();
fieldDict1.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict1.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict1.setString(COSName.N, "field header one"); // header text
fieldDict1.setInt(COSName.O, 1); // order on the screen
COSDictionary fieldDict2 = new COSDictionary();
fieldDict2.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict2.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict2.setString(COSName.N, "field header two");
fieldDict2.setInt(COSName.O, 2);
COSDictionary fieldDict3 = new COSDictionary();
fieldDict3.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict3.setItem(COSName.SUBTYPE, COSName.N); // type: number field
fieldDict3.setString(COSName.N, "field header three");
fieldDict3.setInt(COSName.O, 3);
schemaDict.setItem("fieldone", fieldDict1); // field name (this is a key)
schemaDict.setItem("fieldtwo", fieldDict2);
schemaDict.setItem("fieldthree", fieldDict3);
doc.getDocumentCatalog().getCOSObject().setItem(COSName.COLLECTION, collectionDic);
doc.getDocumentCatalog().setVersion("1.7");


COSDictionary ciDict1 = new COSDictionary();
ciDict1.setItem(COSName.TYPE, COSName.COLLECTION_ITEM);
// use the field names from earlier
ciDict1.setString("fieldone", "Very interesting file");
ciDict1.setString("fieldtwo", fs.getFile());
ciDict1.setInt("fieldthree", 333);
fs.getCOSObject().setItem(COSName.CI, ciDict1);
{code}

Use the latest snapshot from 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.24-SNAPSHOT/
(it is building right now) to get the new constants.

> out version is already 2.0.23 .  I want customize the colums myself  in the 
> red box use java.
> -
>
> Key: PDFBOX-5164
> URL: https://issues.apache.org/jira/browse/PDFBOX-5164
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.18
> Environment: java
>Reporter: zhouxiaolong
>Priority: Major
> Fix For: 2.0.24, 4.0.0
>
> Attachments: MakePackage.java, collection.pdf, 
> image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323935#comment-17323935
 ] 

ASF subversion and git services commented on PDFBOX-5164:
-

Commit 130 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r130 ]

PDFBOX-5164: add constants for collections




[jira] [Commented] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323934#comment-17323934
 ] 

ASF subversion and git services commented on PDFBOX-5164:
-

Commit 129 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r129 ]

PDFBOX-5164: add constants for collections




[jira] [Updated] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.

2021-04-16 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5164:

Attachment: collection.pdf




[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5166:

Issue Type: New Feature  (was: Task)

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Comment Edited] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831
 ] 

Tim Allison edited comment on PDFBOX-5165 at 4/16/21, 1:52 PM:
---

Thank you for the quick fix!

Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.


was (Author: talli...@mitre.org):
Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.

> Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in 
> JempBox
> ---
>
> Key: PDFBOX-5165
> URL: https://issues.apache.org/jira/browse/PDFBOX-5165
> Project: PDFBox
>  Issue Type: Task
>  Components: JempBox
>Affects Versions: 1.8.16
>Reporter: Tim Allison
>Assignee: Tilman Hausherr
>Priority: Trivial
>  Labels: optimization
> Fix For: 1.8.17
>
>







[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831
 ] 

Tim Allison commented on PDFBOX-5165:
-

Unless there are needs on other projects, we have no immediate need on the Tika 
side.  Let's wait a bit to see if anything else falls out of the regression 
tests with PDFBox 3.0.0-SNAPSHOT.

At some point, it would be great to have an updated jempbox for this issue and 
also for the rare date/time concurrency issue.




[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-5166:

Priority: Minor  (was: Major)

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323809#comment-17323809
 ] 

Tim Allison commented on PDFBOX-5166:
-

Completely unsurprisingly, [~tilman] has already shown how to extract these 
files on SO: 
https://stackoverflow.com/questions/45460027/what-is-the-best-way-to-extract-embedded-flash-file-from-a-pdf-using-the-pdfbox

If this is a "not going to fix", no problem!  I'm happy to put that code into 
Tika for now, and if a RichMedia annotation gets implemented in PDFBox, I can 
update our code accordingly.
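
As background for that extraction, here is a plain-Java sketch of the traversal involved. It rests on an assumption taken from Adobe's RichMedia extension to PDF 1.7, not from this thread: the annotation's RichMediaContent dictionary holds an Assets name tree, whose leaf nodes carry a flat Names array of alternating key/value entries and whose intermediate nodes carry Kids. The `NameTreeNode` class below is hypothetical and uses no PDFBox types; in real code the values would be file specification dictionaries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one PDF name tree node; this is not a PDFBox class. */
final class NameTreeNode {
    // Leaf nodes: alternating key, value, key, value ... as in a /Names array.
    final List<Object> names = new ArrayList<>();
    // Intermediate nodes: child nodes, as in a /Kids array.
    final List<NameTreeNode> kids = new ArrayList<>();

    NameTreeNode put(String key, Object value) {
        names.add(key);
        names.add(value);
        return this;
    }

    NameTreeNode kid(NameTreeNode child) {
        kids.add(child);
        return this;
    }

    /** Flattens the whole tree into one map, preserving document order. */
    static Map<String, Object> flatten(NameTreeNode root) {
        Map<String, Object> out = new LinkedHashMap<>();
        collect(root, out, 0);
        return out;
    }

    private static void collect(NameTreeNode node, Map<String, Object> out, int depth) {
        // Arbitrary cap for this sketch, so a malformed tree cannot recurse without bound.
        if (depth > 32) {
            throw new IllegalStateException("name tree nested too deeply");
        }
        // Walk the key/value pairs; a dangling key without a value is ignored.
        for (int i = 0; i + 1 < node.names.size(); i += 2) {
            out.put((String) node.names.get(i), node.names.get(i + 1));
        }
        for (NameTreeNode child : node.kids) {
            collect(child, out, depth + 1);
        }
    }
}
```

Each value in the flattened map corresponds to one embedded asset, such as the Flash/swf file in the attached test document.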

> Implement RichMedia annotation
> --
>
> Key: PDFBOX-5166
> URL: https://issues.apache.org/jira/browse/PDFBOX-5166
> Project: PDFBox
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
> COSDictionary, I can see the subtype is "RichMedia".  If someone has the 
> time, it'd be great to implement this so that we can extract more attachments 
> in Tika...  Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to 
> @terminalboredom and @beet_keeper.






[jira] [Created] (PDFBOX-5166) Implement RichMedia annotation

2021-04-16 Thread Tim Allison (Jira)
Tim Allison created PDFBOX-5166:
---

 Summary: Implement RichMedia annotation
 Key: PDFBOX-5166
 URL: https://issues.apache.org/jira/browse/PDFBOX-5166
 Project: PDFBox
  Issue Type: Task
Reporter: Tim Allison
 Attachments: testFlashInPDF.pdf

See TIKA-3359.  The attached file has an embedded Flash/swf file.  Tika is not 
currently extracting the embedded file.

In the debugger, I can see the Annotation as a PDAnnotationUnknown.  In the 
COSDictionary, I can see the subtype is "RichMedia".  If someone has the time, 
it'd be great to implement this so that we can extract more attachments in 
Tika...  Obv, others may find use too. :D

Many thanks to Tyler Thorsted for the test file and many thanks to 
@terminalboredom and @beet_keeper.







[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox

2021-04-16 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322688#comment-17322688
 ] 

Maruan Sahyoun commented on PDFBOX-5165:


[~tilman] thanks for taking care of that. Would we need to cut a release so 
that Tika can use the update to jempbox?




[jira] [Issue Comment Deleted] (PDFBOX-5163) Stack overflow when reading a corrupt dictionary

2021-04-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-5163:
---
Comment: was deleted

(was: Commit 114 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r114 ]

PDFBOX-5163: remove unused class)

> Stack overflow when reading a corrupt dictionary
> 
>
> Key: PDFBOX-5163
> URL: https://issues.apache.org/jira/browse/PDFBOX-5163
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.23, 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: crash_stack_overflow_sample.pdf
>
>
> Richard Smith/Chaoyuan Peng reported an issue with the current version 
> 2.0.23. When parsing a carefully handcrafted pdf the following exception 
> occurs and PDFBox crashes:
> {code}
> java.lang.StackOverflowError: null
> java.util.WeakHashMap.eq(Unknown Source)
> java.util.WeakHashMap.get(Unknown Source)
> java.util.Collections$SynchronizedMap.get(Unknown Source)
> org.apache.pdfbox.debugger.ui.LogDialog.log(LogDialog.java:143)
> org.apache.pdfbox.debugger.ui.DebugLog.warn(DebugLog.java:156)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:933)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:859)
> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
> org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1060)
> {code}
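
The trace above shows the parser recursing through nested dictionaries until the JVM stack is exhausted. One generic defence is a depth limit that turns runaway recursion into an ordinary exception. The following is a minimal sketch of that idea only; it is not the fix committed for this issue, and the `DepthLimitedParser` class and its limit parameter are hypothetical. It tracks only `<<` and `>>` nesting and skips everything else.

```java
import java.io.IOException;

/** Hypothetical toy parser; it only tracks "<<" and ">>" nesting. */
final class DepthLimitedParser {
    private final String input;
    private final int maxDepth;
    private int pos;

    private DepthLimitedParser(String input, int maxDepth) {
        this.input = input.trim();
        this.maxDepth = maxDepth;
    }

    /** Parses one dictionary and returns the deepest nesting level it contains. */
    static int parse(String input, int maxDepth) throws IOException {
        return new DepthLimitedParser(input, maxDepth).parseDictionary(1);
    }

    private int parseDictionary(int depth) throws IOException {
        // Fail with a checked exception long before the JVM stack runs out.
        if (depth > maxDepth) {
            throw new IOException("dictionary nesting exceeds " + maxDepth);
        }
        if (!input.startsWith("<<", pos)) {
            throw new IOException("expected '<<' at offset " + pos);
        }
        pos += 2;
        int deepest = depth;
        while (pos < input.length()) {
            if (input.startsWith(">>", pos)) {
                pos += 2;
                return deepest;
            }
            if (input.startsWith("<<", pos)) {
                // A nested dictionary: recurse one level deeper.
                deepest = Math.max(deepest, parseDictionary(depth + 1));
            } else {
                pos++; // keys and scalar values are skipped in this sketch
            }
        }
        throw new IOException("unterminated dictionary");
    }
}
```

With a limit in place, a caller handling untrusted files can catch the IOException and reject the file rather than crash with a StackOverflowError.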






[jira] [Commented] (PDFBOX-5143) Refactor/Simplify CFF parsing

2021-04-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322646#comment-17322646
 ] 

Andreas Lehmkühler commented on PDFBOX-5143:



Commit 114 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r114 ]

PDFBOX-5143: remove unused class

> Refactor/Simplify CFF parsing
> -
>
> Key: PDFBOX-5143
> URL: https://issues.apache.org/jira/browse/PDFBOX-5143
> Project: PDFBox
>  Issue Type: Improvement
>  Components: FontBox
>Affects Versions: 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
>
> The classes used for the parsing of CFF-based fonts have some room for 
> improvements w.r.t. the memory footprint, the complexity of the code and the 
> test coverage.






[jira] [Commented] (PDFBOX-5163) Stack overflow when reading a corrupt dictionary

2021-04-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322639#comment-17322639
 ] 

ASF subversion and git services commented on PDFBOX-5163:
-

Commit 114 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r114 ]

PDFBOX-5163: remove unused class




Re: Apache PDFBox Board Report April 2021 due

2021-04-16 Thread Andreas Lehmkuehler

On 14.04.21 at 19:30, Tilman Hausherr wrote:

+1

You could mention that there is an instructor at Wright State University who uses 
PDFBox in his class as a starting point

https://github.com/erikbuck/pdfbox/blob/patch-1/SRS%20(Requirements%20Document)
https://github.com/erikbuck/pdfbox/blob/patch-1/Software%20Design%20Document

Thanks, good point, I've added that detail to the report

Andreas



Tilman

On 14.04.2021 at 08:29, Andreas Lehmkuehler wrote:

Hi,

find attached a quick draft of the board report we're expected to submit this
month. It's based upon the report wizard template which can be found at [1]

Any comments or additions are appreciated ...



## Description:
The mission of PDFBox is the creation and maintenance of software related to
a Java library for working with PDF documents

## Issues:
There are no issues requiring board attention at this time.

Some bugs were reported via secur...@apache.org and two of them resulted in
CVEs. Both were fixed in 2.0.23.

- CVE-2021-27906 Apache PDFBox: a carefully crafted PDF file can trigger an
 OutOfMemory-Exception while loading the file
- CVE-2021-27807 Apache PDFBox: a carefully crafted PDF file can trigger an
 infinite loop while loading the file

Credit goes to Fabian Meumertzheim, who found these issues while working on
OSS-Fuzz

## Membership Data:
Apache PDFBox was founded 2009-10-21 (11 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
- No new committers. Last addition was Joerg O. Henne on 2017-10-09.

## Project Activity:
Recent releases:

   2.0.23 was released on 2021-03-18.
   2.0.22 was released on 2020-12-19.
   2.0.21 was released on 2020-08-20.

## Community Health:
- there is a steady stream of contributions, bug reports and questions on the
 mailing lists
- there are a lot of refactorings, improvements and bugfixes
- the first alpha version of the upcoming new major release 3.0.0 was released
- some of the downstream projects already started to integrate the new release
 into their codebases. The feedback is positive so far.



Andreas

[1] https://reporter.apache.org/wizard/?pdfbox
