[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324082#comment-17324082 ]

Tim Allison commented on PDFBOX-5166:
-------------------------------------

Ha, @bitsgalore has an example of subtype=Screen. Yay!
https://twitter.com/_tallison/status/1383164998629924870?s=20

> Implement RichMedia annotation
> ------------------------------
>
>                 Key: PDFBOX-5166
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5166
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>            Reporter: Tim Allison
>            Priority: Minor
>              Labels: Annotations
>         Attachments: testFlashInPDF.pdf
>
>
> See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not
> currently extracting the embedded file.
> In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the
> COSDictionary, I can see the subtype is "RichMedia". If someone has the
> time, it'd be great to implement this so that we can extract more attachments
> in Tika... Obv, others may find use too. :D
> Many thanks to Tyler Thorsted for the test file and many thanks to
> @terminalboredom and @beet_keeper.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
Re: PDFBox 3.0.0-SNAPSHOT reports
Hi All,

I reran 2.0.23 with our added handling for flash files against the 3.0.0-SNAPSHOT that I ran yesterday. The diffs look almost the same as the reports I created yesterday, so I think those are accurate:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.23-richmedia.tgz

There are a handful of files that "lose" attachments going into 3.0.0-SNAPSHOT because I haven't added the richmedia handling in our 3.0.0 branch.

Best,
Tim

On Thu, Apr 15, 2021 at 7:15 PM Tim Allison wrote:
>
> Diffs look suspiciously small...I may have to rerun the analyses.
>
> On Thu, Apr 15, 2021 at 7:08 PM Tim Allison wrote:
> >
> > Latest here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-3.0.0-20210415_reports.tgz
> >
> > I haven't had a chance to look yet. Will dig in tomorrow.
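To make the "lose attachments" comparison concrete, here is a minimal sketch of how per-file attachment counts from two runs could be compared. It does not reflect the format of the actual Tika reports; the class name, method name and file names are made up for illustration, and the input is assumed to be a simple map from file name to attachment count.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika or PDFBox.
class AttachmentDiff {

    /**
     * Returns, for every file that has fewer attachments in the newer run,
     * the number of attachments that went missing (old count minus new count).
     * Files absent from the newer run are treated as having zero attachments.
     */
    static Map<String, Integer> lostAttachments(Map<String, Integer> oldRun,
                                                Map<String, Integer> newRun) {
        Map<String, Integer> lost = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : oldRun.entrySet()) {
            int before = entry.getValue();
            int after = newRun.getOrDefault(entry.getKey(), 0);
            if (after < before) {
                lost.put(entry.getKey(), before - after);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        // Made-up counts: one run with richmedia handling, one without.
        Map<String, Integer> withRichMedia = new LinkedHashMap<>();
        withRichMedia.put("flash-a.pdf", 1);
        withRichMedia.put("plain.pdf", 0);

        Map<String, Integer> withoutRichMedia = new LinkedHashMap<>();
        withoutRichMedia.put("flash-a.pdf", 0);
        withoutRichMedia.put("plain.pdf", 0);

        System.out.println(lostAttachments(withRichMedia, withoutRichMedia));
    }
}
```

Only files whose count drops are reported, so a small result set is expected when the two runs differ solely in richmedia handling.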
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324048#comment-17324048 ]

Tim Allison commented on PDFBOX-5166:
-------------------------------------

Are those also streams in subtype=RichMedia or do we need to look for other subtypes?
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5166:
------------------------------------
    Labels: Annotations  (was: )
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5166:
------------------------------------
    Component/s: PDModel
[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324005#comment-17324005 ]

Maruan Sahyoun edited comment on PDFBOX-5166 at 4/16/21, 6:16 PM:
------------------------------------------------------------------

Yes, there are: multimedia content such as sound or video, and 3D content. And there are collections. At the end of the day most boil down to being streams, but I'm not sure whether you detect and extract them.

was (Author: msahyoun):
Yes there is - multimedia content such as sound or video and there is 3D content.
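Since most of these cases "boil down to being streams", one subtype-independent approach is to walk the object graph below an annotation and collect every stream it reaches. The sketch below is only a PDFBox-free model: dictionaries are plain `Map`s, arrays are `List`s, and `Stream` is a hypothetical stand-in class. Skipping the `P`, `Parent` and `AP` keys is an assumption made here so that the walk does not wander back to the page or pick up appearance streams; real code would need its own policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model, not the PDFBox API.
class StreamFinder {

    /** Hypothetical stand-in for a PDF stream: a dictionary plus raw bytes. */
    static final class Stream {
        final Map<String, Object> dict = new LinkedHashMap<>();
        final byte[] data;
        Stream(byte[] data) { this.data = data; }
    }

    // Assumption: keys that lead back to the page tree or to appearance streams.
    private static final Set<String> SKIPPED_KEYS =
            new HashSet<>(Arrays.asList("P", "Parent", "AP"));

    /** Collects every stream reachable from the given object. */
    static List<Stream> findStreams(Object root) {
        List<Stream> found = new ArrayList<>();
        // Identity-based "seen" set, so cyclic references terminate.
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        walk(root, seen, found);
        return found;
    }

    private static void walk(Object node, Set<Object> seen, List<Stream> found) {
        if (node == null || !seen.add(node)) {
            return;
        }
        if (node instanceof Stream) {
            found.add((Stream) node);
            walk(((Stream) node).dict, seen, found);
        } else if (node instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                if (!SKIPPED_KEYS.contains(entry.getKey())) {
                    walk(entry.getValue(), seen, found);
                }
            }
        } else if (node instanceof List) {
            for (Object element : (List<?>) node) {
                walk(element, seen, found);
            }
        }
    }

    public static void main(String[] args) {
        // Toy annotation with one nested stream; the nesting is illustrative only.
        Map<String, Object> annotation = new LinkedHashMap<>();
        annotation.put("Subtype", "Screen");
        Map<String, Object> action = new LinkedHashMap<>();
        action.put("Media", new Stream(new byte[] {1, 2, 3}));
        annotation.put("A", action);
        System.out.println(findStreams(annotation).size());
    }
}
```

A walker like this trades precision for coverage: it needs no knowledge of each annotation subtype, but the caller still has to decide which of the streams it returns are worth treating as attachments.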
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324005#comment-17324005 ]

Maruan Sahyoun commented on PDFBOX-5166:
----------------------------------------

Yes, there are: multimedia content such as sound or video, and 3D content.
[jira] [Comment Edited] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002 ]

Tim Allison edited comment on PDFBOX-5166 at 4/16/21, 6:07 PM:
---------------------------------------------------------------

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify.

I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that it would be better to have in PDFBox.

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files?

was (Author: talli...@mitre.org):
Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify. I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that'd it would be better to have in PDFBox. This isn't high priority, and I don't see a need to backport to 2.x. Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files?
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324002#comment-17324002 ]

Tim Allison commented on PDFBOX-5166:
-------------------------------------

Extraction only, yes...for our purposes on Tika, we wouldn't have any need to add or modify.

I'm ok with Tilman's example code for now, but I worry that we'll likely come across some required special handling that it would be better to have in PDFBox.

This isn't high priority, and I don't see a need to backport to 2.x.

Separate topic...I'm wondering now if there are other annotation types that might conceal embedded files?
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323999#comment-17323999 ]

Maruan Sahyoun commented on PDFBOX-5166:
----------------------------------------

Would it be enough for your purpose to implement just the bits needed to extract the Assets?
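For readers wondering what "extract the Assets" might involve, here is a rough, PDFBox-free sketch. It assumes the layout described in Adobe's RichMedia extension to PDF 1.7: the annotation's RichMediaContent dictionary holds an Assets name tree whose leaves are file specifications, each with an embedded file stream under EF/F. Dictionaries are modelled as plain `Map`s, arrays as `List`s and streams as `byte[]`; the class and method names are hypothetical and this is not Tilman's example code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model, not the PDFBox API.
class RichMediaAssets {

    /** Returns asset name -> embedded bytes for a RichMedia annotation, else an empty map. */
    static Map<String, byte[]> extractAssets(Map<String, Object> annotation) {
        Map<String, byte[]> assets = new LinkedHashMap<>();
        if (!"RichMedia".equals(annotation.get("Subtype"))) {
            return assets;
        }
        Object content = annotation.get("RichMediaContent");
        if (content instanceof Map) {
            collect(((Map<?, ?>) content).get("Assets"), assets);
        }
        return assets;
    }

    /** Walks a name tree node: leaf entries sit in "Names" as [name, value, ...], children in "Kids". */
    private static void collect(Object node, Map<String, byte[]> out) {
        if (!(node instanceof Map)) {
            return;
        }
        Map<?, ?> dict = (Map<?, ?>) node;
        Object names = dict.get("Names");
        if (names instanceof List) {
            List<?> pairs = (List<?>) names;
            for (int i = 0; i + 1 < pairs.size(); i += 2) {
                byte[] data = embeddedFile(pairs.get(i + 1));
                if (pairs.get(i) instanceof String && data != null) {
                    out.put((String) pairs.get(i), data);
                }
            }
        }
        Object kids = dict.get("Kids");
        if (kids instanceof List) {
            for (Object kid : (List<?>) kids) {
                collect(kid, out);
            }
        }
    }

    /** Reads EF/F from a file specification dictionary, or returns null if it is missing. */
    private static byte[] embeddedFile(Object fileSpec) {
        if (!(fileSpec instanceof Map)) {
            return null;
        }
        Object ef = ((Map<?, ?>) fileSpec).get("EF");
        if (!(ef instanceof Map)) {
            return null;
        }
        Object stream = ((Map<?, ?>) ef).get("F");
        return stream instanceof byte[] ? (byte[]) stream : null;
    }

    public static void main(String[] args) {
        Map<String, Object> ef = new LinkedHashMap<>();
        ef.put("F", new byte[] {'F', 'W', 'S'});
        Map<String, Object> fileSpec = new LinkedHashMap<>();
        fileSpec.put("EF", ef);
        Map<String, Object> assetsTree = new LinkedHashMap<>();
        assetsTree.put("Names", Arrays.asList("movie.swf", fileSpec));
        Map<String, Object> content = new LinkedHashMap<>();
        content.put("Assets", assetsTree);
        Map<String, Object> annotation = new LinkedHashMap<>();
        annotation.put("Subtype", "RichMedia");
        annotation.put("RichMediaContent", content);
        System.out.println(extractAssets(annotation).keySet());
    }
}
```

Malformed entries are skipped rather than reported, which suits an extraction-only use like Tika's; an implementation that also writes or edits RichMedia annotations would need stricter handling.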
[jira] [Comment Edited] (PDFBOX-5164) Create portable collection PDF
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323937#comment-17323937 ]

Tilman Hausherr edited comment on PDFBOX-5164 at 4/16/21, 5:20 PM:
-------------------------------------------------------------------

Here's some code... there are not yet any collection classes in PDFBox. Get the EmbeddedFiles.java example from the source code and then add this near the end. This is about a single file, but you can add several. For that, create more "ciDict" dictionaries and of course more PDComplexFileSpecification objects.

{code}
// Fragment to add near the end of the EmbeddedFiles.java example; "doc" and "fs"
// are expected to be the document and file specification from that example.
// Needs org.apache.pdfbox.cos.COSDictionary and org.apache.pdfbox.cos.COSName.
COSDictionary collectionDic = new COSDictionary();
COSDictionary schemaDict = new COSDictionary();
schemaDict.setItem(COSName.TYPE, COSName.COLLECTION_SCHEMA);

// sort dictionary: which field to sort by, and in which direction
COSDictionary sortDic = new COSDictionary();
sortDic.setItem(COSName.TYPE, COSName.COLLECTION_SORT);
sortDic.setBoolean(COSName.A, true); // sort ascending (the PDF spec defines /A as a boolean, not a string)
sortDic.setItem(COSName.S, COSName.getPDFName("fieldtwo")); // "it identifies a field described in the parent collection dictionary"

collectionDic.setItem(COSName.TYPE, COSName.COLLECTION);
collectionDic.setItem(COSName.SCHEMA, schemaDict);
collectionDic.setItem(COSName.SORT, sortDic);
collectionDic.setItem(COSName.VIEW, COSName.D); // Details mode

// one field dictionary per column
COSDictionary fieldDict1 = new COSDictionary();
fieldDict1.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict1.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict1.setString(COSName.N, "field header one"); // header text
fieldDict1.setInt(COSName.O, 1); // order on the screen

COSDictionary fieldDict2 = new COSDictionary();
fieldDict2.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict2.setItem(COSName.SUBTYPE, COSName.S); // type: text field
fieldDict2.setString(COSName.N, "field header two");
fieldDict2.setInt(COSName.O, 2);

COSDictionary fieldDict3 = new COSDictionary();
fieldDict3.setItem(COSName.TYPE, COSName.COLLECTION_FIELD);
fieldDict3.setItem(COSName.SUBTYPE, COSName.N); // type: number field
fieldDict3.setString(COSName.N, "field header three");
fieldDict3.setInt(COSName.O, 3);

schemaDict.setItem("fieldone", fieldDict1); // field name (this is a key)
schemaDict.setItem("fieldtwo", fieldDict2);
schemaDict.setItem("fieldthree", fieldDict3);

// register the collection in the document catalog
doc.getDocumentCatalog().getCOSObject().setItem(COSName.COLLECTION, collectionDic);
doc.getDocumentCatalog().setVersion("1.7");

// collection item: the column values for this one file
COSDictionary ciDict1 = new COSDictionary();
ciDict1.setItem(COSName.TYPE, COSName.COLLECTION_ITEM);
// use the field names from earlier
ciDict1.setString("fieldone", "Very interesting file");
ciDict1.setString("fieldtwo", fs.getFile());
ciDict1.setInt("fieldthree", 333);
fs.getCOSObject().setItem(COSName.CI, ciDict1);
{code}

Use the latest snapshot in https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.24-SNAPSHOT/ to get the new constants.

Result file: [^collection.pdf]
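To illustrate how several collection items relate to the schema and sort settings in the comment above, here is a small PDFBox-free model. Each item is a plain `Map` keyed by the same field names ("fieldone", "fieldtwo", "fieldthree"), and the sort field and direction play the role of the sort dictionary's S and A entries. How a real viewer orders text values is not specified in this thread, so the plain string comparison used here is an assumption, and all class, method and file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model, not the PDFBox API.
class CollectionSortModel {

    /** Builds one collection item: the column values shown for a single embedded file. */
    static Map<String, Object> item(String description, String fileName, int number) {
        Map<String, Object> ci = new LinkedHashMap<>();
        ci.put("fieldone", description);
        ci.put("fieldtwo", fileName);
        ci.put("fieldthree", number);
        return ci;
    }

    /** Returns a sorted copy of the items, ordered by one schema field. */
    static List<Map<String, Object>> sorted(List<Map<String, Object>> items,
                                            String sortField, boolean ascending) {
        Comparator<Map<String, Object>> byField =
                (a, b) -> compareValues(a.get(sortField), b.get(sortField));
        if (!ascending) {
            byField = byField.reversed();
        }
        List<Map<String, Object>> copy = new ArrayList<>(items);
        copy.sort(byField);
        return copy;
    }

    /** Numbers compare numerically; everything else compares as text (an assumption). */
    private static int compareValues(Object a, Object b) {
        if (a instanceof Number && b instanceof Number) {
            return Double.compare(((Number) a).doubleValue(), ((Number) b).doubleValue());
        }
        return String.valueOf(a).compareTo(String.valueOf(b));
    }

    public static void main(String[] args) {
        List<Map<String, Object>> items = new ArrayList<>();
        items.add(item("Very interesting file", "report.txt", 333));
        items.add(item("Another file", "appendix.txt", 42));
        // Sort by "fieldtwo" ascending, like the sort dictionary in the comment above.
        for (Map<String, Object> ci : sorted(items, "fieldtwo", true)) {
            System.out.println(ci.get("fieldtwo"));
        }
    }
}
```

In the real document each such item would be a "ciDict" attached to its own PDComplexFileSpecification, as the comment describes; the model only shows why every item should use the field names declared in the schema.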
[jira] [Updated] (PDFBOX-5164) Create portable collection PDF
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5164:
------------------------------------
    Summary: Create portable collection PDF  (was: out version is already 2.0.23 . I want customize the colums myself in the red box use java.)

> Create portable collection PDF
> ------------------------------
>
>                 Key: PDFBOX-5164
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5164
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.18
>         Environment: java
>            Reporter: zhouxiaolong
>            Priority: Major
>             Fix For: 2.0.24, 4.0.0
>
>         Attachments: MakePackage.java, collection.pdf,
> image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf
>
>
> !image-2021-04-15-16-02-42-451.png!
[jira] [Commented] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323935#comment-17323935 ] ASF subversion and git services commented on PDFBOX-5164: - Commit 130 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r130 ] PDFBOX-5164: add constants for collections > out version is already 2.0.23 . I want customize the colums myself in the > red box use java. > - > > Key: PDFBOX-5164 > URL: https://issues.apache.org/jira/browse/PDFBOX-5164 > Project: PDFBox > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.18 > Environment: java >Reporter: zhouxiaolong >Priority: Major > Fix For: 2.0.24, 4.0.0 > > Attachments: MakePackage.java, collection.pdf, > image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf > > > !image-2021-04-15-16-02-42-451.png!
[jira] [Commented] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323934#comment-17323934 ] ASF subversion and git services commented on PDFBOX-5164: - Commit 129 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r129 ] PDFBOX-5164: add constants for collections > out version is already 2.0.23 . I want customize the colums myself in the > red box use java. > - > > Key: PDFBOX-5164 > URL: https://issues.apache.org/jira/browse/PDFBOX-5164 > Project: PDFBox > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.18 > Environment: java >Reporter: zhouxiaolong >Priority: Major > Fix For: 2.0.24, 4.0.0 > > Attachments: MakePackage.java, collection.pdf, > image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf > > > !image-2021-04-15-16-02-42-451.png!
[jira] [Updated] (PDFBOX-5164) out version is already 2.0.23 . I want customize the colums myself in the red box use java.
[ https://issues.apache.org/jira/browse/PDFBOX-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5164: Attachment: collection.pdf > out version is already 2.0.23 . I want customize the colums myself in the > red box use java. > - > > Key: PDFBOX-5164 > URL: https://issues.apache.org/jira/browse/PDFBOX-5164 > Project: PDFBox > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.18 > Environment: java >Reporter: zhouxiaolong >Priority: Major > Fix For: 2.0.24, 4.0.0 > > Attachments: MakePackage.java, collection.pdf, > image-2021-04-15-16-02-42-451.png, screenshot-1.png, viewfiles - 副本.pdf > > > !image-2021-04-15-16-02-42-451.png!
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5166: Issue Type: New Feature (was: Task) > Implement RichMedia annotation > -- > > Key: PDFBOX-5166 > URL: https://issues.apache.org/jira/browse/PDFBOX-5166 > Project: PDFBox > Issue Type: New Feature >Reporter: Tim Allison >Priority: Minor > Attachments: testFlashInPDF.pdf > > > See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not > currently extracting the embedded file. > In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the > COSDictionary, I can see the subtype is "RichMedia". If someone has the > time, it'd be great to implement this so that we can extract more attachments > in Tika... Obv, others may find use too. :D > Many thanks to Tyler Thorsted for the test file and many thanks to > @terminalboredom and @beet_keeper.
[jira] [Comment Edited] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831 ] Tim Allison edited comment on PDFBOX-5165 at 4/16/21, 1:52 PM: --- Thank you for the quick fix! Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue. was (Author: talli...@mitre.org): Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue. > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task > Components: JempBox >Affects Versions: 1.8.16 >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Trivial > Labels: optimization > Fix For: 1.8.17
[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323831#comment-17323831 ] Tim Allison commented on PDFBOX-5165: - Unless there are needs on other projects, we have no immediate need on the Tika side. Let's wait a bit to see if anything else falls out of the regression tests with PDFBox 3.0.0-SNAPSHOT. At some point, it would be great to have an updated jempbox for this issue and also for the rare date/time concurrency issue. > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task > Components: JempBox >Affects Versions: 1.8.16 >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Trivial > Labels: optimization > Fix For: 1.8.17
[jira] [Updated] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated PDFBOX-5166: Priority: Minor (was: Major) > Implement RichMedia annotation > -- > > Key: PDFBOX-5166 > URL: https://issues.apache.org/jira/browse/PDFBOX-5166 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Minor > Attachments: testFlashInPDF.pdf > > > See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not > currently extracting the embedded file. > In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the > COSDictionary, I can see the subtype is "RichMedia". If someone has the > time, it'd be great to implement this so that we can extract more attachments > in Tika... Obv, others may find use too. :D > Many thanks to Tyler Thorsted for the test file and many thanks to > @terminalboredom and @beet_keeper.
[jira] [Commented] (PDFBOX-5166) Implement RichMedia annotation
[ https://issues.apache.org/jira/browse/PDFBOX-5166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323809#comment-17323809 ] Tim Allison commented on PDFBOX-5166: - Completely unsurprisingly, [~tilman] has already shown how to extract these files on SO: https://stackoverflow.com/questions/45460027/what-is-the-best-way-to-extract-embedded-flash-file-from-a-pdf-using-the-pdfbox If this is a "not going to fix", no problem! I'm happy to put that code into Tika for now, and if a RichMedia annotation gets implemented in PDFBox, I can update our code accordingly. > Implement RichMedia annotation > -- > > Key: PDFBOX-5166 > URL: https://issues.apache.org/jira/browse/PDFBOX-5166 > Project: PDFBox > Issue Type: Task >Reporter: Tim Allison >Priority: Major > Attachments: testFlashInPDF.pdf > > > See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not > currently extracting the embedded file. > In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the > COSDictionary, I can see the subtype is "RichMedia". If someone has the > time, it'd be great to implement this so that we can extract more attachments > in Tika... Obv, others may find use too. :D > Many thanks to Tyler Thorsted for the test file and many thanks to > @terminalboredom and @beet_keeper.
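For readers who do not want to follow the link, here is a rough, library-free sketch of the kind of lookup involved. It is not the code from the Stack Overflow answer and it does not use PDFBox: plain `Map` and `List` objects stand in for PDF dictionaries and arrays, and the class name `RichMediaAssets` is made up for illustration. The key path it walks (`RichMediaContent`, then `Assets`, then the `Names` array of name/file-specification pairs, then `EF` and `F`) is my reading of the RichMedia annotation layout in Adobe's extension to PDF 1.7, so treat it as an assumption to check against the test file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in model: Maps play the role of PDF dictionaries, Lists the role of PDF arrays. */
final class RichMediaAssets {

    /** Returns asset name -> embedded bytes for one annotation; empty if it is not a RichMedia annotation. */
    static Map<String, byte[]> extract(Map<String, Object> annotation) {
        Map<String, byte[]> result = new LinkedHashMap<>();
        if (!"RichMedia".equals(annotation.get("Subtype"))) {
            return result;
        }
        Map<String, Object> content = asDict(annotation.get("RichMediaContent"));
        Map<String, Object> assets = asDict(content.get("Assets"));
        Object names = assets.get("Names");
        if (!(names instanceof List)) {
            return result; // a full name tree may also use Kids; this sketch ignores that case
        }
        List<?> pairs = (List<?>) names;
        // The Names array alternates: name, file specification, name, file specification, ...
        for (int i = 0; i + 1 < pairs.size(); i += 2) {
            Object name = pairs.get(i);
            Map<String, Object> fileSpec = asDict(pairs.get(i + 1));
            Object stream = asDict(fileSpec.get("EF")).get("F");
            if (name instanceof String && stream instanceof byte[]) {
                result.put((String) name, (byte[]) stream);
            }
        }
        return result;
    }

    /** Treats anything that is not a dictionary as an empty one, so missing keys never throw. */
    @SuppressWarnings("unchecked")
    private static Map<String, Object> asDict(Object o) {
        return o instanceof Map ? (Map<String, Object>) o : new LinkedHashMap<String, Object>();
    }
}
```

With PDFBox itself, the same walk would presumably be done on the annotation's COSDictionary, since the annotation currently surfaces only as a PDAnnotationUnknown.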
[jira] [Created] (PDFBOX-5166) Implement RichMedia annotation
Tim Allison created PDFBOX-5166: --- Summary: Implement RichMedia annotation Key: PDFBOX-5166 URL: https://issues.apache.org/jira/browse/PDFBOX-5166 Project: PDFBox Issue Type: Task Reporter: Tim Allison Attachments: testFlashInPDF.pdf See TIKA-3359. The attached file has an embedded Flash/swf file. Tika is not currently extracting the embedded file. In the debugger, I can see the Annotation as a PDAnnotationUnknown. In the COSDictionary, I can see the subtype is "RichMedia". If someone has the time, it'd be great to implement this so that we can extract more attachments in Tika... Obv, others may find use too. :D Many thanks to Tyler Thorsted for the test file and many thanks to @terminalboredom and @beet_keeper.
[jira] [Commented] (PDFBOX-5165) Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in JempBox
[ https://issues.apache.org/jira/browse/PDFBOX-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322688#comment-17322688 ] Maruan Sahyoun commented on PDFBOX-5165: [~tilman] thanks for taking care of that. Would we need to cut a release so that Tika can use the update to jempbox? > Exceedingly slow processing of XMPSchemaMediaManagement's getHistory in > JempBox > --- > > Key: PDFBOX-5165 > URL: https://issues.apache.org/jira/browse/PDFBOX-5165 > Project: PDFBox > Issue Type: Task > Components: JempBox >Affects Versions: 1.8.16 >Reporter: Tim Allison >Assignee: Tilman Hausherr >Priority: Trivial > Labels: optimization > Fix For: 1.8.17
[jira] [Issue Comment Deleted] (PDFBOX-5163) Stack overflow when reading a corrupt dictionary
[ https://issues.apache.org/jira/browse/PDFBOX-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5163: --- Comment: was deleted (was: Commit 114 from le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r114 ] PDFBOX-5163: remove unused class)
> Stack overflow when reading a corrupt dictionary
>
> Key: PDFBOX-5163
> URL: https://issues.apache.org/jira/browse/PDFBOX-5163
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.23, 3.0.0 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
> Attachments: crash_stack_overflow_sample.pdf
>
> Richard Smith/Chaoyuan Peng reported an issue with the current version 2.0.23. When parsing a carefully handcrafted pdf the following exception occurs and PDFBox crashes:
> {code}
> java.lang.StackOverflowError: null
> java.util.WeakHashMap.eq(Unknown Source)
> java.util.WeakHashMap.get(Unknown Source)
> java.util.Collections$SynchronizedMap.get(Unknown Source)
> org.apache.pdfbox.debugger.ui.LogDialog.log(LogDialog.java:143)
> org.apache.pdfbox.debugger.ui.DebugLog.warn(DebugLog.java:156)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:933)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:859)
> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
> org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1060)
> {code}
[jira] [Commented] (PDFBOX-5143) Refactor/Simplify CFF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-5143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322646#comment-17322646 ] Andreas Lehmkühler commented on PDFBOX-5143: Commit 114 from le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r114 ] PDFBOX-5143: remove unused class > Refactor/Simplify CFF parsing > - > > Key: PDFBOX-5143 > URL: https://issues.apache.org/jira/browse/PDFBOX-5143 > Project: PDFBox > Issue Type: Improvement > Components: FontBox >Affects Versions: 3.0.0 PDFBox >Reporter: Andreas Lehmkühler >Assignee: Andreas Lehmkühler >Priority: Major > > The classes used for the parsing of CFF-based fonts have some room for > improvements w.r.t. the memory footprint, the complexity of the code and the > test coverage.
[jira] [Commented] (PDFBOX-5163) Stack overflow when reading a corrupt dictionary
[ https://issues.apache.org/jira/browse/PDFBOX-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322639#comment-17322639 ] ASF subversion and git services commented on PDFBOX-5163: - Commit 114 from le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r114 ] PDFBOX-5163: remove unused class
> Stack overflow when reading a corrupt dictionary
>
> Key: PDFBOX-5163
> URL: https://issues.apache.org/jira/browse/PDFBOX-5163
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.23, 3.0.0 PDFBox
> Reporter: Andreas Lehmkühler
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
> Attachments: crash_stack_overflow_sample.pdf
>
> Richard Smith/Chaoyuan Peng reported an issue with the current version 2.0.23. When parsing a carefully handcrafted pdf the following exception occurs and PDFBox crashes:
> {code}
> java.lang.StackOverflowError: null
> java.util.WeakHashMap.eq(Unknown Source)
> java.util.WeakHashMap.get(Unknown Source)
> java.util.Collections$SynchronizedMap.get(Unknown Source)
> org.apache.pdfbox.debugger.ui.LogDialog.log(LogDialog.java:143)
> org.apache.pdfbox.debugger.ui.DebugLog.warn(DebugLog.java:156)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:933)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:859)
> org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
> org.apache.pdfbox.pdfparser.COSParser.getLength(COSParser.java:1060)
> {code}
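The trace in this issue shows the dictionary parser calling back into itself (parseDirObject, parseCOSDictionary, parseCOSDictionaryValue, parseDirObject again). One general way to keep a recursive-descent parser from overflowing the stack on hostile input is to cap the nesting depth. The sketch below illustrates that idea on a toy `<< ... >>` grammar; it is not PDFBox code and not necessarily how PDFBOX-5163 was fixed. The class name `DepthLimitedParser` and the limit of 100 are assumptions chosen for illustration.

```java
/** Hypothetical toy parser for nested "<< ... >>" dictionaries; not PDFBox code. */
final class DepthLimitedParser {
    /** Assumed nesting limit, picked only for this illustration. */
    static final int MAX_DEPTH = 100;

    private final String input;
    private int pos;

    private DepthLimitedParser(String input) {
        this.input = input;
    }

    /** Parses one dictionary and returns its deepest nesting level (1 for a flat dictionary). */
    static int maxNesting(String input) {
        return new DepthLimitedParser(input).parseDictionary(1);
    }

    private int parseDictionary(int depth) {
        // Fail with an ordinary exception long before the JVM stack is exhausted.
        if (depth > MAX_DEPTH) {
            throw new IllegalArgumentException("dictionary nesting deeper than " + MAX_DEPTH);
        }
        expect("<<");
        int deepest = depth;
        while (pos < input.length() && !input.startsWith(">>", pos)) {
            if (input.startsWith("<<", pos)) {
                deepest = Math.max(deepest, parseDictionary(depth + 1));
            } else {
                pos++; // a real parser would read names and values here
            }
        }
        expect(">>");
        return deepest;
    }

    private void expect(String token) {
        if (!input.startsWith(token, pos)) {
            throw new IllegalArgumentException("expected '" + token + "' at offset " + pos);
        }
        pos += token.length();
    }
}
```

Calling `DepthLimitedParser.maxNesting("<< /A << /B 1 >> >>")` returns 2, while an input made of thousands of unclosed `<<` tokens is rejected with an IllegalArgumentException that the caller can catch and report, rather than a StackOverflowError.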
Re: Apache PDFBox Board Report April 2021 due
On 14.04.21 at 19:30, Tilman Hausherr wrote:
> +1
>
> You could mention that there is an instructor at Wright State University who uses PDFBox in his class as a starting point:
> https://github.com/erikbuck/pdfbox/blob/patch-1/SRS%20(Requirements%20Document)
> https://github.com/erikbuck/pdfbox/blob/patch-1/Software%20Design%20Document

Thanks, good point. I've added that detail to the report.

Andreas

> Tilman
>
> On 14.04.2021 at 08:29, Andreas Lehmkuehler wrote:
>> Hi,
>>
>> find attached a quick draft of the board report we're expected to submit this month. It's based on the report wizard template, which can be found at [1].
>>
>> Any comments or additions are appreciated ...
>>
>> ## Description:
>> The mission of PDFBox is the creation and maintenance of software related to a Java library for working with PDF documents.
>>
>> ## Issues:
>> There are no issues requiring board attention at this time.
>>
>> Some bugs were reported via secur...@apache.org, and two of them resulted in CVEs. Both were solved in 2.0.23.
>> - CVE-2021-27906 Apache PDFBox: a carefully crafted PDF file can trigger an OutOfMemory-Exception while loading the file
>> - CVE-2021-27807 Apache PDFBox: a carefully crafted PDF file can trigger an infinite loop while loading the file
>>
>> Credit goes to Fabian Meumertzheim, who found these issues while working on OSS-Fuzz.
>>
>> ## Membership Data:
>> Apache PDFBox was founded 2009-10-21 (11 years ago).
>> There are currently 21 committers and 21 PMC members in this project.
>> The Committer-to-PMC ratio is 1:1.
>>
>> Community changes, past quarter:
>> - No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
>> - No new committers. Last addition was Joerg O. Henne on 2017-10-09.
>>
>> ## Project Activity:
>> Recent releases:
>> 2.0.23 was released on 2021-03-18.
>> 2.0.22 was released on 2020-12-19.
>> 2.0.21 was released on 2020-08-20.
>>
>> ## Community Health:
>> - there is a steady stream of contributions, bug reports and questions on the mailing lists
>> - there are a lot of refactorings, improvements and bugfixes
>> - the first alpha version of the upcoming new major release 3.0.0 was released
>> - some of the downstream projects have already started to integrate the new release into their codebases. The feedback is positive so far.
>>
>> Andreas
>>
>> [1] https://reporter.apache.org/wizard/?pdfbox