[jira] [Commented] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
[ https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241581#comment-16241581 ] ASF subversion and git services commented on PDFBOX-3994: - Commit 1814460 from [~lehmi] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1814460 ] PDFBOX-3994: avoid ClassCastException > ClassCastException in COSParser.bfSearchForTrailer > -- > > Key: PDFBOX-3994 > URL: https://issues.apache.org/jira/browse/PDFBOX-3994 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.7, 2.0.8 >Reporter: Tilman Hausherr >Assignee: Andreas Lehmkühler > Labels: regression > Fix For: 2.0.9, 3.0.0 > > Attachments: PDFBOX-3994-A5YEWC5MQKVVX4ZQI5ZZAND6A2O3FPXM.pdf > > > {code} > java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast > to org.apache.pdfbox.cos.COSObject > > org.apache.pdfbox.pdfparser.COSParser.bfSearchForTrailer(COSParser.java:1668) > org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:2110) > org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:246) > org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:189) > org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240) > {code} > This worked in 2.0.6. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
[ https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler resolved PDFBOX-3994. Resolution: Fixed I've added the missing class check.
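For readers unfamiliar with this kind of fix: the stack trace reports a COSInteger being cast to COSObject, and the resolution mentions a missing class check. The following is only a minimal sketch of that general pattern, using hypothetical stand-in classes (`BaseItem`, `IntItem`, `RefItem`) rather than the PDFBox types; it does not reproduce the actual commit.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the PDFBox classes.
class BaseItem { }

class IntItem extends BaseItem {
    final long value;
    IntItem(long value) { this.value = value; }
}

class RefItem extends BaseItem {
    final long objectNumber;
    RefItem(long objectNumber) { this.objectNumber = objectNumber; }
}

class ClassCheckSketch {
    // Returns the object number if the entry is a reference, or -1 otherwise.
    // An unchecked cast "(RefItem) entry" would throw ClassCastException
    // when the entry happens to be an IntItem.
    static long objectNumberOrMinusOne(BaseItem entry) {
        if (entry instanceof RefItem) {
            return ((RefItem) entry).objectNumber;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<BaseItem> entries = Arrays.asList(new RefItem(7), new IntItem(42));
        for (BaseItem entry : entries) {
            System.out.println(objectNumberOrMinusOne(entry));
        }
    }
}
```

Checking the type before the cast lets a parser skip an unexpected entry in a damaged file instead of aborting.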
[jira] [Commented] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
[ https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241580#comment-16241580 ] ASF subversion and git services commented on PDFBOX-3994: Commit 1814459 from [~lehmi] in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1814459 ] PDFBOX-3994: avoid ClassCastException
[jira] [Closed] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-3993.
Resolution: Cannot Reproduce

Ok, I am closing the issue for now. You can reopen it if you can reproduce the problem. Make sure that your server isn't busy with something else that slows everything down. And don't forget to update to 2.0.8.

> PDFTextStripper.writeText is slow
>
> Key: PDFBOX-3993
> URL: https://issues.apache.org/jira/browse/PDFBOX-3993
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.4
> Reporter: Ace
> Attachments: sample-pdf.pdf
>
> I see this problem has been posted about two years before, and the poster said it has been fixed in 2.0.0 version, but now I face this problem when I try to parse a pdf file with image in it. I also add logs before and after PDFTextStripper.writeText to mark the time spend, it looks like it takes quite long to writeText to a StringWriter (sometimes 15s+ for a 220K pdf file with image in it). The running environment is in our production server but not local.
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241003#comment-16241003 ] Ace commented on PDFBOX-3993: OK, when I run it locally, it also takes only about a second to finish, but when I run it on the remote server, the slowdown sometimes appears. I will keep trying to reproduce this issue.
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240920#comment-16240920 ] Tilman Hausherr commented on PDFBOX-3993: Here's the code I used:
{code}
try (PDDocument doc = PDDocument.load(new File("sample-pdf.pdf")))
{
    PDFTextStripper stripper = new PDFTextStripper();
    StringWriter sw = new StringWriter();
    stripper.writeText(doc, sw);
    System.out.println(sw.getBuffer());
}
{code}
It runs in a second.
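To compare timings between a local machine and a server, the call itself can be timed in isolation. This is a minimal sketch under stated assumptions: `TimingSketch`, `ThrowingTask` and `timeMillis` are hypothetical names, and the workload in `main` is a stand-in. With PDFBox on the classpath, the lambda would wrap `stripper.writeText(doc, sw)` instead.

```java
import java.util.concurrent.TimeUnit;

class TimingSketch {
    // Hypothetical functional interface so that tasks may throw checked
    // exceptions such as IOException.
    interface ThrowingTask {
        void run() throws Exception;
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    // Any exception from the task is rethrown wrapped in IllegalStateException.
    static long timeMillis(ThrowingTask task) {
        long start = System.nanoTime();
        try {
            task.run();
        } catch (Exception e) {
            throw new IllegalStateException("timed task failed", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the text extraction call to be measured.
        long elapsed = timeMillis(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

A single measurement on a busy server can be misleading, so repeating the run several times and comparing the values is usually more informative.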
[jira] [Updated] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
[ https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-3994: Fix Version/s: 3.0.0, 2.0.9
[jira] [Assigned] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
[ https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-3994: Assignee: Andreas Lehmkühler
[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Attachment: (was: sample-pdf.png)
[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Attachment: sample-pdf.pdf OK, got it!
[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Attachment: (was: sample-pdf.pdf.png)
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240869#comment-16240869 ] Tilman Hausherr commented on PDFBOX-3993: You can remove attachments and comments yourself: click on the garbage bin symbol. To attach a file, click on "More" (in the area that has "edit", "comment", "assign", "more", "resolve issue", "close issue"), then "Attach files".
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240861#comment-16240861 ] Ace commented on PDFBOX-3993: I don't know why I cannot upload it as a PDF, but if I remove the .png from the file name, it is a PDF. Can you please try removing .png from the file I just uploaded (the "sample-pdf.pdf.png")?
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240859#comment-16240859 ] Ace commented on PDFBOX-3993: !sample-pdf.pdf|thumbnail!
[jira] [Issue Comment Deleted] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Comment: was deleted (was: !sample-pdf.pdf|thumbnail!)
[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Attachment: sample-pdf.pdf.png
[jira] [Issue Comment Deleted] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Comment: was deleted (was: !sample-pdf.pdf!)
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240857#comment-16240857 ] Ace commented on PDFBOX-3993: !sample-pdf.pdf!
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240856#comment-16240856 ] Ace commented on PDFBOX-3993: Oh, it is a PDF on my local machine; I copied and pasted it into the comment field and then it was converted to a PNG...
[jira] [Created] (PDFBOX-3996) Add option to automatically create just one file with Splitter
Gilad Denneboom created PDFBOX-3996:

Summary: Add option to automatically create just one file with Splitter
Key: PDFBOX-3996
URL: https://issues.apache.org/jira/browse/PDFBOX-3996
Project: PDFBox
Issue Type: Improvement
Components: Utilities
Affects Versions: 2.0.8
Reporter: Gilad Denneboom
Priority: Minor

When using Splitter, its default behavior is to extract each page in the selected range as a separate PDDocument, but often one wants to extract that range as a single document. In order to do that you currently need to use the setSplitAtPage method and set it to the number of pages in your range, which is a bit silly. It would be better if there was an option to do that automatically, for example by setting the splitAtPage property to 0, or something like that, and have it automatically calculate the number of pages based on the startPage and endPage values.
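The calculation the report asks Splitter to do automatically can be sketched as follows. `rangeLength` is a hypothetical helper, not part of PDFBox. The commented lines only indicate where the value would be passed to the PDFBox 2.0 `Splitter` setters, and whether this yields exactly one document for every start page may depend on the PDFBox version.

```java
class SplitRangeSketch {
    // Hypothetical helper: number of pages in the inclusive, 1-based
    // range [startPage, endPage].
    static int rangeLength(int startPage, int endPage) {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException(
                    "invalid page range: " + startPage + ".." + endPage);
        }
        return endPage - startPage + 1;
    }

    public static void main(String[] args) {
        int startPage = 3;
        int endPage = 7;
        int splitAtPage = rangeLength(startPage, endPage);
        // With PDFBox on the classpath, the values would be applied roughly like this
        // (assumption: behaviour as described in the report):
        //   Splitter splitter = new Splitter();
        //   splitter.setStartPage(startPage);
        //   splitter.setEndPage(endPage);
        //   splitter.setSplitAtPage(splitAtPage);
        //   List<PDDocument> parts = splitter.split(document);
        System.out.println("splitAtPage = " + splitAtPage);
    }
}
```

The proposed option would move this arithmetic into Splitter itself, so callers would only set startPage and endPage.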
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240820#comment-16240820 ] Tilman Hausherr commented on PDFBOX-3993: What you uploaded is a .png file.
[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ace updated PDFBOX-3993: Attachment: sample-pdf.png
Re: Running tika-eval on the Rackspace vm
I think I was successful; the report now makes sense, as if Tim had created it himself :-) The two issues I just created are related to a comparison between 2.0.8 and 2.0.4. So for the next board report, we can now (in addition to the existing text) say that there is a second committer who can run the tests.

Tilman

On 05.11.2017 at 22:06, Tilman Hausherr wrote:
I've come closer to finding out what's happening. I found out that tika-app was running with PDFBox 2.0.7 all the time, regardless of which pdfbox version is in the pom.xml. Apparently, building tika-app uses tika-parsers from the repository (instead of building tika-parsers again), which needs 2.0.7. Explicitly building tika-parsers before building tika-app helps. This is new to me; in PDFBox, if one builds the app, all dependencies are built as well.

Tilman

On 04.11.2017 at 14:48, Tilman Hausherr wrote:
So it's done: /work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017 I wonder why the differences are so few, especially in meta where I KNOW that there are differences, due to the handling of empty strings with BOM. Maybe it is because I skipped the "A" phase and used existing data from a 2.0.4 run that I found, or because I use a current tika trunk and not the existing binary that was on the server. I'm thinking of creating a new "A" with 2.0.4 with current tika trunk and then comparing it with the "B" I did.

Tilman

On 03.11.2017 at 22:14, Tilman Hausherr wrote:
On 03.11.2017 at 21:38, Allison, Timothy B. wrote:
I'm not sure what you mean by...sorry - "H" is missing, which is identical to "C" I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing. Of course it is obvious that it has to be done, but I am a perfectionist. I'd like to have this documentation for the "me" in a few months when I have forgotten what I did the last days. Or for the next person. Thanks for the fixes you did. I wonder why writing to /tmp didn't work - it did work from the command line. I've started the command again, I'm not sure when I will report about it. I'm a bit exhausted from non-software activities :-(

Tilman
[jira] [Created] (PDFBOX-3995) IllegalArgumentException: root cannot be null with truncated file (3)
Tilman Hausherr created PDFBOX-3995:

Summary: IllegalArgumentException: root cannot be null with truncated file (3)
Key: PDFBOX-3995
URL: https://issues.apache.org/jira/browse/PDFBOX-3995
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.8
Reporter: Tilman Hausherr
Attachments: PDFBOX-3995-HU47TMUIW72SB4WR2UBJWJ4CY5IFCR3X.pdf

{code}
java.lang.IllegalArgumentException: root cannot be null
    org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)
    org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
    org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1389)
{code}
[jira] [Created] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer
Tilman Hausherr created PDFBOX-3994:
------------------------------------

Summary: ClassCastException in COSParser.bfSearchForTrailer
Key: PDFBOX-3994
URL: https://issues.apache.org/jira/browse/PDFBOX-3994
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.8, 2.0.7
Reporter: Tilman Hausherr
Attachments: PDFBOX-3994-A5YEWC5MQKVVX4ZQI5ZZAND6A2O3FPXM.pdf

{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSObject
    org.apache.pdfbox.pdfparser.COSParser.bfSearchForTrailer(COSParser.java:1668)
    org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:2110)
    org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:246)
    org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:189)
    org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
{code}

This worked in 2.0.6.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
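The trace above shows an unconditional cast failing because the value was a COSInteger where a COSObject was expected. A minimal sketch of the usual guard (check the runtime type before casting) might look like the following. The nested classes are hypothetical stand-ins, not the PDFBox COS classes, and this is not the actual change made in COSParser.

```java
import java.util.Optional;

final class GuardedCastSketch {
    private GuardedCastSketch() {}

    // Hypothetical stand-ins for a small value hierarchy; NOT the PDFBox classes.
    abstract static class StubBase {}

    static final class StubInteger extends StubBase {
        final long value;
        StubInteger(long value) { this.value = value; }
    }

    static final class StubObjectRef extends StubBase {
        final long objectNumber;
        StubObjectRef(long objectNumber) { this.objectNumber = objectNumber; }
    }

    // Returns the object number only when the entry really is a reference;
    // any other type yields an empty result instead of a ClassCastException.
    static Optional<Long> objectNumberOf(StubBase entry) {
        if (entry instanceof StubObjectRef) {
            return Optional.of(((StubObjectRef) entry).objectNumber);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(objectNumberOf(new StubObjectRef(12)));
        System.out.println(objectNumberOf(new StubInteger(7)));
    }
}
```

With a damaged or truncated file, a parser cannot assume which type an entry has, so an empty result lets the caller skip the entry rather than abort.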
[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow
[ https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240731#comment-16240731 ]

Tilman Hausherr commented on PDFBOX-3993:
-----------------------------------------

The current version is 2.0.8. Please attach the PDF.

> PDFTextStripper.writeText is slow
> ---------------------------------
>
> Key: PDFBOX-3993
> URL: https://issues.apache.org/jira/browse/PDFBOX-3993
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.4
> Reporter: Ace
>
> I see this problem was reported about two years ago, and the poster
> said it had been fixed in version 2.0.0, but I now face it when I
> try to parse a PDF file with an image in it. I also added logs before and after
> PDFTextStripper.writeText to measure the time spent; it looks like it takes
> quite long to writeText to a StringWriter (sometimes 15s+ for a 220K PDF file
> with an image in it). The running environment is our production server,
> not a local machine.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-3993) PDFTextStripper.writeText is slow
Ace created PDFBOX-3993:
------------------------

Summary: PDFTextStripper.writeText is slow
Key: PDFBOX-3993
URL: https://issues.apache.org/jira/browse/PDFBOX-3993
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 2.0.4
Reporter: Ace

I see this problem was reported about two years ago, and the poster said it had been fixed in version 2.0.0, but I now face it when I try to parse a PDF file with an image in it. I also added logs before and after PDFTextStripper.writeText to measure the time spent; it looks like it takes quite long to writeText to a StringWriter (sometimes 15s+ for a 220K PDF file with an image in it). The running environment is our production server, not a local machine.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
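The measurement described in the report (a timestamp before and after the call that writes to a StringWriter) can be sketched as follows. This is only an illustration under assumptions: `TextWriterCall` and `timeMillis` are hypothetical names, and the lambda in `main` stands in for the real text-extraction call, which is not reproduced here.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.concurrent.TimeUnit;

final class WriteTimingSketch {
    private WriteTimingSketch() {}

    /** Hypothetical stand-in for any call that writes extracted text to a Writer. */
    interface TextWriterCall {
        void writeTo(Writer out) throws IOException;
    }

    /** Runs the call once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(TextWriterCall call, Writer out) {
        long start = System.nanoTime(); // monotonic clock, suited to measuring intervals
        try {
            call.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        // Placeholder workload; the real extraction call would go here.
        long elapsed = timeMillis(out -> out.write("extracted text"), sink);
        System.out.println("write took " + elapsed + " ms for "
                + sink.getBuffer().length() + " chars");
    }
}
```

A single measurement includes one-time costs such as class loading, so timing the same call a second time in the same JVM helps separate start-up overhead from the steady-state cost.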
[jira] [Commented] (PDFBOX-3988) Performance issue when rendering first page of PDF
[ https://issues.apache.org/jira/browse/PDFBOX-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240629#comment-16240629 ]

Tilman Hausherr commented on PDFBOX-3988:
-----------------------------------------

I turned on the profiler again and saw that LCMS was used and that color spaces took a long time to initialize (and LCMS is slow anyway). So insert all of these after the "main" line:

{code}
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"); // should be first
PDFont font = PDType1Font.COURIER;                  // touch a font early
PDDeviceCMYK.INSTANCE.toRGB(new float[]{0,0,0,0});  // initialize the CMYK color space
PDDeviceRGB.INSTANCE.toRGB(new float[]{0,0,0});     // initialize the RGB color space
{code}

The property is mentioned in "getting started": https://pdfbox.apache.org/2.0/getting-started.html

Also change your pom to use the current version, which is 2.0.8 and not 2.0.6.

I didn't notice this the first time because I had tested with PDFDebugger instead of your project, and my local version had the property set. Mea culpa!

> Performance issue when rendering first page of PDF
> --------------------------------------------------
>
> Key: PDFBOX-3988
> URL: https://issues.apache.org/jira/browse/PDFBOX-3988
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering
> Affects Versions: 2.0.7
> Reporter: Maciej Matecki
> Attachments: calls.png
>
> Let's say that you want to generate a PNG for every page in a PDF.
> Generating the first page is really slow; the second one is quite fast.
> For example, the test PDF contains two identical pages.
> First page renders in: ~2000 ms
> Second one: ~220 ms
> It looks like for the first page (inv.0) there's a 359 ms overhead just for
> creating the font.
> [^calls.png]
> I tried another library to perform the same operation with the same PDF
> file and was able to retrieve a BufferedImage of the first (slower) page in
> ~750 ms.
> For the second page the result was almost the same as for PDFBox.
> It looks like there's room to improve performance when starting to
> render a PDF page.
> Or do you have any advice on how to improve performance in the example project?
> Test project: https://github.com/mmatecki/pdfboxperformance

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
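The comment above asks for the `sun.java2d.cmm` property to be set first. A minimal sketch of one way to arrange that, using only the JDK, is shown below. The class and method names are hypothetical, and the choice to leave an existing value (for example one passed with `-D` on the command line) untouched is an assumption of this sketch, not something the comment prescribes.

```java
final class CmmPropertySketch {
    private CmmPropertySketch() {}

    // Property name and provider class are the ones quoted in the comment above.
    static final String CMM_KEY = "sun.java2d.cmm";
    static final String KCMS = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    /**
     * Sets the property unless it is already configured,
     * and returns the value that is in effect afterwards.
     */
    static String selectKcms() {
        if (System.getProperty(CMM_KEY) == null) {
            System.setProperty(CMM_KEY, KCMS);
        }
        return System.getProperty(CMM_KEY);
    }

    public static void main(String[] args) {
        // First statement of main, before any font, color, or rendering class is touched.
        String effective = selectKcms();
        System.out.println(CMM_KEY + " = " + effective);
        // The warm-up calls from the comment would follow here.
    }
}
```

Keeping the call as the first statement of `main` follows the "should be first" remark; whether the named provider is available depends on the JDK in use.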
[jira] [Commented] (PDFBOX-3988) Performance issue when rendering first page of PDF
[ https://issues.apache.org/jira/browse/PDFBOX-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240143#comment-16240143 ]

Maciej Matecki commented on PDFBOX-3988:
----------------------------------------

Thanks. That allows me to save a few hundred ms. There's still a big difference between the first and the second page:

{code}
Page 0 rendered in 1204 ms
Page 1 rendered in 149 ms
{code}

The above is for the PDF file where the two pages are the same. So let's say that something is cached and the second page is therefore much faster. However, if I test a PDF where each page is different, the results look like this:

{code}
Page 0 rendered in 1172 ms
Page 1 rendered in 763 ms
Page 2 rendered in 208 ms
Page 3 rendered in 56 ms
Page 4 rendered in 78 ms
Page 5 rendered in 161 ms
{code}

The interesting thing: in the last commit to my small benchmark project, two PDFs are rendered:

{code}
try (PDDocument document = getPDFDocument("/test-pdf-18.pdf")) {
    renderPages(document, dpi);
    System.out.println("---");
} catch (Exception e) {
    System.out.println("Ups. " + e.getMessage());
}
try (PDDocument document = getPDFDocument("/test-2p.pdf")) {
    renderPages(document, dpi);
    System.out.println("---");
} catch (Exception e) {
    System.out.println("Ups. " + e.getMessage());
}
{code}

And the results look like this:

{code}
Page 0 rendered in 1083 ms
Page 1 rendered in 718 ms
Page 2 rendered in 185 ms
Page 3 rendered in 64 ms
Page 4 rendered in 82 ms
Page 5 rendered in 164 ms
Page 6 rendered in 1216 ms
Page 7 rendered in 184 ms
Page 8 rendered in 104 ms
Page 9 rendered in 99 ms
Page 10 rendered in 112 ms
---
Page 0 rendered in 123 ms
Page 1 rendered in 113 ms
---
{code}

So the first page of the second PDF is rendered really quickly. Is there something more that could be initialized earlier so that later results are better?

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org