[jira] [Commented] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241581#comment-16241581
 ] 

ASF subversion and git services commented on PDFBOX-3994:
-

Commit 1814460 from [~lehmi] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1814460 ]

PDFBOX-3994: avoid ClassCastException

> ClassCastException in COSParser.bfSearchForTrailer
> --
>
> Key: PDFBOX-3994
> URL: https://issues.apache.org/jira/browse/PDFBOX-3994
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.7, 2.0.8
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>  Labels: regression
> Fix For: 2.0.9, 3.0.0
>
> Attachments: PDFBOX-3994-A5YEWC5MQKVVX4ZQI5ZZAND6A2O3FPXM.pdf
>
>
> {code}
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
> to org.apache.pdfbox.cos.COSObject
> 
> org.apache.pdfbox.pdfparser.COSParser.bfSearchForTrailer(COSParser.java:1668)
> org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:2110)
> org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:246)
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:189)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
> {code}
> This worked in 2.0.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Resolved] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-3994.

Resolution: Fixed

I've added the missing class check.
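
For readers unfamiliar with this kind of fix, here is a minimal sketch of what a class check before a cast looks like in general. It does not reproduce the actual commit: the types `Value`, `IntValue` and `ObjectRef` and the method `referencedObjectNumber` are hypothetical stand-ins, not PDFBox classes.

```java
// Hypothetical stand-ins for a parsed value hierarchy; these are NOT the PDFBox classes.
final class ClassCheckSketch {
    interface Value {}

    static final class IntValue implements Value {
        final long value;
        IntValue(long value) { this.value = value; }
    }

    static final class ObjectRef implements Value {
        final long objectNumber;
        ObjectRef(long objectNumber) { this.objectNumber = objectNumber; }
    }

    // Returns the object number if the entry is a reference, or -1 for any other type
    // (including null), instead of casting blindly and risking a ClassCastException.
    static long referencedObjectNumber(Value entry) {
        if (entry instanceof ObjectRef) {
            return ((ObjectRef) entry).objectNumber;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(referencedObjectNumber(new ObjectRef(12)));
        System.out.println(referencedObjectNumber(new IntValue(7)));
    }
}
```

With an unchecked cast, the second call would fail in the same way as the stack trace in this issue; with the check, the unexpected type is simply skipped.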




[jira] [Commented] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241580#comment-16241580
 ] 

ASF subversion and git services commented on PDFBOX-3994:
-

Commit 1814459 from [~lehmi] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1814459 ]

PDFBOX-3994: avoid ClassCastException




[jira] [Closed] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-3993.
---
Resolution: Cannot Reproduce

Ok, I am closing the issue for now. You can reopen it if you can reproduce the 
problem. Make sure that your server isn't busy with something else that slows 
everything down.

And don't forget to update to 2.0.8.

> PDFTextStripper.writeText is slow
> -
>
> Key: PDFBOX-3993
> URL: https://issues.apache.org/jira/browse/PDFBOX-3993
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4
>Reporter: Ace
> Attachments: sample-pdf.pdf
>
>
> I see this problem was reported about two years ago, and the reporter 
> said it was fixed in version 2.0.0, but I now face it when I 
> try to parse a PDF file with an image in it. I added logging before and after 
> PDFTextStripper.writeText to measure the time spent; writing the text 
> to a StringWriter takes quite long (sometimes 15s+ for a 220K PDF file 
> with an image in it). This happens on our production server, 
> not locally. 






[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241003#comment-16241003
 ] 

Ace commented on PDFBOX-3993:
-

OK, when I run it locally it also takes only about a second to finish, but 
when I run it on the remote server it is sometimes slow. I will keep 
trying to reproduce the issue.




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240920#comment-16240920
 ] 

Tilman Hausherr commented on PDFBOX-3993:
-

Here's the code I used:
{code}
try (PDDocument doc = PDDocument.load(new File("sample-pdf.pdf")))
{
    PDFTextStripper stripper = new PDFTextStripper();
    StringWriter sw = new StringWriter();
    stripper.writeText(doc, sw);
    System.out.println(sw.getBuffer());
}
{code}
It runs in a second.
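
To compare local and server runs, the call can be wrapped in a small timing helper. This is only a sketch: `TimingSketch`, `ThrowingTask` and `elapsedMillis` are hypothetical names, and the workload in `main` is a placeholder for the `stripper.writeText(doc, sw)` call above, so that the example has no PDFBox dependency.

```java
import java.util.concurrent.TimeUnit;

final class TimingSketch {
    // Hypothetical functional interface so that tasks throwing checked exceptions
    // (such as IOException) can be timed without extra boilerplate at the call site.
    interface ThrowingTask {
        void run() throws Exception;
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    // Any exception from the task is wrapped in an IllegalStateException.
    static long elapsedMillis(ThrowingTask task) {
        long start = System.nanoTime();
        try {
            task.run();
        } catch (Exception e) {
            throw new IllegalStateException("timed task failed", e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> {
            // Placeholder workload; in the thread this would be the writeText call.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("elapsed ms: " + ms);
    }
}
```

Logging this value for several runs on both machines would show whether the slowness is tied to the file or to the server's load.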




[jira] [Updated] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-3994:
---
Fix Version/s: 3.0.0
   2.0.9




[jira] [Assigned] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-3994:
--

Assignee: Andreas Lehmkühler




[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Attachment: (was: sample-pdf.png)




[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Attachment: sample-pdf.pdf

OK, got it!




[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Attachment: (was: sample-pdf.pdf.png)




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240869#comment-16240869
 ] 

Tilman Hausherr commented on PDFBOX-3993:
-

You can remove attachments and comments yourself: click on the garbage bin 
symbol.

To attach a file, click on "More" (in the area that has "edit", "comment", 
"assign", "more", "resolve issue", "close issue"), then "Attach files".




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240861#comment-16240861
 ] 

Ace commented on PDFBOX-3993:
-

I don't know why I cannot upload it as a PDF, but when I remove the .png from 
the file name it is a PDF. Can you please try removing .png from the file 
I just uploaded ("sample-pdf.pdf.png")? 




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240859#comment-16240859
 ] 

Ace commented on PDFBOX-3993:
-

!sample-pdf.pdf|thumbnail!




[jira] [Issue Comment Deleted] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Comment: was deleted

(was: !sample-pdf.pdf|thumbnail!)




[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Attachment: sample-pdf.pdf.png




[jira] [Issue Comment Deleted] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Comment: was deleted

(was: !sample-pdf.pdf!)




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240857#comment-16240857
 ] 

Ace commented on PDFBOX-3993:
-

!sample-pdf.pdf!




[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240856#comment-16240856
 ] 

Ace commented on PDFBOX-3993:
-

Oh, it is a PDF on my local machine. I copied and pasted it into the comment 
field and then it was converted to a PNG...




[jira] [Created] (PDFBOX-3996) Add option to automatically create just one file with Splitter

2017-11-06 Thread Gilad Denneboom (JIRA)
Gilad Denneboom created PDFBOX-3996:
---

 Summary: Add option to automatically create just one file with 
Splitter
 Key: PDFBOX-3996
 URL: https://issues.apache.org/jira/browse/PDFBOX-3996
 Project: PDFBox
  Issue Type: Improvement
  Components: Utilities
Affects Versions: 2.0.8
Reporter: Gilad Denneboom
Priority: Minor


When using Splitter, its default behavior is to extract each page in the 
selected range as a separate PDDocument, but often one wants to extract that 
range as a single document. In order to do that you currently need to use the 
setSplitAtPage method and set it to the number of pages in your range, which is 
a bit silly. It would be better if there was an option to do that 
automatically, for example by setting the splitAtPage property to 0, or 
something like that, and have it automatically calculate the number of pages 
based on the startPage and endPage values.
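
The calculation being asked for is small. As a sketch, assuming startPage and endPage are 1-based and inclusive (an assumption, not confirmed in this report), the value one currently has to pass to setSplitAtPage by hand would be computed like this. `SplitRangeSketch` and `pagesInRange` are hypothetical names, not part of Splitter:

```java
final class SplitRangeSketch {
    // Number of pages in the inclusive, 1-based range [startPage, endPage].
    // This is the value that would have to be used as splitAtPage so that the
    // whole range ends up in a single document.
    static int pagesInRange(int startPage, int endPage) {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException(
                    "invalid range: " + startPage + " to " + endPage);
        }
        return endPage - startPage + 1;
    }

    public static void main(String[] args) {
        // Pages 3, 4, 5, 6 and 7 are five pages.
        System.out.println(pagesInRange(3, 7));
    }
}
```

A sentinel such as splitAtPage = 0, as suggested above, would let Splitter apply this calculation internally instead of leaving it to the caller.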






[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240820#comment-16240820
 ] 

Tilman Hausherr commented on PDFBOX-3993:
-

What you uploaded is a .png file.




[jira] [Updated] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ace updated PDFBOX-3993:

Attachment: sample-pdf.png




Re: Running tika-eval on the Rackspace vm

2017-11-06 Thread Tilman Hausherr
I think I was successful, the report now makes sense, as if Tim had 
created it himself :-) The two issues I just created are related to a 
comparison between 2.0.8 and 2.0.4.


So for the next board report, we can now (in addition to the existing 
text) say that there is a second committer who can run the tests.


Tilman

Am 05.11.2017 um 22:06 schrieb Tilman Hausherr:
I've come closer to finding out what's happening. I found out that 
tika-app was running with PDFBox 2.0.7 all the time, regardless of which 
pdfbox version is in the pom.xml.


Apparently, building tika-app uses tika-parsers from the repository 
(instead of building tika-parsers again), which needs 2.0.7. 
Explicitly building tika-parsers before building tika-app helps.


This is new to me; in PDFBox, if one builds the app, all dependencies 
are built as well.


Tilman

Am 04.11.2017 um 14:48 schrieb Tilman Hausherr:

So it's done:
/work/eval/pdfbox_2_0_4_Vs_2_0_8-SNAPSHOT_reports_03112017

I wonder why the differences are so few, especially in meta where I 
KNOW that there are differences, due to the handling of empty strings 
with BOM. Maybe it is because I skipped the "A" phase and used 
existing data from a 2.0.4 run that I found, or because I use a 
current tika trunk and not the existing binary that was on the server.


I'm thinking of creating a new "A" with 2.0.4 and the current tika trunk 
and then comparing it with the "B" I did.


Tilman


Am 03.11.2017 um 22:14 schrieb Tilman Hausherr:

Am 03.11.2017 um 21:38 schrieb Allison, Timothy B.:

I'm not sure what you mean by...sorry

- "H" is missing, which is identical to "C"



I just meant the steps in https://wiki.apache.org/tika/TikaEvalOnVM

In segment 3, "execute: nohup ./appBatchExecutor.sh &" is missing. 
Of course it is obvious that it has to be done, but I am a 
perfectionist. I'd like to have this documentation for the "me" in a 
few months when I have forgotten what I did the last days. Or for 
the next person.


Thanks for the fixes you did. I wonder why writing to /tmp didn't 
work - it did work from the command line. I've started the command 
again, I'm not sure when I will report about it. I'm a bit exhausted 
from non-software activities :-(


Tilman






-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org







[jira] [Created] (PDFBOX-3995) IllegalArgumentException: root cannot be null with truncated file (3)

2017-11-06 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-3995:
---

 Summary: IllegalArgumentException: root cannot be null with 
truncated file (3)
 Key: PDFBOX-3995
 URL: https://issues.apache.org/jira/browse/PDFBOX-3995
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 2.0.8
Reporter: Tilman Hausherr
 Attachments: PDFBOX-3995-HU47TMUIW72SB4WR2UBJWJ4CY5IFCR3X.pdf

{code}
java.lang.IllegalArgumentException: root cannot be null
org.apache.pdfbox.pdmodel.PDPageTree.<init>(PDPageTree.java:75)

org.apache.pdfbox.pdmodel.PDDocumentCatalog.getPages(PDDocumentCatalog.java:129)
org.apache.pdfbox.pdmodel.PDDocument.getPages(PDDocument.java:1389)
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Created] (PDFBOX-3994) ClassCastException in COSParser.bfSearchForTrailer

2017-11-06 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-3994:
---

 Summary: ClassCastException in COSParser.bfSearchForTrailer
 Key: PDFBOX-3994
 URL: https://issues.apache.org/jira/browse/PDFBOX-3994
 Project: PDFBox
  Issue Type: Bug
  Components: Parsing
Affects Versions: 2.0.8, 2.0.7
Reporter: Tilman Hausherr
 Attachments: PDFBOX-3994-A5YEWC5MQKVVX4ZQI5ZZAND6A2O3FPXM.pdf

{code}
java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast 
to org.apache.pdfbox.cos.COSObject

org.apache.pdfbox.pdfparser.COSParser.bfSearchForTrailer(COSParser.java:1668)
org.apache.pdfbox.pdfparser.COSParser.rebuildTrailer(COSParser.java:2110)
org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:246)
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:189)
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
{code}
This worked in 2.0.6.
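
The trace shows a value being cast to a type it does not have. The usual defensive pattern is a class check before the cast. The sketch below illustrates that pattern only; `Base`, `IntValue` and `Ref` are hypothetical stand-ins and not the PDFBox classes, and this is not the actual parser code.

```java
public class SafeCastSketch {
    // Hypothetical stand-ins for a parsed value hierarchy; not the real PDFBox types.
    interface Base {}

    static final class IntValue implements Base {
        final long value;
        IntValue(long value) { this.value = value; }
    }

    static final class Ref implements Base {
        final long objectNumber;
        Ref(long objectNumber) { this.objectNumber = objectNumber; }
    }

    /** Returns the referenced object number, or -1 if the entry is not a reference. */
    static long objectNumberOf(Base entry) {
        // Check the runtime class first; a blind cast would throw ClassCastException
        // for an IntValue, which is what a damaged file can produce.
        if (entry instanceof Ref) {
            return ((Ref) entry).objectNumber;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(objectNumberOf(new Ref(7)));
        System.out.println(objectNumberOf(new IntValue(3)));
    }
}
```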






[jira] [Commented] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240731#comment-16240731
 ] 

Tilman Hausherr commented on PDFBOX-3993:
-

The current version is 2.0.8. Please attach the PDF.

> PDFTextStripper.writeText is slow
> -
>
> Key: PDFBOX-3993
> URL: https://issues.apache.org/jira/browse/PDFBOX-3993
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.4
>Reporter: Ace
>
> I see this problem was reported about two years ago, and the poster 
> said it had been fixed in version 2.0.0, but I now face it when I 
> try to parse a PDF file with an image in it. I added logging before and after 
> PDFTextStripper.writeText to measure the time spent; it takes 
> quite long to write the text to a StringWriter (sometimes 15s+ for a 220K PDF file 
> with an image in it). This happens on our production server, not 
> locally. 






[jira] [Created] (PDFBOX-3993) PDFTextStripper.writeText is slow

2017-11-06 Thread Ace (JIRA)
Ace created PDFBOX-3993:
---

 Summary: PDFTextStripper.writeText is slow
 Key: PDFBOX-3993
 URL: https://issues.apache.org/jira/browse/PDFBOX-3993
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.4
Reporter: Ace


I see this problem was reported about two years ago, and the poster said 
it had been fixed in version 2.0.0, but I now face it when I try to 
parse a PDF file with an image in it. I added logging before and after 
PDFTextStripper.writeText to measure the time spent; it takes quite 
long to write the text to a StringWriter (sometimes 15s+ for a 220K PDF file with 
an image in it). This happens on our production server, not 
locally. 
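
The kind of measurement described above can be sketched roughly as follows. This is an assumption-laden illustration, not the reporter's code: `TimedWrite`, `WriteStep` and `timeWrite` are invented names, and the lambda in `main` is only a placeholder for the real extraction call.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class TimedWrite {
    /** A step that writes to a Writer and may fail with IOException. */
    interface WriteStep {
        void writeTo(Writer out) throws IOException;
    }

    /** Runs the step against the given sink and returns the elapsed time in milliseconds. */
    static long timeWrite(WriteStep step, StringWriter sink) throws IOException {
        long start = System.nanoTime(); // monotonic clock, suited to measuring intervals
        step.writeTo(sink);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        // Placeholder step; in the real program this would be the text extraction call.
        long ms = timeWrite(out -> out.write("extracted text"), sink);
        System.out.println("write took " + ms + " ms, " + sink.getBuffer().length() + " chars");
    }
}
```

Timing only the one call keeps document loading and other setup out of the measurement.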






[jira] [Commented] (PDFBOX-3988) Performance issue when rendering first page of PDF

2017-11-06 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240629#comment-16240629
 ] 

Tilman Hausherr commented on PDFBOX-3988:
-

I turned on the profiler again and saw that LCMS was used and that color spaces 
took a long time to initialize (and LCMS is slow anyway).
So insert all of these right after the "main" line:
{code}
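// Select KCMS instead of LCMS; set this before any color code runs
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider"); // should be first
// Trigger font and color space initialization up front
PDFont font = PDType1Font.COURIER;
PDDeviceCMYK.INSTANCE.toRGB(new float[]{0,0,0,0});
PDDeviceRGB.INSTANCE.toRGB(new float[]{0,0,0});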
{code}
The property is mentioned in "getting started": 
https://pdfbox.apache.org/2.0/getting-started.html

Also change your pom to have the current version, which is 2.0.8 and not 2.0.6.

I didn't notice this the first time because I had tested with PDFDebugger 
instead of your project and my local version had the property set. Mea culpa!

> Performance issue when rendering first page of PDF
> --
>
> Key: PDFBOX-3988
> URL: https://issues.apache.org/jira/browse/PDFBOX-3988
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.7
>Reporter: Maciej Matecki
> Attachments: calls.png
>
>
> Let's say that you want to generate a PNG for every page in a PDF.
> Generating the first page is really slow; the second one is quite fast.
> For example, the test PDF contains two identical pages. 
> First page renders in: ~2000ms
> Second one: ~220 ms
> It looks like for the first page (inv.0) there's 359 ms overhead just for 
> creating the font.
> [^calls.png]
> I tried another library to perform the same operation on the same PDF 
> file and was able to retrieve the BufferedImage of the first (slower) page in 
> ~750 ms.
> For the second page the result was almost the same as for PDFBox.
> It looks like there is room to improve performance when starting to 
> render a PDF page. 
> Or do you have any advice on how to improve performance in the example project?
> Test project: https://github.com/mmatecki/pdfboxperformance






[jira] [Commented] (PDFBOX-3988) Performance issue when rendering first page of PDF

2017-11-06 Thread Maciej Matecki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240143#comment-16240143
 ] 

Maciej Matecki commented on PDFBOX-3988:


Thanks. That saves me a few hundred ms. There's still a big difference 
between the first and the second page:
{code}
Page 0 rendered in 1204 ms 
Page 1 rendered in 149 ms 
{code}
The above is for the PDF file where the two pages are the same, so let's say that 
something is cached and the second page is much faster.

However, if I test a PDF where each page is different, the results look like this:
{code}
Page 0 rendered in 1172 ms 
Page 1 rendered in 763 ms 
Page 2 rendered in 208 ms 
Page 3 rendered in 56 ms 
Page 4 rendered in 78 ms 
Page 5 rendered in 161 ms 
{code}

The interesting thing: in the last commit to my small benchmark project, two 
PDFs are rendered:

{code}
try (PDDocument document = getPDFDocument("/test-pdf-18.pdf")) {
    renderPages(document, dpi);
    System.out.println("---");
} catch (Exception e) {
    System.out.println("Ups. " + e.getMessage());
}

try (PDDocument document = getPDFDocument("/test-2p.pdf")) {
    renderPages(document, dpi);
    System.out.println("---");
} catch (Exception e) {
    System.out.println("Ups. " + e.getMessage());
}
{code}

And the results look like this:
{code}
Page 0 rendered in 1083 ms 
Page 1 rendered in 718 ms 
Page 2 rendered in 185 ms 
Page 3 rendered in 64 ms 
Page 4 rendered in 82 ms 
Page 5 rendered in 164 ms 
Page 6 rendered in 1216 ms 
Page 7 rendered in 184 ms 
Page 8 rendered in 104 ms 
Page 9 rendered in 99 ms 
Page 10 rendered in 112 ms 
---
Page 0 rendered in 123 ms 
Page 1 rendered in 113 ms 
---
{code}

So the first page of the second PDF is rendered really quickly.

Is there anything else that could be initialized earlier so that later results are better?
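
One way to check whether the remaining cost is one-time initialization is to run an untimed warm-up pass before measuring. The sketch below is an illustration with a fake renderer only; `WarmUpTimer`, `timePages` and the sleep durations are assumptions and stand in for the real rendering call.

```java
import java.util.function.IntConsumer;

public class WarmUpTimer {
    /** Times renderPage for pages 0..pageCount-1 and returns the elapsed milliseconds per page. */
    static long[] timePages(int pageCount, IntConsumer renderPage) {
        long[] ms = new long[pageCount];
        for (int i = 0; i < pageCount; i++) {
            long start = System.nanoTime();
            renderPage.accept(i);
            ms[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return ms;
    }

    public static void main(String[] args) {
        // Fake renderer: the very first call pays a one-time "initialization" delay.
        boolean[] initialized = {false};
        IntConsumer fakeRender = page -> {
            try {
                if (!initialized[0]) {
                    Thread.sleep(200);
                    initialized[0] = true;
                }
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        fakeRender.accept(0); // untimed warm-up pass absorbs the one-time cost
        long[] ms = timePages(3, fakeRender);
        for (int i = 0; i < ms.length; i++) {
            System.out.println("Page " + i + " rendered in " + ms[i] + " ms");
        }
    }
}
```

If page 0 is still much slower after a warm-up pass, the cost is per-document or per-page rather than one-time setup.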


> Performance issue when rendering first page of PDF
> --
>
> Key: PDFBOX-3988
> URL: https://issues.apache.org/jira/browse/PDFBOX-3988
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.7
>Reporter: Maciej Matecki
> Attachments: calls.png
>
>
> Let's say that you want to generate a PNG for every page in a PDF.
> Generating the first page is really slow; the second one is quite fast.
> For example, the test PDF contains two identical pages. 
> First page renders in: ~2000ms
> Second one: ~220 ms
> It looks like for the first page (inv.0) there's 359 ms overhead just for 
> creating the font.
> [^calls.png]
> I tried another library to perform the same operation on the same PDF 
> file and was able to retrieve the BufferedImage of the first (slower) page in 
> ~750 ms.
> For the second page the result was almost the same as for PDFBox.
> It looks like there is room to improve performance when starting to 
> render a PDF page. 
> Or do you have any advice on how to improve performance in the example project?
> Test project: https://github.com/mmatecki/pdfboxperformance


