[jira] [Commented] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340638#comment-16340638
 ] 

Maruan Sahyoun commented on PDFBOX-4080:


OK - I've added the parameter, but for now I haven't implemented passing it on 
within the handlers. What I would like is to come up with a solution where 
we don't have to pass this around, if doable, as I find it very unintuitive that 
this is needed at that level, but I haven't looked into the whole 
{{ScratchFile}} mechanism up to now. Usage of the "memory model" should be 
transparent to the user after setting it with {{PDDocument.load()}} IMHO.

I'd like to get rid of that parameter.

Ideas?
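
One possible shape for "transparent to the user" is sketched below. It is only an illustration of the idea and not PDFBox code: DocStub and NoteStub are hypothetical stand-ins, and the assumption is that an annotation can be given a reference to its owning document when it is created, so that the scratch storage never appears in a method signature.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical document: the only owner of the scratch storage. */
final class DocStub {
    private final List<String> scratch = new ArrayList<>();

    /** Every stream is created here, so the setting chosen at load time applies everywhere. */
    void createStream(String label) {
        scratch.add(label);
    }

    int streamCount() {
        return scratch.size();
    }

    /** Annotations are handed their owning document once, when they are created. */
    NoteStub addNote(String name) {
        return new NoteStub(this, name);
    }
}

/** Hypothetical annotation that remembers its document instead of taking a parameter. */
final class NoteStub {
    private final DocStub owner;
    private final String name;

    NoteStub(DocStub owner, String name) {
        this.owner = owner;
        this.name = name;
    }

    /** No scratch argument: the call site stays the same whatever memory setting is in use. */
    void constructAppearances() {
        owner.createStream(name + "/appearance");
    }
}
```

The trade-off in this sketch is that every annotation has to know its document, which may not hold for annotations created before they are attached to a page.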

> Improve memory consumption of PDAbstractAppearanceHandler
> -
>
> Key: PDFBOX-4080
> URL: https://issues.apache.org/jira/browse/PDFBOX-4080
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 3.0.0 PDFBox
>Reporter: Maruan Sahyoun
>Assignee: Maruan Sahyoun
>Priority: Major
>  Labels: Annotations
> Fix For: 3.0.0 PDFBox
>
>
> PDAbstractAppearanceHandler calls new COSStream(), this has a huge memory 
> footprint (PDFBOX-3868 and PDFBOX-3852). We'd need to find a way to pass the 
> document, or the document scratch file, or there will be trouble for files 
> with many annotations, e.g. a long scientific document with many footnotes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2018-01-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339868#comment-16339868
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

wrong_space_parsed_sample.pdf is from Hesham Gneady from the mailing list and 
fails text extraction with the repository code and succeeds with the modified 
code in this issue.

> x,y co-ordinates of the text inside the cell are not getting correctly.
> ---
>
> Key: PDFBOX-3970
> URL: https://issues.apache.org/jira/browse/PDFBOX-3970
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.7
> Environment: Operating system: Windows 7 (64 bit).
>Reporter: Navnath Kumbhar
>Priority: Major
>  Labels: how-to
> Attachments: LegacyPDFStreamEngine.java, LegacyPDFStreamEngine.java, 
> formula-marked-34.png, paragraphNextToTable-marked-1.png, 
> paragraphNextToTable.pdf, simpleAnnotation.pdf, wrong_space_parsed_sample.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the 
> extracted text in some .txt file which can be read by other product.
> My issue is regarding extracting the text inside the cell of a table: 
> *x,y co-ordinates of the text inside the cell are not getting correctly.*
> Y value of the last text line in the cell is getting larger than cell's max-Y 
> value.
> I have attached the test file with this bug.
> As you can see in the test document, there is one cell along-with text in it 
> and a text paragraph next to that cell.
> x-y coordinates that I get from pdfbox for all the paths (two vertical and 
> two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1 : [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (Y values of the above paths are final values by subtracting the actual value 
> given by pdfbox from height of the page as I see that for paths, y-values are 
> processed from bottom to up)
> And bounding box of the last line in that cell is : [102,114,59,7] and hence 
> max-Y of that line becomes 121 (min-Y + height)
>  
> So, if we consider max-Y value of that cell (i.e. 120)  and that of last line 
> in that cell (i.e. 121), clearly, that line goes out of that cell.
> What can be the possible reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar
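
The arithmetic in the report can be checked in a few lines. The sketch below is an illustration only: it uses java.awt.geom.Rectangle2D with the numbers quoted above, the class and method names are made up for this example, and it does not reproduce how PDFBox computes text positions.

```java
import java.awt.geom.Rectangle2D;

final class CellBoundsCheck {
    // Cell rectangle built from the four reported lines: x 100..220, y 88..120.
    static final Rectangle2D CELL = new Rectangle2D.Double(100, 88, 220 - 100, 120 - 88);
    // Bounding box of the last text line as reported: [x, y, width, height].
    static final Rectangle2D LAST_LINE = new Rectangle2D.Double(102, 114, 59, 7);

    /** How far the text box extends past the max-Y edge of the cell (0 if it fits). */
    static double overflowBelow(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        System.out.println("cell max-Y = " + CELL.getMaxY());
        System.out.println("line max-Y = " + LAST_LINE.getMaxY());
        System.out.println("overflow   = " + overflowBelow(CELL, LAST_LINE));
        System.out.println("contained  = " + CELL.contains(LAST_LINE));
    }
}
```

With the reported values the text box ends one unit beyond the cell's max-Y, which matches the 121 versus 120 comparison in the description.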






[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2018-01-25 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3970:

Attachment: wrong_space_parsed_sample.pdf







[jira] [Commented] (PDFBOX-4071) Improve code quality (3)

2018-01-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339840#comment-16339840
 ] 

ASF subversion and git services commented on PDFBOX-4071:
-

Commit 187 from [~tilman] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r187 ]

PDFBOX-4071: delete code line that removes the action type; delete super() call

> Improve code quality (3)
> 
>
> Key: PDFBOX-4071
> URL: https://issues.apache.org/jira/browse/PDFBOX-4071
> Project: PDFBox
>  Issue Type: Task
>Affects Versions: 2.0.8
>Reporter: Tilman Hausherr
>Priority: Major
>
> This is a longterm issue for the task to improve code quality, by using the 
> [SonarQube 
> report|https://analysis.apache.org/dashboard/index/org.apache.pdfbox:pdfbox-reactor],
>  hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-2852, which was getting too long.






[jira] [Commented] (PDFBOX-4071) Improve code quality (3)

2018-01-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339839#comment-16339839
 ] 

ASF subversion and git services commented on PDFBOX-4071:
-

Commit 186 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r186 ]

PDFBOX-4071: delete code line that removes the action type; delete super() call







Jenkins build is back to normal : PDFBox-Trunk-jdk9 #215

2018-01-25 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-Trunk-jdk9 » Apache PDFBox examples #215

2018-01-25 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-trunk » Apache PDFBox examples #3751

2018-01-25 Thread Apache Jenkins Server
See 






Jenkins build is back to normal : PDFBox-trunk #3751

2018-01-25 Thread Apache Jenkins Server
See 






Build failed in Jenkins: PDFBox-Trunk-jdk9 #214

2018-01-25 Thread Apache Jenkins Server
See 


Changes:

[msahyoun] PDFBOX-4080: allow to pass a ScratchFile to constructAppearances(); 
needs further changes within appearance handlers

--
[...truncated 380.60 KB...]
[INFO] 
[INFO] 
[INFO] --- maven-source-plugin:2.3:jar (attach-sources) @ pdfbox-app ---
[INFO] Building jar: 

[INFO] 
[INFO] --- apache-rat-plugin:0.12:check (default) @ pdfbox-app ---
[INFO] Enabled default license matchers.
[INFO] Will parse SCM ignores for exclusions...
[INFO] Finished adding exclusions from SCM ignore files.
[INFO] 61 implicit excludes (use -debug for more details).
[INFO] Exclude: release.properties
[INFO] 3 resources included (use -debug for more details)
[INFO] Rat check: Summary over all files. Unapproved: 0, unknown: 0, generated: 
0, approved: 1 licenses.
[INFO] 
[INFO] --- dependency-check-maven:3.1.0:check (default) @ pdfbox-app ---
[INFO] Checking for updates
[INFO] Skipping NVD check since last check was within 4 hours.
[INFO] Check for updates complete (523 ms)
[INFO] Analysis Started
[INFO] Finished Archive Analyzer (0 seconds)
[INFO] Finished File Name Analyzer (0 seconds)
[INFO] Finished Jar Analyzer (0 seconds)
[INFO] Finished Central Analyzer (0 seconds)
[INFO] Finished Dependency Merging Analyzer (0 seconds)
[INFO] Finished Version Filter Analyzer (0 seconds)
[INFO] Finished Hint Analyzer (0 seconds)
[INFO] Created CPE Index (0 seconds)
[INFO] Skipping CPE Analysis for npm
[INFO] Finished CPE Analyzer (0 seconds)
[INFO] Finished False Positive Analyzer (0 seconds)
[INFO] Finished Cpe Suppression Analyzer (0 seconds)
[INFO] Finished NVD CVE Analyzer (0 seconds)
[INFO] Finished Vulnerability Suppression Analyzer (0 seconds)
[INFO] Finished Dependency Bundling Analyzer (0 seconds)
[INFO] Analysis Complete (1 seconds)
[INFO] 
[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ pdfbox-app ---
[INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-SNAPSHOT.jar
[INFO] Installing 
 to 
/home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-SNAPSHOT.pom
[INFO] Installing 

 to 
/home/jenkins/jenkins-slave/maven-repositories/1/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-SNAPSHOT-sources.jar
[INFO] 
[INFO] --- maven-bundle-plugin:3.3.0:install (default-install) @ pdfbox-app ---
[INFO] Installing 
org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-SNAPSHOT.jar
[INFO] Writing OBR metadata
[INFO] 
[INFO] 
[INFO] Building Apache PDFBox Debugger application 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ debugger-app ---
[TASKS] Scanning folder 
' for 
files matching the pattern '**/*.java' - excludes: 
[TASKS] Found 0 files to scan for tasks
Found 0 open tasks.
[TASKS] Computing warning deltas based on reference build #213
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ debugger-app 
---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
debugger-app ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 

[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ debugger-app 
---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
debugger-app ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 

[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
debugger-app ---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-surefire-plugin:2.17:test (default-test) @ debugger-app ---
[JENKINS] Recording test results
[INFO] 
[INFO] --- animal-sniffer-maven-plugin:1.15:check (check-java-version) @ 
debugger-app ---
[INFO] Checking unresolved references to org.codehaus.mojo.signature:java17:1.0
[INFO] 
[INFO] --- maven-bundle-plugin:3.3.0:bundle (default-bundle) @ debugger-app ---
[WARNING] Bundle 

Build failed in Jenkins: PDFBox-Trunk-jdk9 » Apache PDFBox examples #214

2018-01-25 Thread Apache Jenkins Server
See 


--
[INFO] 
[INFO] 
[INFO] Building Apache PDFBox examples 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-examples ---
[TASKS] Scanning folder 
'
 for files matching the pattern '**/*.java' - excludes: 
[TASKS] Found 79 files to scan for tasks
Found 15 open tasks.
[TASKS] Computing warning deltas based on reference build #213
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
pdfbox-examples ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
pdfbox-examples ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
pdfbox-examples ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 76 source files to 

[INFO] -
[WARNING] COMPILATION WARNING : 
[INFO] -
[WARNING] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 
:[173,31]
 newInstance() in java.lang.Class has been deprecated
[WARNING] 
:
 Some input files use unchecked or unsafe operations.
[WARNING] 
:
 Recompile with -Xlint:unchecked for details.
[INFO] 4 warnings 
[INFO] -
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
:[244,20]
 method constructAppearances in class 
org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation cannot be applied 
to given types;
  required: org.apache.pdfbox.io.ScratchFile
  found: no arguments
  reason: actual and formal argument lists differ in length
[INFO] 1 error
[INFO] -
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] PDFBox parent .. SUCCESS [03:13 min]
[INFO] Apache FontBox . SUCCESS [ 52.908 s]
[INFO] Apache XmpBox .. SUCCESS [ 26.019 s]
[INFO] Apache PDFBox .. SUCCESS [02:26 min]
[INFO] Apache Preflight ... SUCCESS [03:38 min]
[INFO] Apache Preflight application ... SUCCESS [ 28.426 s]
[INFO] Apache PDFBox Debugger . SUCCESS [ 21.496 s]
[INFO] Apache PDFBox tools  SUCCESS [ 32.156 s]
[INFO] Apache PDFBox application .. SUCCESS [ 27.802 s]
[INFO] Apache PDFBox Debugger application . SUCCESS [ 27.684 s]
[INFO] Apache PDFBox examples . FAILURE [  8.437 s]
[INFO] Apache PDFBox .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 13:55 min
[INFO] Finished at: 2018-01-25T18:41:46Z
[INFO] Final Memory: 64M/214M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project pdfbox-examples: Compilation failure
[ERROR] 
:[244,20]
 method constructAppearances in class 
org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation cannot be applied 
to given types;
[ERROR] 

[jira] [Commented] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339599#comment-16339599
 ] 

ASF subversion and git services commented on PDFBOX-4080:
-

Commit 1822213 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1822213 ]

PDFBOX-4080: update example to use ScratchFile







Build failed in Jenkins: PDFBox-trunk #3750

2018-01-25 Thread Apache Jenkins Server
See 


Changes:

[msahyoun] PDFBOX-4080: allow to pass a ScratchFile to constructAppearances(); 
needs further changes within appearance handlers

--
[...truncated 205.94 KB...]
[INFO] --- maven-bundle-plugin:3.3.0:install (default-install) @ pdfbox-app ---
[INFO] Installing 
org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-SNAPSHOT.jar
[INFO] Writing OBR metadata
[INFO] 
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ pdfbox-app ---
[INFO] Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
[INFO] Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
 (999 B at 2.2 kB/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402.jar
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402.jar
 (8.5 MB at 4.8 MB/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402.pom
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402.pom
 (2.9 kB at 3.2 kB/s)
[INFO] Downloading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/maven-metadata.xml
[INFO] Downloaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/maven-metadata.xml
 (468 B at 1.0 kB/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
 (999 B at 1.1 kB/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/maven-metadata.xml
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/maven-metadata.xml
 (468 B at 519 B/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402-sources.jar
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/pdfbox-app-3.0.0-20180125.181253-402-sources.jar
 (7.3 kB at 8.1 kB/s)
[INFO] Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
[INFO] Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/3.0.0-SNAPSHOT/maven-metadata.xml
 (999 B at 1.1 kB/s)
[INFO] 
[INFO] --- maven-bundle-plugin:3.3.0:deploy (default-deploy) @ pdfbox-app ---
[INFO] Remote OBR update disabled (enable with -DremoteOBR)
[INFO] 
[INFO] 
[INFO] Building Apache PDFBox Debugger application 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ debugger-app ---
[TASKS] Scanning folder 
' for files 
matching the pattern '**/*.java' - excludes: 
[TASKS] Found 0 files to scan for tasks
Found 0 open tasks.
[TASKS] Computing warning deltas based on reference build #3749
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ debugger-app 
---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
debugger-app ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 

[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ debugger-app 
---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
debugger-app ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory 

[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ 
debugger-app ---
[INFO] No sources to compile
[INFO] 
[INFO] --- maven-surefire-plugin:2.17:test (default-test) @ debugger-app ---
[JENKINS] Recording test results[INFO] 

[INFO] --- 

Build failed in Jenkins: PDFBox-trunk » Apache PDFBox examples #3750

2018-01-25 Thread Apache Jenkins Server
See 


--
[INFO] 
[INFO] 
[INFO] Building Apache PDFBox examples 3.0.0-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ pdfbox-examples ---
[TASKS] Scanning folder 
'
 for files matching the pattern '**/*.java' - excludes: 
[TASKS] Found 79 files to scan for tasks
Found 15 open tasks.
[TASKS] Computing warning deltas based on reference build #3749
[INFO] 
[INFO] --- maven-remote-resources-plugin:1.5:process (default) @ 
pdfbox-examples ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ 
pdfbox-examples ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ 
pdfbox-examples ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 76 source files to 

[INFO] -
[WARNING] COMPILATION WARNING : 
[INFO] -
[WARNING] bootstrap class path not set in conjunction with -source 1.7
[WARNING] 
:
 Some input files use unchecked or unsafe operations.
[WARNING] 
:
 Recompile with -Xlint:unchecked for details.
[INFO] 3 warnings 
[INFO] -
[INFO] -
[ERROR] COMPILATION ERROR : 
[INFO] -
[ERROR] 
:[244,20]
 method constructAppearances in class 
org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation cannot be applied 
to given types;
  required: org.apache.pdfbox.io.ScratchFile
  found: no arguments
  reason: actual and formal argument lists differ in length
[INFO] 1 error
[INFO] -
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] PDFBox parent .. SUCCESS [03:32 min]
[INFO] Apache FontBox . SUCCESS [ 58.507 s]
[INFO] Apache XmpBox .. SUCCESS [ 31.404 s]
[INFO] Apache PDFBox .. SUCCESS [02:31 min]
[INFO] Apache Preflight ... SUCCESS [03:57 min]
[INFO] Apache Preflight application ... SUCCESS [ 36.177 s]
[INFO] Apache PDFBox Debugger . SUCCESS [ 26.172 s]
[INFO] Apache PDFBox tools  SUCCESS [ 35.424 s]
[INFO] Apache PDFBox application .. SUCCESS [ 35.665 s]
[INFO] Apache PDFBox Debugger application . SUCCESS [ 36.134 s]
[INFO] Apache PDFBox examples . FAILURE [  8.608 s]
[INFO] Apache PDFBox .. SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 15:21 min
[INFO] Finished at: 2018-01-25T18:14:05Z
[INFO] Final Memory: 68M/572M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project pdfbox-examples: Compilation failure
[ERROR] 
:[244,20]
 method constructAppearances in class 
org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation cannot be applied 
to given types;
[ERROR] required: org.apache.pdfbox.io.ScratchFile
[ERROR] found: no arguments
[ERROR] reason: actual and formal argument lists differ in length
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X 

[jira] [Commented] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339556#comment-16339556
 ] 

ASF subversion and git services commented on PDFBOX-4080:
-

Commit 1822209 from [~msahyoun] in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1822209 ]

PDFBOX-4080: allow to pass a ScratchFile to constructAppearances(); needs 
further changes within appearance handlers







[GitHub] pdfbox pull request #43: PDColorSpace ISSUE: Unable to create a valid color ...

2018-01-25 Thread santino83
GitHub user santino83 opened a pull request:

https://github.com/apache/pdfbox/pull/43

PDColorSpace ISSUE: Unable to create a valid color space with COSDictionary

Unable to create a valid color space when COSBase colorSpace is a 
COSDictionary with key COSName.COLORSPACE => value a valid color space.

We ran into this issue while trying to convert a customer's PDF into 
JPGs. The PDF has 9 pages; while the color space of one of them is being 
processed, a COSDictionary with a single entry is returned, a 
COSName.COLORSPACE => COSName.DEFAULT_RGB key/value pair. In this situation, 
PDColorSpace.create throws an exception instead of handling the case. 

We can provide the original PDF that triggers the issue, but only 
privately, because it is full of personal information and we would rather not 
edit it, so as not to risk losing the original COS configuration.
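
The case described above can be pictured with plain Java collections. This is only a sketch of the idea, not the code in this pull request and not PDFBox's PDColorSpace.create: a Map stands in for COSDictionary, strings stand in for COSName values, and the names ColorSpaceUnwrap and resolveName are hypothetical.

```java
import java.util.Collections;
import java.util.Map;

final class ColorSpaceUnwrap {
    // Stand-in for the dictionary key; assumed here to play the role of COSName.COLORSPACE.
    static final String COLORSPACE_KEY = "ColorSpace";

    /**
     * Returns the color space name, unwrapping a dictionary that only
     * carries a nested color space entry instead of rejecting it.
     */
    static String resolveName(Object colorSpace) {
        if (colorSpace instanceof String) {
            return (String) colorSpace;
        }
        if (colorSpace instanceof Map) {
            Object nested = ((Map<?, ?>) colorSpace).get(COLORSPACE_KEY);
            if (nested != null) {
                // The wrapped case from the description: look inside the dictionary.
                return resolveName(nested);
            }
        }
        throw new IllegalArgumentException("Unsupported color space object: " + colorSpace);
    }

    public static void main(String[] args) {
        // "DefaultRGB" is an illustrative value standing in for COSName.DEFAULT_RGB.
        Object wrapped = Collections.singletonMap(COLORSPACE_KEY, "DefaultRGB");
        System.out.println(resolveName(wrapped));
    }
}
```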

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/santino83/pdfbox 2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/pdfbox/pull/43.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #43


commit dd5c64d2890151145b01b3da8e8037fd4225e26f
Author: Giorgio M. Santini 
Date:   2018-01-25T17:26:03Z

PDColorSpace ISSUE: Unable to determinate a valid colro space when COSBase 
colorSpace is a COSDictionary with key COSName.COLORSPACE => value a valid 
color space




---




Re: FW: Word Merging Problem

2018-01-25 Thread Tilman Hausherr
I tried running your code and I can't because it was written for an 
older version of PDFBox (probably 1.8) and it has a syntax error and the 
parameters are missing so I doubt your code ever ran that way. I tried 
running ExtractText on PDFBox 1.8 and yes, many blanks are missing. So 
please use the current version 2.0.8. I found one occurrence where the 
blank was missing ("Wewould") but Adobe Reader has the same problem.


Tilman


On 25.01.2018 at 04:22, Laxmi Narayan wrote:


Hi Team,

I have a problem while extracting text from a PDF. When we extract 
the text, words merge together. Can you suggest what we have to 
do about this?


I have attached the PDF file from which I am extracting the text. And 
I am using the below code to extract the text.


Please help me as soon as possible.

private static string GetTextByArea_Orgnal(PDDocument doc, int x, int y, int w, int h)
{
    PDFTextStripperByArea stripper = new PDFTextStripperByArea("UTF-8");
    stripper.setLineSeparator(" ");
    stripper.setDropThreshold(3);
    stripper.setWordSeparator(" ");
    stripper.setParagraphStart("");
    stripper.setParagraphEnd("");
    stripper.setIndentThreshold(1);
    stripper.setSortByPosition(true);
    //==
    //==
    Dimension d = new Dimension(w, h);
    Rectangle rect = new Rectangle(new Point(x, y), d);
    stripper.addRegion("class1", rect);
    java.util.List allPages = doc.getDocumentCatalog().getAllPages();
    PDPage firstPage = (PDPage)allPages.get(0);
     overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
    PDPageContentStream contentStream = new PDPageContentStream(doc, firstPage, true, true);
    contentStream.setNonStrokingColor(Color.CYAN);
    contentStream.fillRect(x, y, w, h);
    contentStream.close();
    =
    stripper.extractRegions(firstPage);
    return stripper.getTextForRegion("class1");
}

Thanks,

Laxmi Narayan








[jira] [Commented] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339514#comment-16339514
 ] 

Tilman Hausherr commented on PDFBOX-4080:
-

How about passing the document or the scratch file in constructAppearances? It 
should then be passed to generateAppearanceStreams() and then to the actual 
methods if needed. My first thought was to pass the document because it is 
easier to understand. OTOH passing the scratch file means passing only what we 
really need.
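
The two options can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in rather than PDFBox code: `ScratchStore`, `Document`, `Annotation` and `AppearanceHandler` are invented for illustration, and only the method names `constructAppearances` and `generateAppearanceStreams` are borrowed from the comment above (their signatures here are assumptions). The annotation receives the document, while the handler receives only the scratch store, so every appearance stream is allocated from the one shared store.

```java
import java.util.ArrayList;
import java.util.List;

public class ScratchPassingSketch {

    /** Hypothetical stand-in for the document scratch file: a shared buffer pool. */
    static final class ScratchStore {
        private final List<StringBuilder> buffers = new ArrayList<>();

        /** Allocates a new buffer that is tracked by this store. */
        StringBuilder newBuffer() {
            StringBuilder buffer = new StringBuilder();
            buffers.add(buffer);
            return buffer;
        }

        int bufferCount() {
            return buffers.size();
        }
    }

    /** Hypothetical stand-in for the document, which owns the scratch store. */
    static final class Document {
        private final ScratchStore scratch = new ScratchStore();

        ScratchStore getScratch() {
            return scratch;
        }
    }

    /** Hypothetical handler: it is handed only what it needs, the scratch store. */
    static final class AppearanceHandler {
        void generateAppearanceStreams(ScratchStore scratch) {
            generateNormalAppearance(scratch);
        }

        private void generateNormalAppearance(ScratchStore scratch) {
            // Allocate from the shared store instead of creating an unmanaged stream.
            scratch.newBuffer().append("q 0 0 10 10 re f Q");
        }
    }

    /** Hypothetical annotation: it takes the document and passes on the scratch store. */
    static final class Annotation {
        private final AppearanceHandler handler = new AppearanceHandler();

        void constructAppearances(Document doc) {
            handler.generateAppearanceStreams(doc.getScratch());
        }
    }

    /** Builds appearances for the given number of annotations; returns buffers allocated. */
    static int run(int annotationCount) {
        Document doc = new Document();
        for (int i = 0; i < annotationCount; i++) {
            new Annotation().constructAppearances(doc);
        }
        return doc.getScratch().bufferCount();
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

In this sketch the public entry point takes the document because that is the easier thing for a caller to supply, and the narrower scratch store is what travels further down the call chain.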







[jira] [Commented] (PDFBOX-4081) Image with JPXDecode filter not render perfectly

2018-01-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339503#comment-16339503
 ] 

Tilman Hausherr commented on PDFBOX-4081:
-

This is a problem with the JP2 decoder, see PDFBOX-1819. Try this:
 - with PDFDebugger save the image into PDFBOX-4081.jp2 (attached)
 - view it with your favourite viewer, see that it looks ok
 - read this image file with ImageIO
 - save it as png
 - cry when seeing the result.
 
 I'll report this on GitHub later, but there's nothing we can do except 
write our own decoder.
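
The "read with ImageIO, save as png" steps above can be sketched roughly as follows. This is a minimal illustration, not code from PDFBox: the class name `Jp2ToPng` and the method `convertToPng` are made up here, and it assumes that a JPEG 2000 ImageIO plug-in is on the classpath, since the JDK does not ship a reader for that format. Only standard `javax.imageio` calls are used.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class Jp2ToPng {

    /**
     * Decodes the input file with whatever ImageIO reader is registered for it
     * and writes the decoded image as PNG.
     */
    static void convertToPng(File input, File output) throws IOException {
        // ImageIO.read returns null when no registered reader can decode the file.
        BufferedImage image = ImageIO.read(input);
        if (image == null) {
            throw new IOException("No ImageIO reader could decode " + input);
        }
        // ImageIO.write returns false when no appropriate writer is found.
        if (!ImageIO.write(image, "png", output)) {
            throw new IOException("No ImageIO writer available for png");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Jp2ToPng <input.jp2> <output.png>");
            return;
        }
        convertToPng(new File(args[0]), new File(args[1]));
    }
}
```

Run against the attached PDFBOX-4081.jp2, a program like this takes PDFBox out of the picture, so any defect in the resulting PNG comes from the decoder plug-in alone.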

> Image with JPXDecode filter not render perfectly
> 
>
>                 Key: PDFBOX-4081
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4081
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.8
>            Reporter: savan patel
>            Priority: Major
>         Attachments: PDFBOX-4081.jp2, PDFBOX-4081.png, selection3.pdf
>
>
> There is a image in a pdf which has a JPXDecode filter applied on it and it 
> is rendered badly...






[jira] [Updated] (PDFBOX-4081) Image with JPXDecode filter not render perfectly

2018-01-25 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4081:

Attachment: PDFBOX-4081.png







[jira] [Updated] (PDFBOX-4081) Image with JPXDecode filter not render perfectly

2018-01-25 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4081:

Attachment: PDFBOX-4081.jp2







[jira] [Created] (PDFBOX-4081) Image with JPXDecode filter not render perfectly

2018-01-25 Thread savan patel (JIRA)
savan patel created PDFBOX-4081:
---

             Summary: Image with JPXDecode filter not render perfectly
                 Key: PDFBOX-4081
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4081
             Project: PDFBox
          Issue Type: Bug
          Components: Rendering
    Affects Versions: 2.0.8
            Reporter: savan patel
         Attachments: selection3.pdf

There is a image in a pdf which has a JPXDecode filter applied on it and it is 
rendered badly...






[jira] [Commented] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339004#comment-16339004
 ] 

Maruan Sahyoun commented on PDFBOX-4080:


[~tilman] maybe we could have a quick chat about approaching this.







[jira] [Commented] (PDFBOX-3353) Create appearance streams for annotations

2018-01-25 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339002#comment-16339002
 ] 

Maruan Sahyoun commented on PDFBOX-3353:


{quote}
PDAbstractAppearanceHandler calls new COSStream(), this has a huge memory 
footprint (PDFBOX-3868 and PDFBOX-3852). We'd need to find a way to pass the 
document, or the document scratch file, or there will be trouble for files with 
many annotations, e.g. a long scientific document with many footnotes.
{quote}

I've created PDFBOX-4080 for that. We should handle that outside of this issue. 



> Create appearance streams for annotations
> -
>
>                 Key: PDFBOX-3353
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3353
>             Project: PDFBox
>          Issue Type: Task
>          Components: PDModel, Rendering
>    Affects Versions: 1.8.12, 2.0.0, 2.0.1, 2.0.2, 3.0.0 PDFBox
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Annotations
>         Attachments: AnnotationSample.Standard.pdf, 
> CTAN-example-Annotations-rot270.pdf, CTAN-example-Annotations.pdf, 
> PDFBOX-2019-Annotations.pdf, PDFBOX-2898-Annotations.pdf, 
> PDFBOX-3353-highlight-noAP-001796-p1.pdf, PDFBOX-3353-highlight-noAP.pdf, 
> PDFJS-7115-indirect-rect.pdf, ShowAnnotation-4.java, ShowAnnotation-5.java, 
> ShowAnnotation-6.java, SquareAnnotations.pdf, annots.pdf, 
> gs-bugzilla-693664-AnnotationTest.pdf, 
> line_dimension_appearance_stream-noAP.pdf, 
> line_dimension_appearance_stream.pdf, pdf_commenting_new.pdf, 
> showAnnotation.java, text_markup_ap_test.pdf
>
>
> Create appearance streams for annotations when missing.
> I'll start by replacing current code for Ink and Link annotations.
> Good example PDFs:
> http://www.pdfill.com/example/pdf_commenting_new.pdf
> https://github.com/mozilla/pdf.js/issues/6810






[jira] [Created] (PDFBOX-4080) Improve memory consumption of PDAbstractAppearanceHandler

2018-01-25 Thread Maruan Sahyoun (JIRA)
Maruan Sahyoun created PDFBOX-4080:
--

             Summary: Improve memory consumption of PDAbstractAppearanceHandler
                 Key: PDFBOX-4080
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4080
             Project: PDFBox
          Issue Type: Improvement
          Components: PDModel
    Affects Versions: 3.0.0 PDFBox
            Reporter: Maruan Sahyoun
            Assignee: Maruan Sahyoun
             Fix For: 3.0.0 PDFBox


PDAbstractAppearanceHandler calls new COSStream(), this has a huge memory 
footprint (PDFBOX-3868 and PDFBOX-3852). We'd need to find a way to pass the 
document, or the document scratch file, or there will be trouble for files with 
many annotations, e.g. a long scientific document with many footnotes.


