[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Amit Maheshwari (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626747#comment-16626747
 ] 

Amit Maheshwari commented on PDFBOX-4322:
-

Hello [~tilman],

I understand. Kindly keep me updated on the outcome of this regression test 
(and on any other important events I should know about in this matter).

> Extract Text feature is not working for some part of PDF
>
>                 Key: PDFBOX-4322
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4322
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.2, 2.0.11
>            Reporter: Amit Maheshwari
>            Priority: Major
>             Fix For: 2.0.13, 3.0.0 PDFBox
>         Attachments: PDFBOX-4322-Empty-ToUnicode-reduced.pdf, pdf__1.pdf, 
> pdf__1.pdf.xml
>
> The text extraction feature cannot extract text from the attached PDF properly.
> Text inside the rectangular boxes (e.g. the value of Lending Specialist and 
> others) is not extracted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626744#comment-16626744
 ] 

Tilman Hausherr commented on PDFBOX-4322:
-

What I was thinking (unless other opinions come up) was to make the change in 
the 2.0 branch after the release of 2.0.12, which is hopefully soon. The fix 
would then be in 2.0.13, which would be released in 3-4 months. Alternatively, 
you could use a snapshot build, which would appear within hours of the commit. 
A third possibility is to take the source code of the 2.0.2 release (you seem 
to insist on that one?), make the change (a few lines in PDFont.java) and build 
locally (there will NOT be a modified 2.0.2 release).

I'd still recommend that you wait until Tim has run the regression test. This 
is a test with 25 PDF files that checks whether the extraction got better or 
worse.

The "worst" that could happen is that we get more PDFs with garbled text than 
before, as in PDFBOX-3123.




[jira] [Comment Edited] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Amit Maheshwari (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626717#comment-16626717
 ] 

Amit Maheshwari edited comment on PDFBOX-4322 at 9/25/18 3:50 AM:
--

Hello [~talli...@apache.org] [~tilman],

Thanks for your effort in this matter. Since it seems that you have found the 
cause of the problem and already fixed it in trunk, may I know when this fix 
will be available to us?

Or is it possible for you to incorporate this fix into your 2.0.2 branch and 
provide it to us as a hotfix?





[jira] [Comment Edited] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Amit Maheshwari (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626717#comment-16626717
 ] 

Amit Maheshwari edited comment on PDFBOX-4322 at 9/25/18 3:49 AM:
--

Hello [~talli...@apache.org] [~tilman] ,

Thanks for your effort in this matter. So now as it seems that you have found 
the cause of the problem and already fixed (into trunk), may I know when this 
fix will be available for us?

 

Or is it possible for you to incorporate this fix into your 2.0.2 branch and 
provide us it as an hot-fix? 





[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Amit Maheshwari (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626717#comment-16626717
 ] 

Amit Maheshwari commented on PDFBOX-4322:
-

Hello [~talli...@apache.org] [~tilman] ,

Thanks for your effort in this matter. So now as it seems that you have found 
the cause of the problem and already fixed (into trunk), may I know when this 
fix will be available for us?




Jenkins build is back to normal : PDFBox-trunk #4219

2018-09-24 Thread Apache Jenkins Server
See 





[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626325#comment-16626325
 ] 

Tilman Hausherr commented on PDFBOX-4322:
-

[~talli...@apache.org] I prefer to wait... usually the regression tests bring 
more work if changes were made that influence parsing and/or text extraction.




[jira] [Updated] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4322:

Fix Version/s: 3.0.0 PDFBox




Build failed in Jenkins: PDFBox-trunk #4218

2018-09-24 Thread Apache Jenkins Server
See 

--
[...truncated 8.22 KB...]
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:937)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:864)
    at hudson.scm.SCM.checkout(SCM.java:504)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
    at hudson.model.Run.execute(Run.java:1794)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:429)
Caused: java.io.IOException
    at hudson.scm.subversion.UpdateUpdater$TaskImpl.perform(UpdateUpdater.java:216)
    at hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:168)
    at hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:1041)
    at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:1017)
    at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:990)
    at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2918)
    at hudson.remoting.UserRequest.perform(UserRequest.java:212)
    at hudson.remoting.UserRequest.perform(UserRequest.java:54)
    at hudson.remoting.Request$2.run(Request.java:369)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused: java.io.IOException: remote file operation failed:  at hudson.remoting.Channel@151643a6:H33
    at hudson.FilePath.act(FilePath.java:1043)
    at hudson.FilePath.act(FilePath.java:1025)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:937)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:864)
    at hudson.scm.SCM.checkout(SCM.java:504)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
    at hudson.model.Run.execute(Run.java:1794)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:429)
Retrying after 10 seconds
Cleaning up 
ERROR: Failed to update http://svn.apache.org/repos/asf/pdfbox/trunk
org.tmatesoft.svn.core.SVNException: svn: E155032: The pristine text with checksum '$sha1$0d8a658d1ecd3fa00dd8f46d24a9d2fdef18f3e4' was found in the DB but not on disk
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:70)
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:57)
    at org.tmatesoft.svn.core.internal.wc17.db.SvnWcDbPristines.checkPristine(SvnWcDbPristines.java:159)
    at org.tmatesoft.svn.core.internal.wc17.db.SvnWcDbPristines.getPristinePath(SvnWcDbPristines.java:184)
    at org.tmatesoft.svn.core.internal.wc17.db.SVNWCDb.getPristinePath(SVNWCDb.java:1724)
    at org.tmatesoft.svn.core.internal.wc17.SVNWCContext.isTextModified(SVNWCContext.java:754)
    at org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.assembleStatus(SVNStatusEditor17.java:356)
    at org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.sendStatusStructure(SVNStatusEditor17.java:216)
    at org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.getDirStatus(SVNStatusEditor17.java:742)
    at org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.walkStatus(SVNStatusEditor17.java:665)
    at org.tmatesoft.svn.core.internal.wc2.ng.SvnNgGetStatus.run(SvnNgGetStatus.java:132)
    at org.tmatesoft.svn.core.internal.wc2.ng.SvnNgGetStatus.run(SvnNgGetStatus.java:27)
    at org.tmatesoft.svn.core.internal.wc2.ng.SvnNgOperationRunner.run(SvnNgOperationRunner.java:20)
    at org.tmatesoft.svn.core.internal.wc2.SvnOperationRunner.run(SvnOperationRunner.java:21)
    at

[jira] [Commented] (PDFBOX-4252) PDChoice related bugs and issues

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626227#comment-16626227
 ] 

Tilman Hausherr commented on PDFBOX-4252:
-

Yes, this would definitely be nice.

> PDChoice related bugs and issues
>
>                 Key: PDFBOX-4252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: AcroForm
>    Affects Versions: 2.0.10
>            Reporter: Xin Lin
>            Priority: Major
>              Labels: Appearance
>         Attachments: test listbox and droplist export values.pdf, test 
> listbox_and_droplist export values flattened.pdf, top index 0 flattened.pdf, 
> top index 0 with index 4 set as selected.pdf, top index 0.pdf, top index 3.pdf
>
> There are several issues related to either PDListBox or PDComboBox that are 
> still not fixed in 2.0.10. I am going to put them in this one case; let me 
> know if they need to be broken into separate tickets. Thanks.
> 1. When I attempt to set a value on a PDListBox whose 'Top Index' is greater 
> than 0 (as in [^top index 3.pdf]), I always receive the following exception:
> {noformat}
> java.lang.IllegalStateException: Error: You must call beginText() before calling endText.
>   at org.apache.pdfbox.pdmodel.PDPageContentStream.endText(PDPageContentStream.java:381) ~[pdfbox-2.0.10.jar:2.0.10]
> ...
> {noformat}
> I tracked it down to a for loop at the end of the private method 
> insertGeneratedListboxAppearance in the class AppearanceGeneratorHelper:
> {code:java}
> for (int i = topIndex; i < numOptions; i++)
> {
>     if (i == topIndex)
>     {
>         yTextPos = yTextPos - font.getFontDescriptor().getAscent() / FONTSCALE * fontSize;
>     }
>     else
>     {
>         yTextPos = yTextPos - font.getBoundingBox().getHeight() / FONTSCALE * fontSize;
>         contents.beginText();
>     }
>     contents.newLineAtOffset(contentRect.getLowerLeftX(), yTextPos);
>     contents.showText(options.get(i));
>     if (i - topIndex != (numOptions - 1))
>     {
>         contents.endText();
>     }
> }
> {code}
> When topIndex == 0, the last 'if' clause makes sense: it avoids calling 
> endText for the last option, because the private method 
> insertGeneratedAppearance, which calls insertGeneratedListboxAppearance, 
> later calls endText once more. If topIndex > 0, the condition in this 'if' 
> clause is always true (since i - topIndex can never reach numOptions - 1), 
> so endText is called in every iteration of the 'for' loop; after the method 
> returns and the next endText is called, we receive the exception. 
> If I change that to
> {code:java}
> if (i != (numOptions - 1))
> {
>     contents.endText();
> }
> {code}
> things start to work again.
> 2. A related issue: suppose I have a list box with a top index equal to 0 and 
> too many options for the list box to show all of them at once (as in [^top 
> index 0.pdf]). When I select an option that is not visible with a top index 
> of 0 (to see it, we need to scroll the list box down), PDFBox does not 
> recalculate the top index and sticks with the initial value of 0, unlike 
> Acrobat, which adjusts the top index so that the selected option is visible. 
> I suppose that once item #1 above is fixed (i.e. top index greater than 0), 
> the opposite would also be true (if the top index is, say, 6 and you select 
> the first item, which is not visible unless you scroll the list box up). 
> This makes the field useless if I flatten the document, since there is no 
> way I can see the selected option (see [^top index 0 flattened.pdf]). It is 
> also next to impossible to see the selected option even if I do not flatten 
> it (see [^top index 0 with index 4 set as selected.pdf]). I would expect 
> PDFBox to recalculate the top index so that at least the first selected 
> option is visible (and, if there are additional selected options, to show 
> more of them when possible).
> 3. When flattening, a drop-down list (PDComboBox) with options that have 
> export values shows only the export values instead of the labels. This is not 
> a problem for the list box (e.g. form: [^test listbox and droplist export 
> values.pdf], after flattening: [^test listbox_and_droplist export values 
> flattened.pdf]). I would expect the drop-down list to behave the same as the 
> list box (i.e. when flattened, it should also show the label instead of the 
> export value).
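
The pairing problem described in item 1 can be reproduced without PDFBox. The following is only a toy model under stated assumptions: FakeContentStream and render are hypothetical names invented for this sketch, the stream tracks nothing but whether a text block is open, and the caller is assumed to open one text block before the loop and close one after it, as the report implies. It is not the PDFBox implementation.

```java
/** Toy model of the beginText/endText pairing from the report; not PDFBox code. */
class ListboxTextPairingSketch {

    /** Hypothetical stand-in for a content stream that only tracks text-block state. */
    static final class FakeContentStream {
        private boolean inText;

        void beginText() {
            if (inText) {
                throw new IllegalStateException("nested beginText");
            }
            inText = true;
        }

        void endText() {
            if (!inText) {
                throw new IllegalStateException("You must call beginText() before calling endText.");
            }
            inText = false;
        }
    }

    /**
     * Replays the loop from the report and returns true if the caller's final
     * endText() succeeds. useFixedCondition selects the reporter's proposed check.
     */
    static boolean render(int topIndex, int numOptions, boolean useFixedCondition) {
        FakeContentStream contents = new FakeContentStream();
        contents.beginText(); // assumption: the caller opens the first text block
        for (int i = topIndex; i < numOptions; i++) {
            if (i != topIndex) {
                contents.beginText();
            }
            boolean notLast = useFixedCondition
                    ? i != numOptions - 1              // proposed condition
                    : i - topIndex != numOptions - 1;  // condition quoted in the report
            if (notLast) {
                contents.endText();
            }
        }
        try {
            contents.endText(); // assumption: the caller closes the last text block
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(render(3, 6, false)); // prints false: unbalanced endText
        System.out.println(render(3, 6, true));  // prints true: calls are balanced
    }
}
```

With topIndex 0 both conditions behave the same, which is consistent with the report that the exception only shows up for a top index greater than 0.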




[jira] [Commented] (PDFBOX-4252) PDChoice related bugs and issues

2018-09-24 Thread Maruan Sahyoun (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626220#comment-16626220
 ] 

Maruan Sahyoun commented on PDFBOX-4252:


AFAIK Acrobat advances the position so that the first selected item is visible 
if it would not have been otherwise (I would need to verify this). So we should 
do the same.

[~tilman] Wouldn't it make sense to have individual appearance handlers for 
form elements? That would make the code more readable. Currently everything is 
interwoven and not very clean. We could introduce them step by step for 
individual widget types.


[jira] [Commented] (PDFBOX-4252) PDChoice related bugs and issues

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626201#comment-16626201
 ] 

Tilman Hausherr commented on PDFBOX-4252:
-

Re (2), I wonder whether we need to add logic for that: the user could easily 
make the first selected item the top value by calling {{setTopIndex()}}. 
If we do something ourselves, what should we do? Make the first selected item 
visible? Or change something only if no selected item is visible? 
Include it by default or as an extra method? Or just mention it in the javadoc?
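
One of the strategies listed above (change the top index only when the first selected item is not visible) could look roughly like the sketch below. Everything in it is an assumption made for illustration: adjustTopIndex and visibleRows are hypothetical names, and the thread does not say how the number of visible rows would be derived from the widget. It is not existing PDFBox behaviour.

```java
/** Hypothetical helper for the "adjust only if not visible" strategy; not PDFBox code. */
class TopIndexSketch {

    /**
     * Returns a top index such that firstSelected falls inside a window of
     * visibleRows rows. The current top index is kept when the item is already
     * visible or when the inputs are not usable.
     */
    static int adjustTopIndex(int topIndex, int firstSelected, int visibleRows) {
        if (visibleRows <= 0 || firstSelected < 0) {
            return topIndex; // nothing sensible to compute
        }
        if (firstSelected < topIndex) {
            return firstSelected; // selected item is above the window: scroll up
        }
        if (firstSelected >= topIndex + visibleRows) {
            return firstSelected - visibleRows + 1; // below the window: scroll down
        }
        return topIndex; // already visible
    }

    public static void main(String[] args) {
        System.out.println(adjustTopIndex(0, 4, 3)); // prints 2
        System.out.println(adjustTopIndex(6, 0, 3)); // prints 0
        System.out.println(adjustTopIndex(0, 1, 3)); // prints 0
    }
}
```

A caller could pass the result to the setter mentioned above before the appearance is generated. Whether that should happen by default or in an extra method is the open question in this comment.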


[jira] [Commented] (PDFBOX-4252) PDChoice related bugs and issues

2018-09-24 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626193#comment-16626193
 ] 

ASF subversion and git services commented on PDFBOX-4252:
-

Commit 1841882 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1841882 ]

PDFBOX-4252: clarify name

> PDChoice related bugs and issues
> 
>
> Key: PDFBOX-4252
> URL: https://issues.apache.org/jira/browse/PDFBOX-4252
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.10
>Reporter: Xin Lin
>Priority: Major
>  Labels: Appearance
> Attachments: test listbox and droplist export values.pdf, test 
> listbox_and_droplist export values flattened.pdf, top index 0 flattened.pdf, 
> top index 0 with index 4 set as selected.pdf, top index 0.pdf, top index 3.pdf
>
>
> There are several issues related to either PDListBox or PDComboBox that are 
> still not fixed in 2.0.10, I am going to put them in this one case, let me 
> know if they need to be broken into separate tickets. Thanks.
> 1. When I attempt to set value to a PDListBox whose 'Top Index' is greater 
> than 0 (as in [^top index 3.pdf]), I always receive the following exception:
> {noformat}
> java.lang.IllegalStateException: Error: You must call beginText() before 
> calling endText.
>   at 
> org.apache.pdfbox.pdmodel.PDPageContentStream.endText(PDPageContentStream.java:381)
>  ~[pdfbox-2.0.10.jar:2.0.10]
> ...
> {noformat}
> I tracked it down to a for loop at the end of the private method 
> insertGeneratedListboxAppearance in the class AppearanceGeneratorHelper:
> {code:java}
>  for (int i = topIndex; i < numOptions; i++)
> {
>
> if (i == topIndex)
> {
> yTextPos = yTextPos - font.getFontDescriptor().getAscent() / 
> FONTSCALE * fontSize;
> }
> else
> {
> yTextPos = yTextPos - font.getBoundingBox().getHeight() / 
> FONTSCALE * fontSize;
> contents.beginText();
> }
> contents.newLineAtOffset(contentRect.getLowerLeftX(), yTextPos);
> contents.showText(options.get(i));
> if (i - topIndex != (numOptions - 1))
> {
> contents.endText();
> }
> }
> {code}
> The last 'if' clause, when topIndex == 0, this makes sense, which is to NOT 
> call endText if we are at the last option because the private method 
> insertGeneratedAppearance which calls insertGeneratedListboxAppearance would 
> later call endText once again. If topIndex > 0, the condition in this 'if' 
> clause would always be true (since i can never be greater than numOptions - 
> 1), as a result, endText is called every time in this 'for' loop, so after 
> the method returns and the next endText is called, we receive the exception. 
> If I change that to
> {code:java}
>  if (i != (numOptions - 1))
> {
> contents.endText();
> }
> {code}
> things would start to work again.
> 2. A related issue: suppose I have a list box with a top index of 0 and too 
> many options to show all of them at once (as in [^top index 0.pdf]). When I 
> select an option that is not visible at top index 0 (to see it, we would need 
> to scroll the list box down), Acrobat adjusts the top index so that the 
> selected option becomes visible, whereas PDFBox does not recalculate the top 
> index and sticks with the initial value of 0. I suppose that once item #1 
> above is fixed (i.e. top index greater than 0), the opposite case would show 
> the same problem (e.g. the top index is 6 and you select the first item, which 
> is not visible unless you scroll the list box up). This makes the result 
> useless if I flatten the document, since there is no way to see the selected 
> option (see [^top index 0 flattened.pdf]). It is also next to impossible to 
> see the selected option even if I do not flatten it (see [^top index 0 with 
> index 4 set as selected.pdf]). I would expect PDFBox to recalculate the top 
> index so that at least the first selected option is visible (and, if there are 
> additional selected options, show more of them when possible).
> 3. When flattening, a drop-down list (PDComboBox) whose options have export 
> values shows only the export values instead of the labels. This is not a 
> problem for the list box (e.g. form: [^test listbox and droplist export 
> values.pdf], after flattening: [^test listbox_and_droplist export values 
> flattened.pdf]). I would expect the drop-down list to behave the same as the 
> list box, i.e. when flattened, it should also show the label instead of the 
> export value.
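
The endText imbalance described in item 1 can be reproduced without PDFBox. The following is only a minimal sketch: TextState is a hypothetical stand-in for the begin/end pairing rule of a content stream (it is not the PDFBox API), and it assumes, as the report states, that the caller opens the text object before the loop and closes it after the loop.

```java
final class ListBoxTextBalance {

    // Hypothetical stand-in for the text-object state of a content stream.
    // It only enforces that beginText() and endText() calls are paired.
    static final class TextState {
        private boolean inText;

        void beginText() {
            if (inText) {
                throw new IllegalStateException("Error: Nested beginText() calls are not allowed.");
            }
            inText = true;
        }

        void endText() {
            if (!inText) {
                throw new IllegalStateException(
                        "Error: You must call beginText() before calling endText.");
            }
            inText = false;
        }
    }

    // Replays only the beginText/endText calls of the loop from the report.
    // Returns true if the caller's final endText() succeeds.
    static boolean callerEndTextSucceeds(int topIndex, int numOptions, boolean useProposedCondition) {
        TextState contents = new TextState();
        contents.beginText(); // assumption: the calling method opens the text object

        for (int i = topIndex; i < numOptions; i++) {
            if (i != topIndex) {
                contents.beginText();
            }
            // Original condition compares the offset from topIndex,
            // the proposed one compares the absolute index.
            boolean keepOpen = useProposedCondition
                    ? i == numOptions - 1
                    : i - topIndex == numOptions - 1;
            if (!keepOpen) {
                contents.endText();
            }
        }

        try {
            contents.endText(); // assumption: the calling method closes the text object
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(callerEndTextSucceeds(0, 5, false)); // true
        System.out.println(callerEndTextSucceeds(3, 5, false)); // false
        System.out.println(callerEndTextSucceeds(3, 5, true));  // true
    }
}
```

With topIndex 3 and five options the original condition closes the text object on every iteration, so the caller's endText fails; comparing against the absolute index leaves the last text object open for the caller.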



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PDFBOX-4252) PDChoice related bugs and issues

2018-09-24 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626194#comment-16626194
 ] 

ASF subversion and git services commented on PDFBOX-4252:
-

Commit 1841883 from til...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1841883 ]

PDFBOX-4252: clarify name

> PDChoice related bugs and issues
> 
>
> Key: PDFBOX-4252
> URL: https://issues.apache.org/jira/browse/PDFBOX-4252
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.10
>Reporter: Xin Lin
>Priority: Major
>  Labels: Appearance
> Attachments: test listbox and droplist export values.pdf, test 
> listbox_and_droplist export values flattened.pdf, top index 0 flattened.pdf, 
> top index 0 with index 4 set as selected.pdf, top index 0.pdf, top index 3.pdf
>
>





Re: [VOTE] Release Apache PDFBox JBIG2 ImageIO 3.0.2

2018-09-24 Thread Maruan Sahyoun
+1

Maruan
 
> Hi,
> 
> a candidate for the PDFBox JBIG2 ImageIO 3.0.2 release is available at:
> 
>  https://dist.apache.org/repos/dist/dev/pdfbox/jbig2-imageio-3.0.2/
> 
> The release candidate is a zip archive of the sources in:
> 
>  https://github.com/apache/pdfbox-jbig2/tree/jbig2-imageio-3.0.2/
> 
> The SHA-512 checksum of the archive is 
> 9a89ebefc13d23ec1b5787f836764b4d9f8793b08f4f5ff3c3fbb310b6b033dd880dac6f3830ab95e086c9efa07434a43fa0d30587b7cb4c1edb4a1ef017f5fe.
> 
> Please vote on releasing this package as Apache PDFBox JBIG2 ImageIO 3.0.2.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 PDFBox PMC votes are cast.
> 
>  [ ] +1 Release this package as Apache PDFBox JBIG2 ImageIO 3.0.2
>  [ ] -1 Do not release this package because...
> 
> Here is my +1
> 
> Andreas
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahy...@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827


-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



Jenkins build is back to normal : PDFBox-trunk #4217

2018-09-24 Thread Apache Jenkins Server
See 





[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626164#comment-16626164
 ] 

ASF subversion and git services commented on PDFBOX-4322:
-

Commit 1841879 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1841879 ]

PDFBOX-4322: add test file

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: PDFBOX-4322-Empty-ToUnicode-reduced.pdf, pdf__1.pdf, 
> pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Commented] (PDFBOX-3919) Infinite loop while parsing (2)

2018-09-24 Thread JIRA


[ 
https://issues.apache.org/jira/browse/PDFBOX-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626143#comment-16626143
 ] 

Andreas Lehmkühler commented on PDFBOX-3919:


{quote}Is it possible to get a CVE assigned to this issue?{quote}
I'm not fond of creating CVEs for issues that have been known for years. But we 
fixed CVE-2018-8036 in 1.8.15 and 2.0.11 respectively, so you should have at 
least one argument for updating PDFBox to a recent version. HTH

> Infinite loop while parsing (2)
> ---
>
> Key: PDFBOX-3919
> URL: https://issues.apache.org/jira/browse/PDFBOX-3919
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.13, 2.0.7
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 1.8.14, 2.0.8, 3.0.0 PDFBox
>
>
> See linked article by [~hanno] - we're affected.






Fwd: Re: [VOTE] Release Apache PDFBox JBIG2 ImageIO 3.0.2

2018-09-24 Thread Andreas Lehmkuehler





 Forwarded Message 
Subject: Re: [VOTE] Release Apache PDFBox JBIG2 ImageIO 3.0.2
Date: Mon, 24 Sep 2018 09:23:44 +0200
From: Jörg Henne 
To: Andreas Lehmkuehler 


On 22.09.2018 at 17:54, Andreas Lehmkuehler wrote:

Hi,

a candidate for the PDFBox JBIG2 ImageIO 3.0.2 release is available at:

https://dist.apache.org/repos/dist/dev/pdfbox/jbig2-imageio-3.0.2/

The release candidate is a zip archive of the sources in:

https://github.com/apache/pdfbox-jbig2/tree/jbig2-imageio-3.0.2/

The SHA-512 checksum of the archive is 
9a89ebefc13d23ec1b5787f836764b4d9f8793b08f4f5ff3c3fbb310b6b033dd880dac6f3830ab95e086c9efa07434a43fa0d30587b7cb4c1edb4a1ef017f5fe.
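
For anyone checking the candidate before voting, the published SHA-512 value can be compared against a locally computed one. This is a minimal sketch using only the JDK; the class name Sha512Check and its command-line arguments are illustrative assumptions, not part of any PDFBox tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha512Check {

    // Returns the SHA-512 digest of the given bytes as lowercase hex.
    static String sha512Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-512");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 is not available in this JRE", e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Compares the digest of a downloaded archive with the announced value.
    // Reads the whole file into memory, which is fine for a small source zip.
    static boolean matches(Path archive, String expectedHex) throws IOException {
        return sha512Hex(Files.readAllBytes(archive)).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java Sha512Check.java <archive> <expected-sha512-hex>");
            return;
        }
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(ok ? "checksum OK" : "checksum MISMATCH");
    }
}
```

The comparison ignores case and surrounding whitespace, since checksums are often pasted from mail with a trailing period or line break that should be removed first.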


Please vote on releasing this package as Apache PDFBox JBIG2 ImageIO 
3.0.2.

The vote is open for the next 72 hours and passes if a majority of at
least three +1 PDFBox PMC votes are cast.

    [ ] +1 Release this package as Apache PDFBox JBIG2 ImageIO 3.0.2
    [ ] -1 Do not release this package because...

+1!




[jira] [Updated] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4322:

Attachment: PDFBOX-4322-Empty-ToUnicode-reduced.pdf

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: PDFBOX-4322-Empty-ToUnicode-reduced.pdf, pdf__1.pdf, 
> pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Comment Edited] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626076#comment-16626076
 ] 

Tim Allison edited comment on PDFBOX-4322 at 9/24/18 4:12 PM:
--

Thank you, [~tilman]! I'm happy to run the regression tests if there's any 
interest in moving this into the next release.


was (Author: talli...@mitre.org):
I'm happy to run the regression tests if there's any interest in moving this 
into the next release.

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626076#comment-16626076
 ] 

Tim Allison commented on PDFBOX-4322:
-

I'm happy to run the regression tests if there's any interest in moving this 
into the next release.

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






Build failed in Jenkins: PDFBox-trunk #4216

2018-09-24 Thread Apache Jenkins Server
See 

--
[...truncated 8.22 KB...]
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:937)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:864)
at hudson.scm.SCM.checkout(SCM.java:504)
at 
hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at 
jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1794)
at 
hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
at 
hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Caused: java.io.IOException
at 
hudson.scm.subversion.UpdateUpdater$TaskImpl.perform(UpdateUpdater.java:216)
at 
hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:168)
at 
hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:1041)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:1017)
at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:990)
at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2918)
at hudson.remoting.UserRequest.perform(UserRequest.java:212)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:369)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused: java.io.IOException: remote file operation failed: 
 at 
hudson.remoting.Channel@74080f34:H25
at hudson.FilePath.act(FilePath.java:1043)
at hudson.FilePath.act(FilePath.java:1025)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:937)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:864)
at hudson.scm.SCM.checkout(SCM.java:504)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
at hudson.model.Run.execute(Run.java:1794)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:429)
Retrying after 10 seconds
Cleaning up 
ERROR: Failed to update http://svn.apache.org/repos/asf/pdfbox/trunk
org.tmatesoft.svn.core.SVNException: svn: E155032: The pristine text with 
checksum '$sha1$0d8a658d1ecd3fa00dd8f46d24a9d2fdef18f3e4' was found in the DB 
but not on disk
at 
org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:70)
at 
org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:57)
at 
org.tmatesoft.svn.core.internal.wc17.db.SvnWcDbPristines.checkPristine(SvnWcDbPristines.java:159)
at 
org.tmatesoft.svn.core.internal.wc17.db.SvnWcDbPristines.getPristinePath(SvnWcDbPristines.java:184)
at 
org.tmatesoft.svn.core.internal.wc17.db.SVNWCDb.getPristinePath(SVNWCDb.java:1724)
at 
org.tmatesoft.svn.core.internal.wc17.SVNWCContext.isTextModified(SVNWCContext.java:754)
at 
org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.assembleStatus(SVNStatusEditor17.java:356)
at 
org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.sendStatusStructure(SVNStatusEditor17.java:216)
at 
org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.getDirStatus(SVNStatusEditor17.java:742)
at 
org.tmatesoft.svn.core.internal.wc17.SVNStatusEditor17.walkStatus(SVNStatusEditor17.java:665)
at 
org.tmatesoft.svn.core.internal.wc2.ng.SvnNgGetStatus.run(SvnNgGetStatus.java:132)
at 
org.tmatesoft.svn.core.internal.wc2.ng.SvnNgGetStatus.run(SvnNgGetStatus.java:27)
at 
org.tmatesoft.svn.core.internal.wc2.ng.SvnNgOperationRunner.run(SvnNgOperationRunner.java:20)
at 
org.tmatesoft.svn.core.internal.wc2.SvnOperationRunner.run(SvnOperationRunner.java:21)
at 

[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626028#comment-16626028
 ] 

Tilman Hausherr commented on PDFBOX-4322:
-

Fixed for the trunk only for now and targeted for the release after next, 
because this should first be tested in the regression tests.

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Updated] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4322:

Affects Version/s: 2.0.11

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625980#comment-16625980
 ] 

ASF subversion and git services commented on PDFBOX-4322:
-

Commit 1841866 from til...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1841866 ]

PDFBOX-4322: treat identity ToUnicode streams that are empty as identity

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Updated] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4322:

Fix Version/s: 2.0.13

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2, 2.0.11
>Reporter: Amit Maheshwari
>Priority: Major
> Fix For: 2.0.13
>
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tilman Hausherr (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625959#comment-16625959
 ] 

Tilman Hausherr commented on PDFBOX-4322:
-

No, it's a problem with the ToUnicode stream. It is empty. However it has the 
name Identity-H so I'll try to use that.
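
The idea of falling back to the CMap name can be sketched roughly as follows. This is not the actual PDFBox change; the class ToUnicodeFallback, its method, and the plain Map used in place of a parsed CMap are hypothetical and only illustrate the decision: an empty ToUnicode mapping whose name starts with "Identity-" is treated as an identity mapping from character code to Unicode code point.

```java
import java.util.Collections;
import java.util.Map;

final class ToUnicodeFallback {

    // Returns the Unicode text for a character code, or null if unknown.
    // 'mappings' stands in for the entries parsed from the ToUnicode stream,
    // 'cmapName' for the name declared in that stream (may be null).
    static String toUnicode(int code, String cmapName, Map<Integer, String> mappings) {
        String mapped = mappings.get(code);
        if (mapped != null) {
            return mapped;
        }
        boolean namedIdentity = cmapName != null && cmapName.startsWith("Identity-");
        if (mappings.isEmpty() && namedIdentity && Character.isValidCodePoint(code)) {
            // Empty stream but identity name: use the code itself as the code point.
            return new String(Character.toChars(code));
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Integer, String> empty = Collections.emptyMap();
        System.out.println(toUnicode(0x41, "Identity-H", empty)); // A
        System.out.println(toUnicode(0x41, "SomeOtherCMap", empty)); // null
    }
}
```

The fallback is applied only when the mapping is completely empty, so a ToUnicode stream that does contain entries keeps its usual behaviour for codes it does not cover.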

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2
>Reporter: Amit Maheshwari
>Priority: Major
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Commented] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625906#comment-16625906
 ] 

Tim Allison commented on PDFBOX-4322:
-

I haven't had a chance to try this with pure PDFBox yet, but I can confirm that 
we're not getting the info in Tika 1.19: [^pdf__1.pdf.xml]. We do try to 
process the AcroForms and XFA (this doc doesn't appear to have XFA), so perhaps 
we're not doing it right?

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2
>Reporter: Amit Maheshwari
>Priority: Major
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Updated] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-4322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-4322:

Attachment: pdf__1.pdf.xml

> Extract Text feature is not working for some part of PDF
> 
>
> Key: PDFBOX-4322
> URL: https://issues.apache.org/jira/browse/PDFBOX-4322
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.2
>Reporter: Amit Maheshwari
>Priority: Major
> Attachments: pdf__1.pdf, pdf__1.pdf.xml
>
>
> Text Extraction feature cannot extract text from attached pdf properly.
>  
> Text inside of rectangle box (e.g value of Lending Specialist and others) is 
> not getting extracted.






[jira] [Created] (PDFBOX-4322) Extract Text feature is not working for some part of PDF

2018-09-24 Thread Amit Maheshwari (JIRA)
Amit Maheshwari created PDFBOX-4322:
---

 Summary: Extract Text feature is not working for some part of PDF
 Key: PDFBOX-4322
 URL: https://issues.apache.org/jira/browse/PDFBOX-4322
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.2
Reporter: Amit Maheshwari
 Attachments: pdf__1.pdf

Text Extraction feature cannot extract text from attached pdf properly.

 

Text inside of rectangle box (e.g value of Lending Specialist and others) is 
not getting extracted.


