from:"Tilman Hausherr"



[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327
 ] 

Tilman Hausherr commented on PDFBOX-5879:
-

I added a simple test for the feature because it turns out we didn't have any. 
However, this isn't a test for the fixed bug: creating a file for that would 
have been more difficult, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
> {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Comment Edited] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327
 ] 

Tilman Hausherr edited comment on PDFBOX-5879 at 9/17/24 9:08 AM:
--

I added a simple test for the rotationMagic feature because it turns out we 
didn't have any. However, this isn't a test for the fixed bug: creating a file 
for that would have been more difficult, and there is no risk that this fix 
gets reverted anyway.


was (Author: tilman):
I added a simple test for the feature because it turns out we didn't have any. 
However this isn't a test of the fixed bug, that would have been more difficult 
to create a file, and there is no risk that this fix gets reverted anyway.


[jira] [Resolved] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5879.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

Thank you. It's not the commit, it's poor programming that got exposed because 
of the commit.
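
The stack trace in the report shows a COSObject (an indirect object reference) being cast directly to COSArray. The code at ExtractText.java:336 is not quoted in this thread, so the following is only a minimal sketch of that class of failure. It uses hypothetical stand-in classes (Base, Array, Stream, Ref) instead of the real PDFBox types, and it assumes the entry in question can be a single stream, an array of streams, or a reference to either.

```java
import java.util.ArrayList;
import java.util.List;

public class IndirectCastSketch {
    // Hypothetical, simplified stand-ins for PDFBox's COSBase, COSArray,
    // COSStream and COSObject. They are NOT the real classes.
    static class Base {}

    static class Array extends Base {
        final List<Base> items = new ArrayList<>();
    }

    static class Stream extends Base {}

    /** An indirect reference that wraps the object it points to. */
    static class Ref extends Base {
        final Base target;

        Ref(Base target) {
            this.target = target;
        }
    }

    /** Unsafe: assumes the entry is always a direct array. */
    static int countStreamsUnsafe(Base contents) {
        // Throws ClassCastException when contents is a Ref or a Stream.
        return ((Array) contents).items.size();
    }

    /** Safer: resolve references first, then branch on the actual type. */
    static int countStreamsSafe(Base contents) {
        Base resolved = contents;
        while (resolved instanceof Ref) {
            resolved = ((Ref) resolved).target;
        }
        if (resolved instanceof Array) {
            return ((Array) resolved).items.size();
        }
        return resolved instanceof Stream ? 1 : 0;
    }

    public static void main(String[] args) {
        Array array = new Array();
        array.items.add(new Ref(new Stream()));
        array.items.add(new Ref(new Stream()));
        Base indirect = new Ref(array);

        System.out.println("streams (safe): " + countStreamsSafe(indirect));
        try {
            countStreamsUnsafe(indirect);
        } catch (ClassCastException e) {
            System.out.println("unsafe cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

The point of the sketch is the order of operations: dereference first, then check the type, rather than casting whatever the dictionary lookup returns.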


[jira] [Updated] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5879:

Affects Version/s: 2.0.32


[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882240#comment-17882240
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

Wow!

No regressions.

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
> Issue Type: Wish
> Components: Rendering
> Affects Versions: 2.0.28
> Reporter: Larry Lynn
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI() with 650 DPI.  I realize this is a very high value -
> we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment 
> didn't get through, please upload to a sharehoster or create a ticket. 
> If you need to register then add a meaningful text, e.g. the subject of 
> this post so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
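
The email above notes that CPU and memory usage drop at lower DPI. As a rough illustration of why, the number of pixels to shade grows with the square of the DPI. The sketch below is plain arithmetic, not PDFBox code. The page size of the attached file is not stated in the thread, so US Letter (612 x 792 points) is an assumption, as is the figure of 4 bytes per pixel for the output image.

```java
public class RasterSizeSketch {
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels along one side for a length in points at the given DPI. */
    static long pixels(double lengthPt, double dpi) {
        return Math.round(lengthPt * dpi / POINTS_PER_INCH);
    }

    /** Total pixel count of the rendered page. */
    static long pixelCount(double widthPt, double heightPt, double dpi) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        // Assumed page size (US Letter); the real page size is not given.
        double widthPt = 612;
        double heightPt = 792;
        for (double dpi : new double[] {72, 150, 300, 650}) {
            long count = pixelCount(widthPt, heightPt, dpi);
            long bytes = count * 4; // assumed 4 bytes per pixel
            System.out.printf("%4.0f DPI: %,d px, about %,d MB%n",
                    dpi, count, bytes / 1_000_000);
        }
    }
}
```

Under these assumptions a 650 DPI render has roughly 81 times as many pixels as a 72 DPI render, so any per-pixel shading cost is multiplied by about the same factor.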




[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-15 Thread Tilman Hausherr (Jira)

[
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tilman Hausherr updated PDFBOX-5852:

Description:
We've observed excessive CPU and memory consumption when converting a PDF to
images when the PDF contains type 4 shading. This is especially noticeable
when the conversion is done with a high DPI. Can this be improved?

Conversation from the PDFBox users mailing list follows

Initial email:
{quote}
Hi CPU and memory usage when converting a PDF with type 4 shading

Hello PDFBox users and maintainers,

We have a PDF that causes performance problems when we use PDFBox to
convert it to an image with renderImageWithDPI(). We're calling
renderImageWithDPI() with 650 DPI. I realize this is a very high value -
we're using it for
high fidelity original images that will later be downsampled. On my work
laptop which has fairly strong hardware, the conversion takes 25 minutes
and consumes 20GB of memory. CPU and memory usage is reduced if we use a
lower DPI.

The PDF is 1 page long. It contains type 4 shading / Gouraud free form
triangle meshes. We've been aware of some performance issues with type 4
shading for a little while now, but the PDFs that contained the type 4
shading belonged to our customers and we were not authorized to share
them. We finally found a problem input document that is non-sensitive and
that we are authorized to share. I've attached a copy of the problem PDF
to this email.

I searched the archives for the users and the developers mailing list and I
didn't find anything specifically about this issue.
I searched through the PDFBox jira tickets and I found a couple of tickets
that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most
closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
and our issue still reproduces with PDFBox 2.0.28.

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {quote}
Response:
{quote}
Hi,

Yes shading can be very slow, especially at high dpi. The attachment
didn't get through, please upload to a sharehoster or create a ticket.
If you need to register then add a meaningful text, e.g. the subject of
this post so we know you're not a spammer. Also retry with 2.0.31 and
3.0.2 just to be sure. However I'm pessimistic that this can be fixed.

Tilman {quote}

was:
We've observed excessive CPU and memory consumption when converting a PDF to
images when the PDF contains type 4 shading. This is especially noticeable
when the conversion is done with a high DPI. Can this be improved?

Conversation from the PDFBox users mailing list follows

Initial email:
{code:java}
Hi CPU and memory usage when converting a PDF with type 4 shading

Hello PDFBox users and maintainers,

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {code}
Response:
{code:java}
Hi,

Tilman {code}

> Hi CPU and memory usage when converting a PDF with type 4 shading
> --

[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:35 AM:
--

Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 
The only problem left is that the second multiline field starts a bit too low, 
but IIRC there's another issue about that.


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm
> Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
> Reporter: Joseph Jezerinac
> Priority: Major
> Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>
> After flattening a pdf acro form, value of some fields get blurred
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:31 AM:
--

Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
    acroForm.refreshAppearances();
}
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 


[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5878:

Attachment: PDFBox5878-flattened.pdf
PDFBox5878-saved.pdf


[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879822#comment-17879822
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 9:30 AM:
-

I added this for the missing fonts; it is just a guess that these are the 
correct fonts:
{code:java}
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/timesbd.ttf"), false);
PDFont font3 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPSMT"), font1);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPS-BoldMT"), font2);
acroForm.getDefaultResources().put(COSName.getPDFName("Helvetica"), font3);
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (((PDTextField) field).isMultiline())
        {
            field.setValue("XXX");
        }
    }
}
{code}
But when setting a value, this happens in 
AppearanceGeneratorHelper.setAppearanceContent():
{code}
if (bmcIndex == -1)
{
    // append to existing stream
    writer.writeTokens(tokens);
    writer.writeTokens(COSName.TX, BMC);
}
{code}
So it appends to the existing appearance stream. This is the result after 
calling setValue("XXX"):
{code}
q
Q
q
  9.613575 0.4609071 430.9062 41.31819 re
  W
  n
  q
0.9781767 0 0 -0.9781767 -87.43936 478.0107 cm
BT
  11 0 0 -11 102.2182 458.5622 Tm
  /TT21 1 Tf
  [ (N) -0.2 (a) 0.2 (m) 0.2 (e) 0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d 09/) 
0.2 (26/) 0.2 (2020) ] TJ
ET
  Q
Q
q
  6.43259 0.3084 434.0872 41.6232 re
  W
  n
  q
0.9853977 0 0 0.9853977 9.388783 29.51731 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.9853977 0 0 0.9853977 9.388783 17.51355 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  3.228123 0.1547671 437.2917 41.93047 re
  W
  n
  q
0.992672 0 0 0.992672 6.206139 29.5793 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.992672 0 0 0.992672 6.206139 17.48693 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  0 0 440.5198 42.24 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 1 3 29.64175 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
1 0 0 1 3 17.46011 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
/Tx BMC
  q
-2.252 1 441.7718 40.24 re
W
n
BT
  /TimesNewRomanPSMT 11 Tf
  /DeviceGray cs
  0 sc
  -1.252 25.4319 Td
  (\000;\000;\000;) Tj
ET
  Q
EMC
{code}
So the XXX is there, but also all the previous content.
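
A small stdlib-only model of the behaviour described in this comment may help to see why the old text survives. It is not the PDFBox implementation: tokens are plain strings, the class and method names are hypothetical, and the branch for a stream that already has a BMC marker is an assumption added for contrast. Only the "no BMC found, so append" branch mirrors the snippet quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AppearanceAppendSketch {
    /** Rebuilds a token list around newly generated field content. */
    static List<String> rebuild(List<String> existing, List<String> generated) {
        List<String> out = new ArrayList<>();
        int bmc = existing.indexOf("BMC");
        if (bmc == -1) {
            // No marked-content section: keep every existing token,
            // then open a new /Tx BMC section after them.
            out.addAll(existing);
            out.add("/Tx");
            out.add("BMC");
        } else {
            // Assumed branch: keep only what precedes the section, BMC included.
            out.addAll(existing.subList(0, bmc + 1));
        }
        out.addAll(generated);
        int emc = existing.indexOf("EMC");
        if (emc == -1) {
            out.add("EMC");
        } else {
            out.addAll(existing.subList(emc, existing.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> generated = Arrays.asList("BT", "(XXX)", "Tj", "ET");

        // Appearance stream without a /Tx BMC ... EMC section, as in this PDF.
        List<String> unmarked = Arrays.asList("q", "BT", "(old text)", "Tj", "ET", "Q");
        System.out.println(rebuild(unmarked, generated));

        // Appearance stream that already has a marked section.
        List<String> marked = Arrays.asList("/Tx", "BMC", "BT", "(old text)", "Tj", "ET", "EMC");
        System.out.println(rebuild(marked, generated));
    }
}
```

In the first case the old text stays in the output next to the new value, which matches the dump above. Setting the widget appearance to null before regenerating avoids this because there is no existing stream left to append to.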


was (Author: tilman):
I added this for the missing fonts, which is just a guess that it's the correct 
font
{code:java}
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new 
FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2

[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879796#comment-17879796
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 8:00 AM:
-

There are so many things wrong with this PDF that I don't see a specific 
solution. I'm doing this just for fun. I was able to fix some of the fields 
(e.g. Last1) but not yet all (e.g. the multiline fields and some others), for 
some unknown reason. (I added the missing fonts to the default resources) Not 
all appearances are redrawn. Either there's a bug in my code or there is 
something in our code that skips the recreation of the appearances and I forgot 
about it.

It's not even recreated when changing the value to something else?!


was (Author: tilman):
There are so many things wrong with this PDF that I don't see a specific 
solution. I'm doing this just for fun. I was able to fix some of the fields 
(e.g. Last1) but not yet all (e.g. the multiline fields and some others), for 
some unknown reason. (I added the missing fonts to the default resources) Not 
all appearances are redrawn. Either there's a bug in my code or there is 
something in our code that skips the recreation of the appearances and I forgot 
about it.


[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879753#comment-17879753
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 4:04 AM:
-

I could try calling getValue() and setValue() on the text fields and see whether it 
looks better when PDFBox recreates the appearances. These fields have a value 
that makes sense. I'm just wondering whether this person will have legal 
disadvantages if the file is refused? (Although I doubt that the content of 
field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH 
it's from 22.2 so it may already have been decided in some way.


was (Author: tilman):
I could try to getValue() and setValue() on the text fields and see whether it 
looks better when PDFBox recreates the appearances. These fields have a value 
that makes sense. I'm just wondering whether this person will have legal 
disadvantages if the file is refused? (Although I doubt that the content of 
field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH 
it's from 22.2 so it may already have been processed.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> beforeFlattening.pdf, flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879480#comment-17879480
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/5/24 8:16 AM:
-

{code}
q
Q
q
  9.469598 0.4248199 206.7517 18.55036 re
  W
  n
  q
0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT21 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  6.360067 0.2853218 209.8612 18.82936 re
  W
  n
  q
0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT18 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  3.203769 0.1437257 213.0175 19.11255 re
  W
  n
  q
0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT19 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  0 0 216.2213 19.4 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 -1 -68.0727 703.247 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT20 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
{code}
The text appears 3 times at slightly different positions in this appearance 
stream.
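
The "slightly different positions" can be checked by hand. The following is a minimal sketch, assuming nothing else in the appearance stream changes the current transformation matrix: it applies each `cm` matrix from the listing above to the translation part of the shared `Tm` matrix. The class and method names are made up for illustration and are not PDFBox API.

```java
// Sketch only: applies a PDF "cm" matrix to the translation part of a "Tm"
// matrix to see where the text origin lands in the form's coordinate space.
class TextOrigin {
    /** Transforms the point (x, y) by the PDF matrix [a b c d e f]. */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    public static void main(String[] args) {
        // Translation of "11 0 0 -11 71.0727 696.6206 Tm", identical in every block.
        double tx = 71.0727;
        double ty = 696.6206;
        // The four "cm" matrices from the appearance stream, in order of appearance.
        double[][] cms = {
            {0.9562042, 0, 0, -0.9562042, -55.6218, 672.8725},
            {0.9705854, 0, 0, -0.9705854, -59.7103, 682.8466},
            {0.9851829, 0, 0, -0.9851829, -63.86029, 692.9707},
            {1, 0, 0, -1, -68.0727, 703.247}
        };
        for (double[] cm : cms) {
            double[] p = transform(cm, tx, ty);
            System.out.printf("x = %.3f, y = %.3f%n", p[0], p[1]);
        }
    }
}
```

Each block draws the same string with a different scale and offset, so the printed origins should differ from block to block, which is consistent with overlapping copies that look blurred once flattened.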


was (Author: tilman):
{code}
q
Q
q
  9.469598 0.4248199 206.7517 18.55036 re
  W
  n
  q
0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT21 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  6.360067 0.2853218 209.8612 18.82936 re
  W
  n
  q
0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT18 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  3.203769 0.1437257 213.0175 19.11255 re
  W
  n
  q
0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT19 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  0 0 216.2213 19.4 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 -1 -68.0727 703.247 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT20 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
{code}
The text appears 3 times.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> beforeFlattening.pdf, flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Reopened] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-04 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-5876:
-

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878964#comment-17878964
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Yeah!! There's a log message, so it means you also disabled or disregarded logs 
:-(

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache())
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Comment Edited] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr edited comment on PDFBOX-5877 at 9/3/24 5:55 PM:
-

What's this?
{code}
pdDocument.setResourceCache(new PdfResourceCache())
{code}
We have no class {{PdfResourceCache}}.


was (Author: tilman):
What's this?

pdDocument.setResourceCache(new PdfResourceCache())



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache())
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

What's this?

pdDocument.setResourceCache(new PdfResourceCache())



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache())
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878960#comment-17878960
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Are you sure you used 3.0.3 and not 3.0.2? I just tried with the trunk and 
3.0.4-SNAPSHOT with our test and I got only invisible differences (yours are 
clearly visible and occur because all fonts are lost in the PDF).

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache())
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878879#comment-17878879
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

No... I used -Xmx4G for a production project.

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878846#comment-17878846
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

Are you sure you are using the new version? You have to build it yourself or 
wait until a new snapshot build is available. This time, instead of using 
PDFDebugger, I tried your code as is with a locally built 3.0.4-SNAPSHOT and it 
did work with -Xmx600m (also with 550, but not with 500).

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Resolved] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5876.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Affects Version/s: 2.0.32

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Component/s: Rendering

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878835#comment-17878835
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

The JPX image in that file is 7020 x 4964, which is quite big, and -Xmx600m is 
quite low. But I noticed that the subsampling parameter wasn't used when 
reading the JPX image the second time, which was the cause of the OOM. (JPX 
images have to be read twice because of some weirdness in the specification.) 
It should work now; I tried it with PDFDebugger, which doesn't allow setting a 
temp cache.
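
To put "quite big" and "quite low" into numbers, here is a rough back-of-the-envelope sketch. The 4 bytes per pixel and the subsampling factor of 2 are assumptions for illustration, not measurements of what PDFBox or the JPEG 2000 decoder actually allocates, and the class name is made up.

```java
// Rough estimate only; the bytes-per-pixel value is an assumption, not a
// measurement of what PDFBox or the JPEG 2000 decoder really allocates.
class JpxMemoryEstimate {
    /** Estimated size in bytes of one decoded raster at the given subsampling. */
    static long decodedBytes(int width, int height, int subsampling, int bytesPerPixel) {
        // Subsampling by n keeps roughly every n-th pixel in each direction.
        long w = (width + subsampling - 1) / subsampling;
        long h = (height + subsampling - 1) / subsampling;
        return w * h * bytesPerPixel;
    }

    public static void main(String[] args) {
        int width = 7020;   // image size quoted in the comment above
        int height = 4964;
        long mib = 1024L * 1024L;
        long full = decodedBytes(width, height, 1, 4);
        long half = decodedBytes(width, height, 2, 4);
        System.out.printf("full size:     %d MiB%n", full / mib);
        System.out.printf("subsampling 2: %d MiB%n", half / mib);
        // Heap limit the JVM was actually started with (for example via -Xmx600m).
        System.out.printf("max heap:      %d MiB%n", Runtime.getRuntime().maxMemory() / mib);
    }
}
```

Under these assumptions a single full-size copy already takes a noticeable share of a 600 MB heap, so a second read that ignores subsampling could plausibly tip it over.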

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Updated] (PDFBOX-5875) using font data to process ligatures



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5875:

Fix Version/s: (was: 3.0.4 PDFBox)

> using font data to process ligatures
> 
>
> Key: PDFBOX-5875
> URL: https://issues.apache.org/jira/browse/PDFBOX-5875
> Project: PDFBox
>  Issue Type: New Feature
>  Components: Parsing, PDModel, Text extraction
>Affects Versions: 3.0.3 PDFBox
>Reporter: Manish S N
>Priority: Major
>  Labels: Asian, CIDFont, font, ligatures, unicodemapping
> Attachments: page.pdf
>
>
> To process ligatures from Asian languages (where a glyph is the combination 
> of two Unicode characters) using the data in embedded fonts.
>  
> *The problem:*
> Currently, modern PDF creators put these ligatures in the /ActualText field, 
> which we only recently considered supporting in this issue. But this is not 
> the case in old PDFs with embedded CID fonts like [^page.pdf], where the 
> glyphs of ligatures lack a /ToUnicode character mapping because there is no 
> single Unicode code point for them: they are combinations of more than one 
> Unicode character. 
>  
> *The Potential Solution (if not perfect):* 
> I managed to extract the font files using PDFBox 
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
>  and when I viewed the font files using FontForge I found the data about 
> ligatures intact in them. So we can use this data to map the glyphs that are 
> ligatures to the Unicode values of their constituent glyphs.
>  
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all, 
> having been removed by a PDF optimiser because they are never used directly 
> in the PDF apart from in ligatures. Such glyphs are empty, with only a glyph 
> ID and no /ToUnicode mapping, even if that particular glyph has a 
> corresponding Unicode character.
>  
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could 
> easily rectify it. Some comprehension is better than no comprehension when 
> it comes to dealing with data. This would greatly enhance the parsing of 
> non-Latin Asian languages.
>  
> (The PDF sample I attached is in Tamil.)
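
The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not PDFBox code: the class, the two maps and the glyph IDs are hypothetical, and it assumes the ligature components have already been read from the embedded font (for example from its GSUB table, which is an assumption about where that data lives).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not PDFBox API: resolves a ligature glyph to text by
// concatenating the Unicode values of its component glyphs.
class LigatureResolver {
    private final Map<Integer, String> glyphToUnicode;        // glyph ID -> Unicode text
    private final Map<Integer, List<Integer>> ligatureParts;  // ligature glyph ID -> component glyph IDs

    LigatureResolver(Map<Integer, String> glyphToUnicode,
                     Map<Integer, List<Integer>> ligatureParts) {
        this.glyphToUnicode = glyphToUnicode;
        this.ligatureParts = ligatureParts;
    }

    /** Returns the text for a glyph, or null if it cannot be resolved. */
    String resolve(int glyphId) {
        String direct = glyphToUnicode.get(glyphId);
        if (direct != null) {
            return direct;
        }
        List<Integer> parts = ligatureParts.get(glyphId);
        if (parts == null) {
            return null;
        }
        StringBuilder text = new StringBuilder();
        for (int part : parts) {
            String s = glyphToUnicode.get(part);
            if (s == null) {
                // The "Problems" case: a component glyph has no Unicode mapping.
                return null;
            }
            text.append(s);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> unicode = new HashMap<>();
        unicode.put(10, "\u0B95"); // TAMIL LETTER KA (glyph ID 10 is made up)
        unicode.put(11, "\u0BBF"); // TAMIL VOWEL SIGN I (glyph ID 11 is made up)
        Map<Integer, List<Integer>> ligatures = new HashMap<>();
        ligatures.put(42, Arrays.asList(10, 11)); // glyph 42 = ligature of 10 + 11
        LigatureResolver resolver = new LigatureResolver(unicode, ligatures);
        System.out.println(resolver.resolve(42));
    }
}
```

Returning null for an unresolvable component keeps the failure visible to the caller instead of emitting partial text; ligatures built from other ligatures would need a recursive lookup, which this sketch leaves out.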




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878089#comment-17878089
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Yes. But consider that Adobe didn't do it and they're smarter than us; I just 
tried copy / paste and save as text. The ligature data in fonts is meant to 
be used when creating PDFs; I don't know whether it would work for extraction.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file obtained from PDFBox shows 
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are 
> extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "save as text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> ----
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude 
> that the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's 
> text extraction to confirm the problem is not with the PDF. I want to extract 
> this multilingual text only to run pattern matching on it; I do not need to 
> display the content, only to know whether the pattern is present or not (say, 
> if the problem is with fonts and glyphs).
>  




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/30/24 11:50 AM:
---

Please create a new ticket for the file you just added because this is a 
different problem (only if you manage to extract this properly from Adobe 
Reader).


was (Author: tilman):
Please create a new ticket for the file you just added because this is a 
different problem.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results:
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "Save as Text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are 
> extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "Save as Text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe Reader different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to the text in a PDF file and 
> report matches without extracting the text first (say, if the problem is with 
> fonts and glyphs)?
> ---
> My use case, FYI:
> I am trying to extract text from files and run pattern matching on it. I am 
> using Apache Tika to parse documents. I noticed a problem with the extracted 
> PDF text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika_, and tested with 
> Adobe Reader's text export to confirm that the problem is not with the PDF. 
> I want to extract this multilingual text only to run pattern matching on it; 
> I do not need to display the content, only to know whether the pattern is 
> present or not (say, if the problem is with fonts and glyphs).
>  




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Please create a new ticket for the file you just added because this is a 
different problem.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>




[jira] [Resolved] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5874.
-
  Assignee: Tilman Hausherr
Resolution: Fixed

Thank you, you're right, there's no need to warn about something that harmless.

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Thomas Hoffmann
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> IMHO the message is informational and not necessarily a warning. It just 
> tells me that the cache is being rebuilt.
> It would be great if you could consider setting these messages to info level.
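
On releases that still log these messages at WARN level, one possible workaround is to raise the threshold for that single logger. The following is only a sketch under two assumptions that the ticket does not confirm: that PDFBox's logging facade ends up in java.util.logging (the usual fallback when no other backend is on the classpath), and that the logger is named after the class quoted above. The class name FontCacheLogQuieting is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class FontCacheLogQuieting
{
    // Assumption: the logger name equals the fully qualified class name.
    static final String LOGGER_NAME =
            "org.apache.pdfbox.pdmodel.font.FileSystemFontProvider";

    // Keep a strong reference: java.util.logging holds loggers only weakly,
    // so a configured level could otherwise be lost after garbage collection.
    private static final Logger FONT_LOGGER = Logger.getLogger(LOGGER_NAME);

    private FontCacheLogQuieting()
    {
    }

    /** Hides WARNING and below for this one logger; SEVERE still gets through. */
    static void apply()
    {
        FONT_LOGGER.setLevel(Level.SEVERE);
    }
}
```

Calling FontCacheLogQuieting.apply() once at startup would be enough. With a different logging backend the same idea applies, but the configuration mechanism differs.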




[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5874:

Fix Version/s: 2.0.33, 3.0.4 PDFBox, 4.0.0

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Thomas Hoffmann
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>




[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5874:

Affects Version/s: 2.0.32

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Thomas Hoffmann
> Priority: Minor
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876692#comment-17876692
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

In the files I saw, /ActualText was often used for only part of the text 
(although I see that one of the files I attached uses it for all of it). Using 
only /ActualText and disregarding the old text extraction was never my 
intention. That's why a switch would mean we either get the improvement from 
this ticket, or work as before.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876660#comment-17876660
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

I haven't resolved this ticket because of one question I've been asking 
myself, and am now asking the users here: should I add a getter/setter that 
makes this ActualText handling optional? It should be active by default 
because I believe that it is useful in most cases.

e.g. ConsiderActualText / ActivateActualText / IncludeActualText  / whatever
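
To make the question concrete, here is a minimal, self-contained sketch of what such a switch could look like. It is not PDFBox code: the class ActualTextSwitchSketch, the property name considerActualText (borrowed from the first candidate name above) and the method textFor are all hypothetical, and the sketch only models the decision between the glyph-derived text and an /ActualText replacement.

```java
final class ActualTextSwitchSketch
{
    // Hypothetical flag; active by default, as proposed in the comment.
    private boolean considerActualText = true;

    boolean isConsiderActualText()
    {
        return considerActualText;
    }

    void setConsiderActualText(boolean considerActualText)
    {
        this.considerActualText = considerActualText;
    }

    /**
     * Picks the text for one span.
     *
     * @param glyphText  text derived from the glyphs (the "old" extraction)
     * @param actualText /ActualText replacement for the same span, or null
     *                   if the span has none
     */
    String textFor(String glyphText, String actualText)
    {
        if (considerActualText && actualText != null)
        {
            return actualText;
        }
        // Switch off, or no /ActualText for this span: behave as before.
        return glyphText;
    }
}
```

With the flag left at its default, spans that carry /ActualText use it and all other spans fall back to the glyph text; turning the flag off restores the previous behaviour for every span.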

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Comment Edited] (PDFBOX-5657) SMaskInData not supported for JPX images



[ 
https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876632#comment-17876632
 ] 

Tilman Hausherr edited comment on PDFBOX-5657 at 8/26/24 8:53 AM:
--

This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder, see
https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/9


was (Author: tilman):
This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder.

> SMaskInData not supported for JPX images
> 
>
> Key: PDFBOX-5657
> URL: https://issues.apache.org/jira/browse/PDFBOX-5657
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.29, 3.0.0 PDFBox, 4.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: JPEG2000, JPXDecode, JPXFilter
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: PDFJS-16782-SMaskInData.pdf
>
>
> JPX images can have transparency information; not only do we not support 
> that, but the images also look broken.
> For now, let's just return the opaque image until there's a good idea of what 
> to do. Maybe we have to return the mask in the DecodeResult.
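
The two options mentioned in the description (return only the opaque image, or also hand back the mask) can be illustrated with plain java.awt.image. This is a sketch of the idea only, not the PDFBox fix: JpxAlphaSketch, toOpaque and extractMask are hypothetical names, and the input is assumed to be an already decoded BufferedImage that may carry an alpha channel.

```java
import java.awt.image.BufferedImage;

final class JpxAlphaSketch
{
    private JpxAlphaSketch()
    {
    }

    /** Copies the colour channels into an image type without an alpha channel. */
    static BufferedImage toOpaque(BufferedImage decoded)
    {
        int w = decoded.getWidth();
        int h = decoded.getHeight();
        BufferedImage opaque = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                // TYPE_INT_RGB has no alpha, so the top 8 bits are dropped.
                opaque.setRGB(x, y, decoded.getRGB(x, y));
            }
        }
        return opaque;
    }

    /** Returns the alpha values as a separate 8-bit gray image (the "mask"). */
    static BufferedImage extractMask(BufferedImage decoded)
    {
        int w = decoded.getWidth();
        int h = decoded.getHeight();
        BufferedImage mask = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                // getRGB() returns ARGB; the alpha sits in the top 8 bits.
                mask.getRaster().setSample(x, y, 0, decoded.getRGB(x, y) >>> 24);
            }
        }
        return mask;
    }
}
```

For an input without alpha, getRGB() reports full opacity, so the mask is uniformly 255 and the opaque copy equals the input colours.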




[jira] [Resolved] (PDFBOX-5657) SMaskInData not supported for JPX images



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5657.
-
Fix Version/s: 2.0.33, 3.0.4 PDFBox, 4.0.0
Assignee: Tilman Hausherr
Resolution: Fixed

This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder.

> SMaskInData not supported for JPX images
> 
>
> Key: PDFBOX-5657
> URL: https://issues.apache.org/jira/browse/PDFBOX-5657
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.29, 3.0.0 PDFBox, 4.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: JPEG2000, JPXDecode, JPXFilter
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: PDFJS-16782-SMaskInData.pdf
>
>




[jira] [Updated] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-25 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5872:

Affects Version/s: 2.0.32

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Priority: Major
>
> [https://github.com/dbmdz/imageio-jnr] / 
> [https://mvnrepository.com/artifact/de.digitalcollections.imageio/imageio-openjpeg]
>  is an alternative JPEG2000 implementation for Java ImageIO that uses the 
> native OpenJPEG library as its backend.
> Unfortunately, it doesn't work out of the box because it doesn't implement 
> raster reading (canReadRaster not overridden, returns false), and PDFBox uses 
> canReadRaster() to validate image reader instances. However, it doesn't 
> appear that there is any real reliance on raster support in PDFBox (at least 
> in version 3) - if I patch the library to lie about raster support, it seems 
> to work perfectly.
> A further complication arises when the OpenJPEG native library cannot be 
> found: imageio-openjpeg returns null as the reader instance, which causes PDF 
> rendering to fail with an NPE, even if another JPEG2000 reader is available. 
> This can be remedied with a simple null check.
> [https://github.com/apache/pdfbox/pull/197] shows a possible solution. Until 
> then, [https://github.com/Googulator/imageio-jnr] can be used with PDFBox 
> 3.0.3 as a workaround, so long as the native library is correctly installed.
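
The two problems described above (a reader that reports canReadRaster() as false, and a null reader instance) can be sketched with the standard javax.imageio API alone. This is an illustration of the null check and of making the raster requirement optional, not the code from the pull request; ReaderLookupSketch, findReader and the requireRaster parameter are hypothetical names.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

final class ReaderLookupSketch
{
    private ReaderLookupSketch()
    {
    }

    /**
     * Returns the first usable reader for the format, or null if none is found.
     *
     * @param formatName    ImageIO format name, e.g. "JPEG2000"
     * @param requireRaster if true, readers without raster support are skipped
     */
    static ImageReader findReader(String formatName, boolean requireRaster)
    {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext())
        {
            ImageReader reader = readers.next();
            if (reader == null)
            {
                // A provider failed to create an instance; try the next one
                // instead of running into a NullPointerException later.
                continue;
            }
            if (requireRaster && !reader.canReadRaster())
            {
                reader.dispose();
                continue;
            }
            return reader;
        }
        return null;
    }
}
```

A caller that later needs readRaster() would pass requireRaster = true, while a path that only calls read() could pass false and accept readers such as the one described in this ticket.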




[jira] [Resolved] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-25 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5872.
-
Fix Version/s: 2.0.33, 3.0.4 PDFBox, 4.0.0
Assignee: Tilman Hausherr
Resolution: Fixed

Done, thanks!

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>




[jira] [Commented] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-22 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876045#comment-17876045
 ] 

Tilman Hausherr commented on PDFBOX-5872:
-

{quote}However, it doesn't appear that there is any real reliance on raster 
support in PDFBox (at least in version 3){quote}

{{readRaster()}} is called for CMYK images. Wouldn't it be better to have your 
modified method as a separate private method just for JPX?

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering
> Affects Versions: 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Priority: Major
>




[jira] [Resolved] (PDFBOX-5869) Checkstyle

2024-08-21 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5869.
-
Fix Version/s: 2.0.33, 3.0.4 PDFBox, 4.0.0
Assignee: Tilman Hausherr
Resolution: Fixed

That's it for now. It will only prevent the worst "transgressions".

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Simon Steiner
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> Can you enforce via the CI that mvn checkstyle:check passes?
> Disable any rules in the config that you don't want to enforce.




[jira] [Updated] (PDFBOX-5869) Checkstyle

2024-08-21 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5869:

Affects Version/s: 3.0.3 PDFBox, 2.0.32

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Simon Steiner
> Priority: Major
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875214#comment-17875214
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Another thought I just had was to extend TextPosition, add the setter there, 
and pass this object to the base class implementation of 
processTextPosition(). However, TextPosition is final.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file obtained from PDFBox shows 
> unexpected Unicode characters for Tamil and Bengali (for Hindi the characters 
> are extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "save as text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and declare 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude 
> that the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's 
> text extraction to confirm the problem is not with the PDF. I want to extract 
> this multilingual text only to run pattern matching on it; I do not need to 
> display the content, only to know whether the pattern is present or not (say, 
> if the problem is with fonts and glyphs).
>  
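On the reporter's use case (pattern matching on the extracted text): once extraction yields correct Unicode, checking for Tamil or Bengali content needs nothing beyond java.util.regex. A minimal sketch, assuming the extracted text is already available as a Java String; the class and method names are hypothetical and no PDFBox API is involved:

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ScriptMatchSketch {
    // \p{IsTamil} and \p{IsBengali} match characters by Unicode script.
    private static final Pattern TAMIL = Pattern.compile("\\p{IsTamil}+");
    private static final Pattern BENGALI = Pattern.compile("\\p{IsBengali}+");

    /** True if the text contains at least one Tamil-script character. */
    static boolean containsTamil(String text) {
        return TAMIL.matcher(text).find();
    }

    /** True if the text contains at least one Bengali-script character. */
    static boolean containsBengali(String text) {
        return BENGALI.matcher(text).find();
    }
}
```

This only helps when the extraction step produced the right code points in the first place, which is the subject of this issue.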




[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Affects Version/s: 3.0.3 PDFBox
   2.0.32

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png
>
>
> Submitted by Patrycja Zaremba  on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Attachment: (was: screenshot-1.png)

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>    Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png
>
>
> Submitted by Patrycja Zaremba  on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Attachment: screenshot-1.png

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>    Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, screenshot-1.png
>
>
> Submitted by Patrycja Zaremba  on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Created] (PDFBOX-5871) Rendering never finishes

Tilman Hausherr created PDFBOX-5871:
---

 Summary: Rendering never finishes
 Key: PDFBOX-5871
 URL: https://issues.apache.org/jira/browse/PDFBOX-5871
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Reporter: Tilman Hausherr
 Fix For: 2.0.33, 3.0.4 PDFBox
 Attachments: 2_42.pdf

Submitted by Patrycja Zaremba  on the users mailing list. I can confirm that it 
doesn't end even when running overnight 😡




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875100#comment-17875100
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Oops, no, it's not that easy. I forgot that we need 
{{TextPosition.setUnicode()}}, which doesn't exist in the released versions. And 
in the snapshot I've made it package-private to avoid people messing around with it.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874947#comment-17874947
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Yes, this could be possible. All the changes except one could be done by using 
an extension of the stripper. The suppressDuplicateOverlappingText problem 
would have to be solved by saving the value when ActualText is active and 
restoring it afterwards.
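
The save-and-restore idea can be sketched without PDFBox at all. In the model below, SketchStripper and its two callbacks are hypothetical stand-ins (not PDFBox classes or methods), and switching the flag off while ActualText is active is an assumption about what the saved value is meant to protect:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a text stripper; not a PDFBox class.
final class SketchStripper {
    private boolean suppressDuplicateOverlappingText;
    // One saved flag value per open marked-content sequence, so nesting works.
    private final Deque<Boolean> saved = new ArrayDeque<>();

    SketchStripper(boolean suppress) {
        this.suppressDuplicateOverlappingText = suppress;
    }

    boolean isSuppressDuplicateOverlappingText() {
        return suppressDuplicateOverlappingText;
    }

    /** Called when a marked-content sequence opens; actualText is null if absent. */
    void beginMarkedContent(String actualText) {
        saved.push(suppressDuplicateOverlappingText);   // save the current value
        if (actualText != null) {
            suppressDuplicateOverlappingText = false;   // assumption: off while active
        }
    }

    /** Called when the sequence closes; restores the value saved at its start. */
    void endMarkedContent() {
        if (!saved.isEmpty()) {
            suppressDuplicateOverlappingText = saved.pop();
        }
    }
}
```

Saving on every begin, not only when ActualText is present, keeps begin and end calls paired even when sequences are nested.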

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Affects Version/s: 3.0.3 PDFBox
   2.0.32

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Simon Steiner
> Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns empty string we should use a different system to 
> detect a cmyk image, so the output image is not inverted
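
One possible shape for such a fallback, as a rough sketch only: this is not the attached tmp.patch, and using the decoded raster's band count as the "different system" is an assumption. The class and method names are hypothetical; only java.awt.image.Raster is a real JDK class.

```java
import java.awt.image.Raster;

// Hypothetical helper, for illustration only.
final class ChannelCountSketch {
    /**
     * Prefers the channel count reported by the image metadata; if that
     * string is null, empty, or not a number, falls back to the number of
     * bands in the decoded raster.
     */
    static int channelCount(String numChannelsFromMetadata, Raster raster) {
        if (numChannelsFromMetadata != null) {
            String trimmed = numChannelsFromMetadata.trim();
            if (!trimmed.isEmpty()) {
                try {
                    return Integer.parseInt(trimmed);
                } catch (NumberFormatException e) {
                    // unusable metadata value: fall through to the raster
                }
            }
        }
        return raster.getNumBands();
    }

    /** Treats four channels as CMYK; a real decoder must also rule out other 4-band layouts. */
    static boolean looksLikeCmyk(String numChannelsFromMetadata, Raster raster) {
        return channelCount(numChannelsFromMetadata, raster) == 4;
    }
}
```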




[jira] [Resolved] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5870.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: CMYK
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: tmp.patch
>
>
> If getNumChannels returns empty string we should use a different system to 
> detect a cmyk image, so the output image is not inverted




[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Labels: CMYK  (was: )

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
>  Labels: CMYK
> Attachments: tmp.patch
>
>
> If getNumChannels returns empty string we should use a different system to 
> detect a cmyk image, so the output image is not inverted




[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Component/s: Rendering

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns empty string we should use a different system to 
> detect a cmyk image, so the output image is not inverted




[jira] [Commented] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



[ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874891#comment-17874891
 ] 

Tilman Hausherr commented on PDFBOX-5870:
-

Could you attach a PDF where this happens?

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns empty string we should use a different system to 
> detect a cmyk image, so the output image is not inverted




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874801#comment-17874801
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/19/24 7:40 AM:
--

Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx]. This is from the Apache Tika 
project, which also uses PDFBox.

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.


was (Author: tilman):
Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx] 

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874801#comment-17874801
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx] 

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: content_diffs_with_exceptions-ActualText.xlsx

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-18 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874700#comment-17874700
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

I ran a comparison on several 10 PDF files. While there were many 
improvements, I discovered that /ActualText is also used to PREVENT text 
extraction, as shown by these files:
[^PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf] 
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf] 
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf]
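
To illustrate why an empty /ActualText hides text: if an extractor prefers the replacement text whenever the entry is present, an empty string replaces the glyph text with nothing. A minimal model with hypothetical names (not PDFBox code), where null means the span has no /ActualText entry:

```java
// Hypothetical model, for illustration only.
final class ActualTextSketch {
    /**
     * Returns the text an extractor would emit for one marked-content span.
     * glyphText is what the glyphs decode to; actualText is the value of the
     * /ActualText entry, or null if the span has none.
     */
    static String textForSpan(String glyphText, String actualText) {
        // An entry that is present always wins, even when it is empty.
        return actualText != null ? actualText : glyphText;
    }
}
```

With this rule, textForSpan("Confidential", "") yields an empty string, so the visible text never reaches the output. Whether an extractor should honor that is a policy decision this sketch does not settle.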


> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, multilingual_test.pdf, 
> okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-18 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: 
PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf
PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf
PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, multilingual_test.pdf, 
> okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Commented] (PDFBOX-5869) Checkstyle

2024-08-18 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874647#comment-17874647
 ] 

Tilman Hausherr commented on PDFBOX-5869:
-

It should now work for the trunk, both with mvn checkstyle:check and for an 
ordinary build. It will prevent the "worst" things only. I didn't manage to 
create a regexp for all legal headers and mostly gave up on that one after 
failing with xmpbox, and maybe I shouldn't have bothered at all because we 
already have delegated that part to the "pedantic" build profile.

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
>
> Can you enforce via the CI that mvn checkstyle:check passes?
> Disable any rules in the config that you don't want to enforce.




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874501#comment-17874501
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 12:57 PM:
---

It's already done elsewhere and makes sure that the logic isn't applied during 
an ActualText segment:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).
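
As an illustration of the guard quoted above, here is a minimal, self-contained sketch. It does not reproduce PDFBox internals: the class GuardSketch and its method are hypothetical, and only the names suppressDuplicateOverlappingText and actualText are taken from the snippet. It shows that the suppression logic is applied only when the option is enabled and no ActualText segment is active, without ever changing the option itself.

```java
final class GuardSketch
{
    // Hypothetical helper: returns true if duplicate-overlap suppression would be
    // applied to the current glyph. actualText is assumed to be non-null while
    // an ActualText segment is being processed.
    static boolean appliesSuppression(boolean suppressDuplicateOverlappingText, String actualText)
    {
        return suppressDuplicateOverlappingText && actualText == null;
    }

    public static void main(String[] args)
    {
        // Option enabled, outside an ActualText segment: logic is applied.
        System.out.println(appliesSuppression(true, null));      // true
        // Option enabled, inside an ActualText segment: logic is skipped.
        System.out.println(appliesSuppression(true, "text"));    // false
        // Option disabled: logic is never applied, the setting stays false.
        System.out.println(appliesSuppression(false, null));     // false
    }
}
```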


was (Author: tilman):
It's already done elsewhere:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874501#comment-17874501
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

It's already done elsewhere:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Closed] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2740.
---
Resolution: Not A Problem

The /ActualText problem was fixed in PDFBOX-5868. However, extraction of the 
file had already been improved before.

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
>  Labels: ActualText
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text on a Korean PDF gives me a lot of warnings :
> WARNING: No Unicode mapping for US (33) in font 
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font 
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font 
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font 
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font 
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF must contain the necessary 
> conversion table, because every PDF reader (desktop or mobile) lets me copy and 
> paste the text without problems.
>  




[jira] [Reopened] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-2740:
-

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt

[jira] [Updated] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2740:

Labels: ActualText  (was: )

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
>  Labels: ActualText
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt

[jira] [Closed] (PDFBOX-4532) PDFTextStripper replacing the decimal with white space



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-4532.
---
Resolution: Duplicate

Fixed in PDFBOX-5868

> PDFTextStripper replacing the decimal with white space
> --
>
> Key: PDFBOX-4532
> URL: https://issues.apache.org/jira/browse/PDFBOX-4532
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.15
>Reporter: Akash Gupta
>Priority: Major
>  Labels: ActualText
> Attachments: FSUSA00BDD.pdf, PDFBOX-4532-reduced.pdf, SO71723006.pdf, 
> code_textStripper.PNG, numbers_without_decimal.PNG
>
>
> I'm using PDFTextStripperByArea, to be specific, and trying to extract a 
> particular area from the document. 
> In the output, most of the numbers (all but one) have their decimal point 
> replaced by a white space. When I copy and paste the text using Adobe 
> Reader or Chrome, the decimal points are preserved.




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Assigned] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned PDFBOX-5868:
---

Assignee: Tilman Hausherr

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Affects Version/s: 2.0.32

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Labels: ActualText  (was: )

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt

[jira] [Updated] (PDFBOX-3248) Unwanted spaces in text extraction (2)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3248:

Labels: ActualText  (was: )

> Unwanted spaces in text extraction (2)
> --
>
> Key: PDFBOX-3248
> URL: https://issues.apache.org/jira/browse/PDFBOX-3248
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.11, 2.0.0
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Attachments: PDFBOX-3248-spaces.pdf
>
>
> The attached file provided by Francisco from the user mailing list has spaces 
> in text extraction regardless of setting spacingTolerance or 
> averageCharTolerance. I was unable to extract "Cada frasco ampolla" which 
> looked straightforward in rendering, but it always appeared as "Ca da fras co 
> ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
>  6 0 1.058 6 122.0924 312.51 Tm
>  (Ca) Tj
>  /Span << /ActualText (\376\377\000\255) >> BDC
>    ( ) Tj
>  EMC
>  [ (da ) -301 (fras) ] TJ
>  /Span << /ActualText (\376\377\000\255) >> BDC
>    ( ) Tj
>  EMC
>  [ (co ) -301 (ampo) ] TJ
>  /Span << /ActualText (\376\377\000\255) >> BDC
>    ( ) Tj
>  EMC
>  [ (lla ) -301 (con) ] TJ
> {code}
> So there are really spaces there, and we keep them. Adobe is smarter, and 
> ignores them because they are overwritten thanks to the "-301" backwards 
> positioning.
> Would /ActualText help? However it is always the same here...
> Would it help to ignore spaces and decide based on positions only, maybe as 
> an option? I added these two lines below the first existing one:
> {code}
> String characterValue = position.getUnicode();
> if (" ".equals(characterValue))
>     continue;
> {code}
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
> {quote}
> A complete test brings many differences, most are harmless or are 
> improvements. Only one test case really fails, hello3.pdf. Original extract 
> is "Hello محمد World.", new extract is "Hello .Worldمحمد".
> More from Francisco
> {quote}
> As additional information, I've found 2 related posts (about other tools)
> in StackOverflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}
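
The "ignore spaces and decide based on positions only" idea from the description can be illustrated with a small standalone sketch. It does not use the PDFBox API: the Glyph class, the join method, the coordinates and the gap tolerance are all hypothetical stand-ins chosen for illustration, and the real stripper logic is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical positioned glyph run; NOT the PDFBox TextPosition class. */
final class Glyph {
    final String unicode;
    final float x;      // left edge (made-up units)
    final float width;  // advance width (made-up units)

    Glyph(String unicode, float x, float width) {
        this.unicode = unicode;
        this.x = x;
        this.width = width;
    }
}

final class PositionOnlySpacing {
    /**
     * Joins glyph runs, skipping literal spaces and inserting a space only
     * where the horizontal gap to the previous run exceeds gapTolerance.
     */
    static String join(List<Glyph> glyphs, float gapTolerance) {
        StringBuilder out = new StringBuilder();
        float previousEnd = Float.NaN;
        for (Glyph g : glyphs) {
            if (" ".equals(g.unicode)) {
                continue; // same idea as the two added lines: ignore literal spaces
            }
            if (!Float.isNaN(previousEnd) && g.x - previousEnd > gapTolerance) {
                out.append(' '); // a visible gap decides the word break
            }
            out.append(g.unicode);
            previousEnd = g.x + g.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("Ca", 0f, 10f));
        glyphs.add(new Glyph(" ", 10f, 3f));    // literal space that is drawn over
        glyphs.add(new Glyph("da", 10f, 10f));  // starts where "Ca" ended
        glyphs.add(new Glyph(" ", 20f, 3f));
        glyphs.add(new Glyph("fras", 23f, 18f)); // real gap of 3 units
        glyphs.add(new Glyph(" ", 41f, 3f));    // literal space that is drawn over
        glyphs.add(new Glyph("co", 41f, 9f));   // starts where "fras" ended
        System.out.println(join(glyphs, 2f));
    }
}
```

As the hello3.pdf failure mentioned in the description suggests, a position-only rule like this is likely to need extra care for right-to-left text.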




[jira] [Closed] (PDFBOX-3248) Unwanted spaces in text extraction (2)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-3248.
---
Resolution: Duplicate

Fixed in PDFBOX-5868


[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1787#comment-1787
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 8:09 AM:
--

{quote}why there is another boolean firstactualtext (...)  and where you 
are setting unicode to empty string{quote}

So that the replacement is done only the first time; the following glyphs then 
get their unicode replaced with the empty string instead of their previous 
unicode, which is no longer needed.

{quote}and why you are removing hyphens{quote}

These were soft hyphens; they're just annoying and I can't see any purpose for them.


I'll commit the changes soon, this includes some refactoring already done 
locally.


> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are 
> extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "save as text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to the text in a PDF file and 
> report matches without extracting the text first? (say, if the problem is 
> with fonts and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with the extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika._ I tested with Adobe 
> Reader's text export to confirm the problem is not with the PDF. I want to 
> extract this multilingual text only to run pattern matching on it; I do not 
> need to display the content, only to know whether the pattern is present 
> (say, if the problem is with fonts and glyphs).
>  




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1787#comment-1787
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

{quote}why there is another boolean firstactualtext (...)  and where you 
are setting unicode to empty string{quote}

So that the replacement is done only the first time; the following glyphs then 
get their unicode replaced with the empty string instead of their previous 
unicode, which is no longer needed.

{quote}and why you are removing hyphens{quote}

These were soft hyphens; they're just annoying and I can't see any purpose for them.


[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874369#comment-17874369
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 6:30 AM:
--

There is a problem that I didn't notice immediately, the spaces are incorrect. 
This is the new extraction:

  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

This is what it should have been:

  वारी  फेरी  बलि  गई, जित देखौं  तित तूँ ॥

The reason seems to be that the glyphs where I remove the unicode have no width 
(I have to investigate why), so the stripper thinks there is a space.

Update: the TextPosition items with empty unicode are missing in the list for 
some reason.

Update 2: the cause is the suppressDuplicateOverlappingText segment; it has to 
be disabled during an /ActualText occurrence.

 !screenshot-2.png! 
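
To illustrate why a duplicate-overlap check can swallow glyphs whose unicode was emptied, here is a deliberately simplified, hypothetical model. The Pos class, the filter method, the fixed tolerance and the coordinates are assumptions made for this sketch; the actual check in the PDFBox stripper differs in detail. The point is only that all empty-unicode glyphs share the same key, so nearby ones look like duplicates unless the check is skipped inside an /ActualText span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical glyph record; NOT the PDFBox TextPosition class. */
final class Pos {
    final String unicode;
    final float x;
    final boolean inActualText; // true while inside an /ActualText span

    Pos(String unicode, float x, boolean inActualText) {
        this.unicode = unicode;
        this.x = x;
        this.inActualText = inActualText;
    }
}

final class DuplicateSuppressionSketch {
    /** Drops glyphs that repeat the same unicode near an already seen position. */
    static List<Pos> filter(List<Pos> input, float tolerance) {
        Map<String, List<Float>> seen = new HashMap<>();
        List<Pos> kept = new ArrayList<>();
        for (Pos p : input) {
            if (!p.inActualText) { // suppression is disabled during /ActualText
                List<Float> xs = seen.computeIfAbsent(p.unicode, k -> new ArrayList<>());
                boolean duplicate = false;
                for (float x : xs) {
                    if (Math.abs(x - p.x) < tolerance) {
                        duplicate = true;
                        break;
                    }
                }
                if (duplicate) {
                    continue; // treated as overlapping duplicate text
                }
                xs.add(p.x);
            }
            kept.add(p);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two emptied glyphs close together (made-up coordinates).
        List<Pos> unprotected = Arrays.asList(new Pos("", 10.0f, false), new Pos("", 10.5f, false));
        List<Pos> protectedSpan = Arrays.asList(new Pos("", 10.0f, true), new Pos("", 10.5f, true));
        System.out.println(filter(unprotected, 1.0f).size());   // second glyph is dropped
        System.out.println(filter(protectedSpan, 1.0f).size()); // both glyphs are kept
    }
}
```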



[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874369#comment-17874369
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 4:12 AM:
--

There is a problem that I didn't notice immediately, the spaces are incorrect. 
This is the new extraction:

  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

This is what it should have been:

  वारी  फेरी  बलि  गई, जित देखौं  तित तूँ ॥

The reason seems to be that the glyphs where I remove the unicode have no width 
(I have to investigate why), so the stripper thinks there is a space.

Update: the TextPosition items with empty unicode are missing in the list for 
some reason.

 !screenshot-2.png! 



[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874369#comment-17874369
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/16/24 7:06 PM:
--

There is a problem that I didn't notice immediately, the spaces are incorrect. 
This is the new extraction:

  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

This is what it should have been:

  वारी  फेरी  बलि  गई, जित देखौं  तित तूँ ॥

The reason seems to be that the glyphs where I remove the unicode have no width 
(I have to investigate why), so the stripper thinks there is a space.

 !screenshot-2.png! 



[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874369#comment-17874369
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

There is a problem that I didn't notice immediately, the spaces are incorrect. 
This is the new extraction:

  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥

This is what it should have been:

  वारी  फेरी  बलि  गई, जित देखौं  तित तूँ ॥

The reason seems to be that the glyphs where I remove the unicode have no 
length (I have to investigate why), so the stripper thinks there is a space.

 !screenshot-2.png! 


[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: screenshot-2.png


[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874268#comment-17874268
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

First I'll need to make the changes I mentioned. The next releases of PDFBox 
will be in a few months because we just had a release. Then there will be a 
Tika release. So maybe end of the year or beginning of next year. Btw 
/ActualText isn't the biggest cause of extraction troubles. The biggest causes 
are broken PDFs or PDFs obfuscated to prevent text extraction. The change here 
will only slightly improve the quality of extractions that already work.


[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874189#comment-17874189
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/16/24 12:31 PM:
---

[~manish003] So you're appealing to our pride and think that such a transparent 
manipulation attempt would work 😂

I had a look at the {{PDFMarkedContentExtractor}} class and at 
https://stackoverflow.com/questions/78705656/ and 
https://stackoverflow.com/questions/44029191/ . Using parts of 
PDFMarkedContentExtractor in the stripper helps:

1) add
{code}
addOperator(new BeginMarkedContentSequenceWithProperties(this));
addOperator(new BeginMarkedContentSequence(this));
addOperator(new EndMarkedContentSequence(this));
{code}
to the constructor of the stripper

2) add
{code}
boolean inActualText = false;
boolean firstActualText = false;
String actualText = null;

@Override
public void endMarkedContentSequence()
{
    inActualText = false;
    super.endMarkedContentSequence();
}

@Override
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
{
    PDMarkedContent mc = PDMarkedContent.create(tag, properties);
    actualText = mc.getActualText();
    if (actualText != null)
    {
        actualText = actualText.replace("\u00ad", ""); // remove soft hyphens
        inActualText = true;
        firstActualText = true;
        //System.out.println("actualText: " + actualText);
    }
    super.beginMarkedContentSequence(tag, properties);
}
{code}
wherever you want

3) add
{code}
if (inActualText)
{
    if (firstActualText)
    {
        // the first glyph of the sequence carries the whole replacement text
        text.setUnicode(actualText);
        firstActualText = false;
    }
    else
    {
        // the remaining glyphs of the sequence contribute nothing
        text.setUnicode("");
    }
}
{code}
at the beginning of {{processTextPosition(TextPosition text)}}.

4) add
{code}
void setUnicode(String unicode)
{
    this.unicode = unicode;
}
{code}
in the {{TextPosition}} class.

There are lots of differences in build texts; most are better, some look weird 
(lots of spaces). Your file is extracted differently now (non-Latin parts):


 हिंदी   (hindi):
  तूँ तूँ करता  तूँ भया , मुझ मैं रही  न हूँ।
  वा री  फेरी  बलि  गई, जि त देखौं  ति त तूँ ॥
 
 जी वा त्मा  कह रही  है कि  ‘तू है’ ‘तू है’ कहते−कहते मेरा  अहंका र समा प्त हो  
गया । इस तरह भगवा न पर न्यौ छा वर
 हो ते−हो ते मैं पूर्णतया  समर्पि त हो  गई। अब तो  जि धर देखती  हूँ उधर तू ही  
दि खा ई देता  है।
 
  தமிழ் (tamil):
 
  ஆக்கம் அதர்வினா ய்ச் செ ல்லும் அசை விலா
 ஊக்க முடை யா  னுழை
நா மா ர்க்குங் குடியல்லோ ம் நமனை  யஞ்சோ ம்
நரகத்தி லிடர்ப்படோ ம் நடலை  யில்லோ ம்
ஏமா ப்போ ம் பிணியறியோ ம் பணிவோ  மல்லோ ம்
 

இன்பமே எந்நா ளுந் துன்ப மில்லை
தா மா ர்க்குங் குடியல்லா த் தன்மை  யா ன
சங்கரனற் சங்கவெ ண் குழை யோ ர் கா திற்
கோ மா ற்கே  நா மெ ன்றும் மீளா  ஆளா ய்க்
 கொ ய்ம்மலர்ச்சே  வடியிணை யே  குறுகி னோ மே .
 
 Bengali:
আঠা রো  বছর বয়স কী  দুঃ সহ
র্স্পধা য় নে য় মা থা  তো লবা র ঝুঁ কি ,
আঠা রো  বছর বয়সে ই অহরহ
বি রা ট দুঃ সা হসে রা  দে য় যে  উঁকি ।
আঠা রো  বছর বয়সে র নে ই ভয়
পদা ঘা তে  চা য় ভা ঙতে  পা থর বা ধা ,
এ বয়সে  কে উ মা থা  নো য়া বা র নয়-
আঠা রো  বছর বয়স জা নে  না  কাঁ দা ।
এ বয়স জা নে  রক্তদা নে র পুণ্য
 বা ষ্পে র বে গে  স্টি মা রে র মতো  চলে ,
 
 Japanese:
 古池や 蛙飛び込む 水の音
 





[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874222#comment-17874222
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

The code has one flaw: it doesn't "stack" the beginMarkedContentSequence / 
endMarkedContentSequence calls, so it breaks when they are nested. For example, 
with BMCS with ActualText, BMCS with something else, EMCS, TJ, EMCS the 
ActualText is ignored for the TJ, because the inner EMCS already resets the 
flag. I'll fix that at a later time (the {{PDFMarkedContentExtractor}} class 
does it properly).
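
A minimal sketch of such stacking follows. It is not the eventual fix and it is deliberately independent of PDFBox: ActualTextTracker, begin, end and map are hypothetical names standing in for beginMarkedContentSequence, endMarkedContentSequence and the check at the start of processTextPosition. It keeps one entry per open marked-content sequence, so an inner EMCS no longer cancels an outer ActualText.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical helper, not part of PDFBox: models only the nesting logic.
class ActualTextTracker
{
    // One entry per marked-content sequence that is currently open.
    private static final class Entry
    {
        final String actualText; // null if the sequence has no /ActualText
        boolean emitted;         // true once the replacement text was output

        Entry(String actualText)
        {
            this.actualText = actualText;
        }
    }

    private final Deque<Entry> open = new ArrayDeque<>();

    // Call for every begin of a marked-content sequence; pass null when there is no /ActualText.
    void begin(String actualText)
    {
        // remove soft hyphens, as in the code posted earlier
        String cleaned = actualText == null ? null : actualText.replace("\u00ad", "");
        open.push(new Entry(cleaned));
    }

    // Call for every end of a marked-content sequence; only the innermost one is closed.
    void end()
    {
        if (!open.isEmpty())
        {
            open.pop();
        }
    }

    // Returns the text to output for one glyph whose own unicode is glyphUnicode.
    String map(String glyphUnicode)
    {
        // push() adds at the head, so the descending iterator starts at the outermost sequence
        Iterator<Entry> it = open.descendingIterator();
        while (it.hasNext())
        {
            Entry e = it.next();
            if (e.actualText != null)
            {
                if (e.emitted)
                {
                    return ""; // replacement text already output for this sequence
                }
                e.emitted = true;
                return e.actualText;
            }
        }
        return glyphUnicode; // not inside any /ActualText sequence
    }

    public static void main(String[] args)
    {
        ActualTextTracker t = new ActualTextTracker();
        t.begin("ffi");  // BMCS with ActualText
        t.begin(null);   // BMCS with something else
        t.end();         // EMCS
        System.out.println("[" + t.map("f") + "]"); // prints [ffi]
        System.out.println("[" + t.map("i") + "]"); // prints []
        t.end();         // EMCS
        System.out.println("[" + t.map("x") + "]"); // prints [x]
    }
}
```

When one ActualText sequence is nested inside another, this sketch lets the outermost one win. That is an assumption made for the example, not something taken from {{PDFMarkedContentExtractor}}.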


[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874201#comment-17874201
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Ideally, the changes I just posted solve it... what you can do is to apply them 
to the source code, build it with -DskipTests (because some of the tests will 
fail due to text extraction differences) and then run with your file(s) and 
tell us whether the text extraction is what you'd expect or not.




[jira] [Commented] (PDFBOX-5869) Checkstyle

2024-08-15 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873865#comment-17873865
 ] 

Tilman Hausherr commented on PDFBOX-5869:
-

It turns out that some of the settings override others, so I'll use "skip" only 
until we decide something about it. I have no strong opinion about it: 
personally I don't use it; Tika uses it and I've learned to live with it, 
although it's a pain for some. However, it has the advantage of enforcing 
discipline. I could fix some of the past flaws in the code automatically, but 
this would pollute the history.

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
>
> Can you enforce via the CI that mvn checkstyle:check passes
> Disable any rules in the config you dont want to enforce




[jira] [Commented] (PDFBOX-5869) Checkstyle



[ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873712#comment-17873712
 ] 

Tilman Hausherr commented on PDFBOX-5869:
-

Only "xmpbox" has checkstyle defined but "io" fails quickly. I haven't managed 
to prevent this. Here is what I tried, I added this in the parent pom.xml above 
the jacoco segment but it still fails:
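{code:xml}
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-checkstyle-plugin</artifactId>
    <version>3.4.0</version>
    <configuration>
        <!-- element names lost in the mail archive; the four values were: true, true, false, false -->
    </configuration>
</plugin>
{code}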
I wonder what I'm missing.


[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873485#comment-17873485
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/14/24 10:30 AM:
---

It's in the content stream:
 !screenshot-1.png! 

Here's some code to detect it:
{code:java}
String urlText = "https://issues.apache.org/jira/secure/attachment/13070873/multilingual_test.pdf";
try (PDDocument doc = PDDocument.load(new URL(urlText).openStream()))
{
    for (int p = 0; p < doc.getNumberOfPages(); ++p)
    {
        PDPage page = doc.getPage(p);
        PDFStreamParser pdfStreamParser = new PDFStreamParser(page);
        Object token = pdfStreamParser.parseNextToken();
        while (token != null)
        {
            // look for a dictionary operand in the content stream that has an /ActualText entry
            if (token instanceof COSDictionary && ((COSDictionary) token).containsKey(COSName.ACTUAL_TEXT))
            {
                System.out.println("/ActualText in page " + (p + 1));
                break;
            }
            token = pdfStreamParser.parseNextToken();
        }
        pdfStreamParser.close();
    }
}
{code}
However, having it doesn't mean that all will be bad. For example the 
extraction of the first page looks ok. Also, there are many other different 
reasons that you could have a bad text extraction, e.g. obfuscation.




> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
> Attachments: adobe_out.txt, multilingual_test.pdf, pdfbox_out.txt, 
> screenshot-1.png
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results
>  * the multilingual_test.pdf is the original pdf i made to test multilingual 
> text extraction.
>  * the pdfbox_out.txt is the text file produced by pdfbox
>  * the adobe_out.txt is the text file created by adobe reader's save as text 
> feature
>  
> Observation:
> as you can see in the attachment the text file obtained by pdfbox shows weird 
> unicodes for tamil and bengali (for hindi the characters are extracted but 
> not overlapped; japanese seems fine to me). in contrast the text file 
> obtained from adobe reader's save as text feature seems fine and copy pasting 
> the text from my document viewer(evince) also works.
> Questions:
>  # why are the outputs from pdfbox and adobe different?
>  # what can i do to extract the text from a multilingual pdf correctly?
>  # Is there a way to apply pattern matching to text in pdf file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> ---
> My Usecase fyi:
> i am trying to extract text from files and run pattern matching to identify 
> pii in them. making it an app so users can define their own patterns. I am 
> using apache tika for parsing documents. I noticed problem with extracted PDF 
> text (other filetypes parse fine). used executable pdfbox jar to conclude 
> that the _problem is in pdfbox and not in tika._ tested with adobe reader's 
> extract text to confirm the problem is not with the pdf. i  want to extract 
> these multilingual text to run pattern matching on them alone and do not need 
> to display the content but only if the pattern is present or not.





[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: screenshot-1.png

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
> Attachments: adobe_out.txt, multilingual_test.pdf, pdfbox_out.txt, 
> screenshot-1.png
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows weird 
> Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but 
> not overlapped; Japanese seems fine to me). In contrast, the text file 
> obtained from Adobe Reader's "save as text" feature seems fine, and copy-pasting 
> the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and declare 
> matches without extracting the text first (say, if the problem is with fonts 
> and glyphs)?
> ---
> My use case, FYI:
> I am trying to extract text from files and run pattern matching to identify 
> PII in them, as an app in which users can define their own patterns. I am 
> using Apache Tika for parsing documents. I noticed a problem with the extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to conclude 
> that the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's 
> text export to confirm the problem is not with the PDF. I want to extract 
> this multilingual text only to run pattern matching on it; I do not need 
> to display the content, only to know whether the pattern is present or not.




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873432#comment-17873432
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/14/24 7:43 AM:
--

1) This is a duplicate of PDFBOX-3248 (don't be confused by "spaces" in ticket 
title), your content stream contains /ActualText which we don't support, sadly. 
This is a feature that replaces "wrong" text extraction with the correct one.
2) Nothing except contribute code
3) don't know.


was (Author: tilman):
1) This is a duplicate of PDFBOX-3248, your content stream contains /ActualText 
which we don't support, sadly.
2) Nothing except contribute code
3) don't know.


[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873432#comment-17873432
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/14/24 7:42 AM:
--

1) This is a duplicate of PDFBOX-3248, your content stream contains /ActualText 
which we don't support, sadly.
2) Nothing except contribute code
3) don't know.


was (Author: tilman):
This is a duplicate of PDFBOX-3248, your content stream contains /ActualText 
which we don't support, sadly.


[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873432#comment-17873432
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

This is a duplicate of PDFBOX-3248, your content stream contains /ActualText 
which we don't support, sadly.


[jira] [Commented] (PDFBOX-5866) Unable to load password protected pdf

2024-08-11 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872681#comment-17872681
 ] 

Tilman Hausherr commented on PDFBOX-5866:
-

I didn't look because I looked only at the ISO document, which didn't mention 
5, so I left that one untouched. But yes, the "adobe_supplement_iso32000.pdf" 
mentions it, so I'll add that as well.

> Unable to load  password protected pdf
> --
>
> Key: PDFBOX-5866
> URL: https://issues.apache.org/jira/browse/PDFBOX-5866
> Project: PDFBox
>  Issue Type: Bug
>  Components: Crypto
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>    Reporter: Charles D
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: pdfbox_invalid_pwd.pdf
>
>
> PDFBox is unable to load password protected pdf. Error is "Invalid AES key 
> length: 48 bytes."
> Adobe and qpdf are able to successfully open it , but many other applications 
> I've tried are unable to do so.
> The pdf is created by a third party scanner so unfortunately we have no 
> insight/control over its creation. 
> [^pdfbox_invalid_pwd.pdf] . Please let me know the best way to provide the 
> password. 




[jira] [Commented] (PDFBOX-5863) bad comparison of byte with 128

2024-08-11 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872665#comment-17872665
 ] 

Tilman Hausherr commented on PDFBOX-5863:
-

1 and 2 stay 1 and 2 in conversion. Byte values > 0x7f become negative numbers 
in conversion, that's the problem.
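
The sign issue described above can be reproduced in isolation. The following is a minimal sketch, not the PDFBox code itself: the class name `ByteCompareDemo`, its method names, and the sample `header` array are all made up for illustration. It only assumes standard Java semantics, where a `byte` is signed and is promoted to `int` before being compared with an `int` literal such as `0x80`.

```java
public class ByteCompareDemo {

    // Hypothetical stand-in for the flagged expression: header[0] is promoted
    // to an int in the range -128..127, so it can never equal 128 (0x80).
    static boolean naiveCheck(byte[] header) {
        return header[0] == 0x80;
    }

    // Variant suggested in the issue: cast the literal, so both sides are -128.
    static boolean castCheck(byte[] header) {
        return header[0] == (byte) 0x80;
    }

    // Alternative (an assumption, not taken from the issue): mask the byte to
    // its unsigned value 0..255 before comparing.
    static boolean maskCheck(byte[] header) {
        return (header[0] & 0xFF) == 0x80;
    }

    public static void main(String[] args) {
        // Made-up sample data; 1 and 2 are unaffected by the conversion.
        byte[] header = { (byte) 0x80, 1, 2 };

        System.out.println(header[0]);           // -128
        System.out.println(naiveCheck(header));  // false
        System.out.println(castCheck(header));   // true
        System.out.println(maskCheck(header));   // true
    }
}
```

Whether to cast the literal or mask the byte is largely a matter of style; masking tends to be easier to read when several unsigned byte values are compared in a row.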

> bad comparison of byte with 128
> ---
>
> Key: PDFBOX-5863
> URL: https://issues.apache.org/jira/browse/PDFBOX-5863
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 3.0.2 PDFBox
>Reporter: Dieter von Holten
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
>
> Inspection by SpotBugs shows a problem in 
> o.a.pdfbox.pdmodel.font.FontFactory line 285:
> the partial expression {{header[0] == 0x80 && ...}}
> is always false. The code should be {{header[0] == (byte) 0x80 && ...}}
>  
>  




[jira] [Commented] (PDFBOX-5866) Unable to load password protected pdf



[ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872401#comment-17872401
 ] 

Tilman Hausherr commented on PDFBOX-5866:
-

👍


[jira] [Comment Edited] (PDFBOX-5866) Unable to load password protected pdf



[ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872237#comment-17872237
 ] 

Tilman Hausherr edited comment on PDFBOX-5866 at 8/9/24 8:26 AM:
-

It should work now. Thanks for submitting this. Your file has nothing 
confidential so consider posting the password here unless it's used elsewhere 
(if so then maybe I should delete your mail?). Snapshot builds are or will be 
available here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.4-SNAPSHOT/
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.33-SNAPSHOT/

The build test failure in the examples subproject isn't because of the change, 
I verified this. I'll investigate this separately.
Update: likely a certificate problem with https://www.pki.admin.ch .


was (Author: tilman):
It should work now. Thanks for submitting this. Your file has nothing 
confidential so consider posting the password here unless it's used elsewhere 
(if so then maybe I should delete your mail?). Snapshot builds are or will be 
available here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.4-SNAPSHOT/
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.33-SNAPSHOT/

The build test failure in the examples subproject isn't because of the change, 
I verified this. I'll investigate this separately.


[jira] [Resolved] (PDFBOX-5866) Unable to load password protected pdf



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5866.
-
  Assignee: Tilman Hausherr
Resolution: Fixed

It should work now. Thanks for submitting this. Your file has nothing 
confidential so consider posting the password here unless it's used elsewhere 
(if so then maybe I should delete your mail?). Snapshot builds are or will be 
available here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.4-SNAPSHOT/
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.33-SNAPSHOT/

The build test failure in the examples subproject isn't because of the change, 
I verified this. I'll investigate this separately.


[jira] [Updated] (PDFBOX-5866) Unable to load password protected pdf



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5866:

Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0


[jira] [Updated] (PDFBOX-5866) Unable to load password protected pdf



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5866:

Component/s: Crypto


[jira] [Updated] (PDFBOX-5866) Unable to load password protected pdf



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5866:

Affects Version/s: 2.0.32
   3.0.3 PDFBox


[jira] [Comment Edited] (PDFBOX-5866) Unable to load password protected pdf