[jira] [Updated] (PDFBOX-5747) Surrogate pairs with combining diacritics are incorrectly ordered on text extraction

2023-12-26 Thread P Crossa (Jira)
\uD835\uDC4B\u0302 > Surrogate pairs with combining diacritics are incorrectly ordered on text > extraction > > > Key: PDFBOX-5747 > URL: https://issues.apac

[jira] [Updated] (PDFBOX-5747) Surrogate pairs with combining diacritics are incorrectly ordered on text extraction

2023-12-26 Thread P Crossa (Jira)
irs with combining diacritics are incorrectly ordered on text > extraction > > > Key: PDFBOX-5747 > URL: https://issues.apache.org/jira/browse/PDFBOX-5747 >

[jira] [Updated] (PDFBOX-5747) Surrogate pairs with combining diacritics are incorrectly ordered on text extraction

2023-12-26 Thread P Crossa (Jira)
The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as the surrogate pair \uD835\uDC4B, and the combining diacritic \u0302 > Surrogate pairs with combining diacritics are incorrectly ordered on text > extr
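
To illustrate the character composition described in this report, here is a minimal Java sketch. It is not PDFBox code; the class `SurrogateClusterDemo` and its methods are hypothetical names chosen for this example. It only shows that the three UTF-16 code units form two code points (U+1D44B followed by U+0302), and that a combining mark should stay attached to the base character that precedes it, assuming the general categories Mn, Mc and Me are an adequate test for "combining".

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateClusterDemo {

    /** True for the general categories Mn, Mc and Me (combining marks). */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Groups a string into base-plus-marks clusters. Iteration is by code
     * point, so a surrogate pair is never split into its two halves.
     */
    static List<String> clusters(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            // A non-combining code point starts a new cluster.
            if (!isCombiningMark(cp) && current.length() > 0) {
                result.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "\uD835\uDC4B\u0302";
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("Clusters: " + clusters(s).size());
    }
}
```

Code that reorders or reverses text one `char` at a time can separate the halves of the pair or move the diacritic away from its base; iterating by code point and keeping marks with their base avoids both problems in this sketch.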

[jira] [Created] (PDFBOX-5747) Surrogate pairs with combining diacritics are incorrectly ordered on text extraction

2023-12-26 Thread P Crossa (Jira)
P Crossa created PDFBOX-5747: Summary: Surrogate pairs with combining diacritics are incorrectly ordered on text extraction Key: PDFBOX-5747 URL: https://issues.apache.org/jira/browse/PDFBOX-5747 Project

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-20 Thread Michael Klink (Jira)
didn't find any tag like this in the PDF content.{quote} Then please share the PDF for further analysis. While you're right that in case of your document the text extraction result would improve by _not_ trying to identify gaps, in general one needs this gap detection. > Wrong Text Extract

[jira] [Comment Edited] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
I am trying to see if we have a similar document with the same behavior that we can edit in order to remove the customer sensitive data. The document we are testing at this moment is signed, and due to this I am unable to edit it and remove the sensitive data.

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
didn't find any tag like this in the PDF content. > Wrong Text Extraction - Unwanted Extra Spaces in the middle of words > > > Key: PDFBOX-5529 > URL: https://issues.apache.org/jir

[jira] [Comment Edited] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
if we have a similar document with the same behavior that we can edit in order to remove the customer sensitive data. The document we are testing at this moment is signed, and due to this I am unable to edit it and remove the sensitive data. > Wrong

[jira] [Comment Edited] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
with the same behavior that we can edit in order to remove the customer sensitive data. The document we are testing at this moment is signed, and due to this I am unable to edit it and remove the sensitive data. > Wrong

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
ior that we can edit in order to remove the customer sensitive data. The document we are testing at this moment is signed, and due to this I am unable to edit it and remove the sensitive data. > Wrong Text Extraction - Unwanted Ex

[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Carlos Alfonso Maya (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Alfonso Maya updated PDFBOX-5529: Attachment: image-2022-10-19-16-48-36-198.png > Wrong Text Extraction - Unwan

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Michael Klink (Jira)
file has such tags or not. The easiest option would be for you to share the file (or at least a page of it with that behavior). > Wrong Text Extraction - Unwanted Extra Spaces in the middle of words > > >

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Tilman Hausherr (Jira)
layout, so you get a terrible text extraction. I don't know why Adobe gets correct text. Maybe they use a dictionary approach. > Wrong Text Extraction - Unwanted Extra Spaces in the middle of words > > >

[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Jira
answer that by staring at the code. We'll need some sort of a sample pdf to debug that piece of code. > Wrong Text Extraction - Unwanted Extra Spaces in the middle of words > > > Key:

[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-18 Thread Carlos Alfonso Maya (Jira)
text from financial PDF documents. We have been using PDFBox for a long time, and we have detected a problem related to bad text extraction on PDFs from a customer. Since we work with customer data, we cannot share the PDFs; besides, they are signed and we cannot even edit them

[jira] [Updated] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-18 Thread Carlos Alfonso Maya (Jira)
text from financial PDF documents. We have been using PDFBox for a long time, and we have detected a problem related to bad text extraction on PDFs from a customer. Since we work with customer data, we cannot share the PDFs; besides, they are signed and we cannot even edit them

[jira] [Created] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-18 Thread Carlos Alfonso Maya (Jira)
Carlos Alfonso Maya created PDFBOX-5529: --- Summary: Wrong Text Extraction - Unwanted Extra Spaces in the middle of words Key: PDFBOX-5529 URL: https://issues.apache.org/jira/browse/PDFBOX-5529

[jira] [Closed] (PDFBOX-586) Text Extraction on Android

2022-09-20 Thread Jira
port of PDFBox [available|https://github.com/TomRoush/PdfBox-Android] > Text Extraction on Android > -- > > Key: PDFBOX-586 > URL: https://issues.apache.org/jira/browse/PDFBOX-586 > Project: PDFBox > Is

Re: text extraction regression tests for 3.x?

2022-06-17 Thread Tim Allison
I wouldn't. :D On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr wrote: > Am 15.06.2022 um 12:19 schrieb Tim Allison: > > Reports are here: > > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz > > govdocs1/372/372582.pdf > commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5 >

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Tilman Hausherr
Am 15.06.2022 um 12:19 schrieb Tim Allison: Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz govdocs1/372/372582.pdf commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5 commoncrawl3/VN/VNCWMY6Y4C3XYWA65CQPPSNZSY6OQEEA have lost text. But the first one is a

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Andreas Lehmkuehler
ehmkuehler < andr...@lehmi.de> wrote: Am 06.05.22 um 14:30 schrieb Tim Allison: All, Let me know when makes sense to run the text extraction regression Yes, it'd be useful to have some update results. How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. 3.0.0-

Re: text extraction regression tests for 3.x?

2022-06-16 Thread Andreas Lehmkuehler
, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote: Am 06.05.22 um 14:30 schrieb Tim Allison: All, Let me know when makes sense to run the text extraction regression Yes, it'd be useful to have some update results. How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
>>> >> Am 26.05.22 um 15:04 schrieb Tim Allison: >>>> >>> Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. >>>> The >>>> >>> reports are here: >>>> >>> >>>> https://corpora.tika.apach

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
>> >>> >> Am 26.05.22 um 15:04 schrieb Tim Allison: >>> >>> Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26. >>> The >>> >>> reports are here: >>> >>> >>> https://corpora.tika.apache.org/base/report

Re: text extraction regression tests for 3.x?

2022-06-15 Thread Tim Allison
>> >>> reports are here: >> >>> >> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz >> >>> >> >>> Happy to rerun with a more recent version of trunk. >> >>> >> >>> Cheers,

Re: text extraction regression tests for 3.x?

2022-06-13 Thread Tim Allison
on May 12 against 2.0.26. The > >>> reports are here: > >>> > https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz > >>> > >>> Happy to rerun with a more recent version of trunk. > >>> > >>> Cheers,

Re: text extraction regression tests for 3.x?

2022-06-11 Thread Andreas Lehmkuehler
://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz Happy to rerun with a more recent version of trunk. Cheers,    Tim On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote: Am 06.05.22 um 14:30 schrieb Tim Allison: All,     Let me know when makes sense to run the text extraction

Re: text extraction regression tests for 3.x?

2022-06-07 Thread Andreas Lehmkuehler
um 14:30 schrieb Tim Allison: All,     Let me know when makes sense to run the text extraction regression Yes, it'd be useful to have some update results. How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. 3.0.0-alpha3? tests for 3.x.  I regret I haven't been following our

Re: text extraction regression tests for 3.x?

2022-05-31 Thread Tim Allison
py to rerun with a more recent version of trunk. > > > > Cheers, > > > >Tim > > > > On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler > wrote: > > > >> Am 06.05.22 um 14:30 schrieb Tim Allison: > >>> All, > >>>

Re: text extraction regression tests for 3.x?

2022-05-29 Thread Andreas Lehmkuehler
/base/reports/reports_pdfbox_3x_20220512.tgz Happy to rerun with a more recent version of trunk. Cheers, Tim On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote: Am 06.05.22 um 14:30 schrieb Tim Allison: All, Let me know when makes sense to run the text extraction regression

Re: text extraction regression tests for 3.x?

2022-05-26 Thread Tim Allison
: > Am 06.05.22 um 14:30 schrieb Tim Allison: > > All, > >Let me know when makes sense to run the text extraction regression > Yes, it'd be useful to have some update results. > > How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. > 3.0.0-alpha3

Re: text extraction regression tests for 3.x?

2022-05-08 Thread Andreas Lehmkuehler
Am 06.05.22 um 14:30 schrieb Tim Allison: All, Let me know when makes sense to run the text extraction regression Yes, it'd be useful to have some update results. How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. 3.0.0-alpha3? tests for 3.x. I regret I haven't been

text extraction regression tests for 3.x?

2022-05-06 Thread Tim Allison
All, Let me know when it makes sense to run the text extraction regression tests for 3.x. I regret I haven't been following our mailing list as closely as I should be. Cheers, Tim

[jira] [Closed] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

2022-04-01 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr closed PDFBOX-5406. --- Resolution: Not A Bug > Assumption of Identity Not Valid for Text Extract

[jira] [Commented] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

2022-04-01 Thread Michael Tighe (Jira)
also cases where Adobe Reader brings trash. Some files have a /ToUnicode map and still return trash. We don't have a "strict" setting because there's no simple solution. Use a word dictionary to detect whether the output is trash, and then run OCR. > Assumptio

[jira] [Commented] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

2022-04-01 Thread Tilman Hausherr (Jira)
where Adobe Reader brings trash. Some files have a /ToUnicode map and still return trash. We don't have a "strict" setting because there's no simple solution. Use a word dictionary to detect whether the output is trash, and then run OCR. > Assumption of Identity Not Valid for Te
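
The suggestion in this comment (check the output against a word dictionary, then fall back to OCR) could be sketched roughly as follows. This is an illustration only and not part of PDFBox: the class `TrashDetector`, its methods, the sample dictionary and the threshold value are all assumptions made for this example, and the OCR step itself is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TrashDetector {

    /** Fraction of alphabetic tokens in the text that occur in the dictionary. */
    static double knownWordRatio(String text, Set<String> dictionary) {
        // Split on anything that is not a letter; the dictionary is lower case.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int known = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            total++;
            if (dictionary.contains(token)) {
                known++;
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    /** Treats the text as unusable when too few tokens are known words. */
    static boolean looksLikeTrash(String text, Set<String> dictionary, double threshold) {
        return knownWordRatio(text, dictionary) < threshold;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("text", "extraction", "works"));
        System.out.println(looksLikeTrash("Text extraction works", dictionary, 0.5));
        System.out.println(looksLikeTrash("xq#zz kk", dictionary, 0.5));
    }
}
```

A caller would run the normal extraction first and hand the page to an OCR engine only when `looksLikeTrash` returns true. A real dictionary and a threshold tuned on sample documents would be needed for this to be useful.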

[jira] [Created] (PDFBOX-5406) Assumption of Identity Not Valid for Text Extraction

2022-03-31 Thread Michael Tighe (Jira)
Michael Tighe created PDFBOX-5406: - Summary: Assumption of Identity Not Valid for Text Extraction Key: PDFBOX-5406 URL: https://issues.apache.org/jira/browse/PDFBOX-5406 Project: PDFBox

[jira] [Closed] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-15 Thread Tilman Hausherr (Jira)
ing Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issues.apache.org/jira/browse/PDFBOX-5290 > Project: PDFBox > Issue Type: Bug > Components: Text extraction >

[jira] [Reopened] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-15 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened PDFBOX-5290: - > ClassCastException during Text Extract

[jira] [Resolved] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-15 Thread Eric R Manzitti (Jira)
was using an older jar file of pdfbox > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issues.apache.org/jira/browse/PDFBOX-5290 > Project: PDFBox >

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-15 Thread Eric R Manzitti (Jira)
very much for the patience and guidance.  > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issues.apache.org/jira/browse/PDFBOX-5290 > Project: PDFBox >

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-12 Thread Eric R Manzitti (Jira)
PDFBox for extract text, but that caused more issues than resolutions) so I am going to be getting this sorted out soon, should be today or tomorrow. > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 >

[jira] [Comment Edited] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-08 Thread Eric R Manzitti (Jira)
h instance... was (Author: eric292): I will test this today, and let y'all know.  I am skeptical because I don't see how a fresh built instance with the 2.0.24 version in the pom.xml would possibly get a different version on a newly created "build-image" > ClassCastException

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-08 Thread Eric R Manzitti (Jira)
skeptical because I don't see how a fresh built instance with the 2.0.24 version in the pom.xml would possibly get a different version on a newly created "build-image" > ClassCastException during Text Extraction > - > >

[jira] [Comment Edited] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Tilman Hausherr (Jira)
): No they're the same. Please try a clean build / remove all old versions from the classpath, i.e. look into the directories what's there. If it still happens, please share the stack trace. > ClassCastException during Text Extraction > - > >

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Tilman Hausherr (Jira)
/ remove all old versions from the classpath, i.e. look into the directories what's there. If it still happens, please share the stack trace. > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 >
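
One way to follow the advice about stale versions on the classpath is to print where a class was actually loaded from at runtime. The sketch below uses only standard JDK calls; `ClasspathCheck` and `locationOf` are hypothetical names for this example. In an application that uses PDFBox, one would pass a PDFBox class such as `org.apache.pdfbox.text.PDFTextStripper.class` instead of the demo class.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    /** Returns the jar or directory a class was loaded from, if known. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Typical for classes that belong to the JDK itself.
            return "(no code source available)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Replace with the library class whose origin is in question.
        System.out.println(locationOf(ClasspathCheck.class));
    }
}
```

If the printed location points at an older jar than the one declared in the build file, an outdated copy is still being picked up first.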

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Eric R Manzitti (Jira)
ernal dependency" to PDFBox was indeed 2.0.24.  It was.  Is it at all possible the app and the library are different? > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issu

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Tilman Hausherr (Jira)
but not with 2.0.24. > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issues.apache.org/jira/browse/PDFBOX-5290 > Project: PDFBox > Issue Type: Bug >

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Eric R Manzitti (Jira)
.  Uhh hmm.  Okay thanks...Sorry I didn't try that first.   I assume when I do the ExtractText command line thingy, it's using a PDFTextStripper object? > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 >

[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Maruan Sahyoun (Jira)
ExtractText - see https://pdfbox.apache.org/2.0/commandline.html#extracttext Do you get any error message? > ClassCastException during Text Extraction > - > > Key: PDFBOX-5290 > URL: https://issues.apache.org/jir

[jira] [Updated] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Maruan Sahyoun (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maruan Sahyoun updated PDFBOX-5290: --- Attachment: newBroke.txt > ClassCastException during Text Extract

[jira] [Created] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-06 Thread Eric R Manzitti (Jira)
Eric R Manzitti created PDFBOX-5290: --- Summary: ClassCastException during Text Extraction Key: PDFBOX-5290 URL: https://issues.apache.org/jira/browse/PDFBOX-5290 Project: PDFBox Issue Type

[jira] [Resolved] (PDFBOX-5162) Text extraction lost

2021-04-12 Thread Jira
dictionary encodings when limiting the code length to one byte, which isn't correct. I've fixed that, so that only predefined encodings are taken into account. [~tilman] thanks for the pointer > Text extraction lost > > > Key: PDFBOX-5162 >

[jira] [Commented] (PDFBOX-5162) Text extraction lost

2021-04-12 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1888647 ] PDFBOX-5162: limit usage of one byte codes to predefined encodings > Text extraction lost > > > Key: PDFBOX-5162 > URL: https://issues.apache.org/jir

[jira] [Assigned] (PDFBOX-5162) Text extraction lost

2021-04-11 Thread Jira
[ https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-5162: -- Assignee: Andreas Lehmkühler > Text extraction l

[jira] [Updated] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5162: Fix Version/s: (was: 3.0.0 PDFBox) > Text extraction l

[jira] [Comment Edited] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS commoncrawl3/WS/WSWOPOYZHBVCGGBYSOGM7D4SYJ76SSFE commoncrawl3/HG/HGL6OAQ335IIDCGFBGDU33C76VL3NEYU > Text extraction lost > > > Key: PDFBOX-5162 > URL: https://issues.apache.org/jira/browse/PDFBOX-5162 >

[jira] [Comment Edited] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
: --- Same with commoncrawl3/5T/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS commoncrawl3/WS/WSWOPOYZHBVCGGBYSOGM7D4SYJ76SSFE commoncrawl3/HG/HGL6OAQ335IIDCGFBGDU33C76VL3NEYU was (Author: tilman): Same with commoncrawl3/5T/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS > Text extraction l

[jira] [Commented] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS > Text extraction lost > > > Key: PDFBOX-5162 > URL: https://issues.apache.org/jira/browse/PDFBOX-5162 > Project: PDFBox > Issue Type: Bug > Components: Text extraction >Aff

[jira] [Commented] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
to one byte codes by passing the origin length {code} This is from PDFBOX-4749. In this file, the ToUnicode stream has 2 bytes. Calling {{toUnicodeCMap.toUnicode(code)}} works, but then the text extraction tests fail. Btw Adobe Reader is not able to do a text extraction. > Text extraction l

[jira] [Updated] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5162: Description: Text extraction worked in 2.0.23, no longer in 3.0. Unicode values

[jira] [Created] (PDFBOX-5162) Text extraction lost

2021-04-10 Thread Tilman Hausherr (Jira)
Tilman Hausherr created PDFBOX-5162: --- Summary: Text extraction lost Key: PDFBOX-5162 URL: https://issues.apache.org/jira/browse/PDFBOX-5162 Project: PDFBox Issue Type: Bug

[jira] [Comment Edited] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

2021-03-09 Thread Jira
newPos.addAll(pos.subList(char2glyph.get(start), char2glyph.get(end))); } } return newPos; } } {code} I hope this is easier to adapt into an actual fix than working from scratch.   > Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width

[jira] [Commented] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

2021-03-09 Thread Jira
working from scratch.   > Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width > join, etc.) in a RTL context g

[jira] [Commented] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

2021-03-08 Thread Tilman Hausherr (Jira)
skilled in that field provides a fix. > Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width > join, etc.) in a RTL context get reversed incorrectly on text extr

[jira] [Created] (PDFBOX-5126) Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction

2021-03-08 Thread Jira
Gábor Stefanik created PDFBOX-5126: -- Summary: Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on text extraction Key: PDFBOX-5126

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-02-01 Thread Tilman Hausherr (Jira)
://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.23-SNAPSHOT/ > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > K

[jira] [Resolved] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread Jira
Project: PDFBox > Issue Type: Bug > Components: Text extraction >Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22 > Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10 >Reporter: sungwon kim >Assignee: Andreas

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886093 ] PDFBOX-5090: add a test for an identity bfrange mapping > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886092 ] PDFBOX-5090: add a test for an identity bfrange mapping > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread Tilman Hausherr (Jira)
Project: PDFBox > Issue Type: Bug > Components: Text extraction >Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22 > Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10 >Reporter: sungwon kim >Assignee: A

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread Jira
another index issue with the identity bfrange. Thanks for the check > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > K

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886079 ] PDFBOX-5090: fix issue with bfrange for identity encoding > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-31 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886078 ] PDFBOX-5090: fix issue with bfrange for identity encoding > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886071 ] PDFBOX-5090: add more tests > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886059 ] PDFBOX-5090: add more tests > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread Tilman Hausherr (Jira)
. > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > Key: PDFBOX-5090 > URL: https://issues.apache.org/jira/

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5090: Attachment: PDFBOX-3442-DirectResources.pdf > Missing text extraction under cert

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread Jira
strict mode which is used for inline CMaps only. I've fixed another issue with bfranges which skipped the last mapping of a range if the overflow detection wasn't triggered. > Missing text extraction under certain conditions starting with apache pdfbox >
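
The kind of off-by-one this comment mentions (the last mapping of a range being skipped) can be illustrated with a small sketch. This is not the PDFBox CMap parser: `BfRangeDemo` and `expand` are hypothetical, destinations are simplified to single integer values, and the overflow detection referred to in the comment is deliberately left out. The only point shown is that a `bfrange` entry covers its source codes inclusively, so the loop bound must include the upper code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BfRangeDemo {

    /**
     * Expands one simplified bfrange entry: source codes srcLo..srcHi
     * (both inclusive) map to dstStart, dstStart + 1, and so on.
     */
    static Map<Integer, Integer> expand(int srcLo, int srcHi, int dstStart) {
        Map<Integer, Integer> mapping = new LinkedHashMap<>();
        // "<=" matters here: "<" would drop the mapping for srcHi.
        for (int code = srcLo; code <= srcHi; code++) {
            mapping.put(code, dstStart + (code - srcLo));
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> mapping = expand(0x20, 0x22, 0x41);
        for (Map.Entry<Integer, Integer> e : mapping.entrySet()) {
            System.out.printf("%04X -> %04X%n", e.getKey(), e.getValue());
        }
    }
}
```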

[jira] [Assigned] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread Jira
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-5090: -- Assignee: Andreas Lehmkühler > Missing text extraction under cert

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886058 ] PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886057 ] PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1886056 ] PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886055 ] PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886054 ] PDFBOX-5090: test strict mode with overflow detection > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-30 Thread ASF subversion and git services (Jira)
le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1886053 ] PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-29 Thread Jira
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5090: --- Fix Version/s: 3.0.0 PDFBox 2.0.23 > Missing text extract

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273385#comment-17273385 ] Tilman Hausherr commented on PDFBOX-5090: - Yes that makes sense! > Missing text extract

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Jira
a ToUnicode CMap. WDYT? > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > Key: PDFBOX-5090 > URL: https://issues.apac

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
only 4 so that the last result is C5FF, see comment by [~mkl] in the related issue. I'm wondering whether Adobe uses a more "relaxed" approach for built-in tables? I also remember something about ranges being 2-dimensional sometimes?! > Missing text extraction under certain conditions

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5090: Attachment: PDFBOX-5090_reduced.pdf > Missing text extraction under certain conditi

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
1868399 or 1868402. > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > Key: PDFBOX-5090 > URL: https://issues.apache.org/

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5090: Labels: regression (was: ) > Missing text extraction under certain conditions start

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Tilman Hausherr (Jira)
7: 2019-09-20 2.0.18: 2019-12-23 > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > Key: PDFBOX-5090 > URL: https://iss

[jira] [Comment Edited] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Michael Klink (Jira)
.pdf] wants to point to the _bottom left of page 2_ (the line above section II) and the _top of page 3_ (line 2) respectively. > Missing text extraction under certain conditions starting with apache pdfbox >

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-27 Thread Michael Klink (Jira)
. > Missing text extraction under certain conditions starting with apache pdfbox > 2.0.18 > --- > > Key: PDFBOX-5090 > URL: https://issues.apache.org/jira/

[jira] [Commented] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-26 Thread Tilman Hausherr (Jira)
h any condition", but you attached images, not text. I tried with the 2.0.22 on "128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf" and I did get text extraction, see attachment. [^128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt] From your images, it seems you mean a difference in text extraction. Where is thi

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-26 Thread Tilman Hausherr (Jira)
[ https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5090: Attachment: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt > Missing text extraction under cert

[jira] [Updated] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-26 Thread sungwon kim (Jira)
to extract text with any condition. It is suspected that the missing text extraction phenomenon is associated with either the font type or the font size or text's width and height. I have attached the text extraction results of version 2.0.17 and version 2.0.18 and the sample data used

[jira] [Created] (PDFBOX-5090) Missing text extraction under certain conditions starting with apache pdfbox 2.0.18

2021-01-26 Thread sungwon kim (Jira)
sungwon kim created PDFBOX-5090: --- Summary: Missing text extraction under certain conditions starting with apache pdfbox 2.0.18 Key: PDFBOX-5090 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
