}\uD835\uDC4B\u0302{color}
> Surrogate pairs with combining diacritics are incorrectly ordered on text
> extraction
>
>
> Key: PDFBOX-5747
> URL: https://issues.apac
irs with combining diacritics are incorrectly ordered on text
> extraction
>
>
> Key: PDFBOX-5747
> URL: https://issues.apache.org/jira/browse/PDFBOX-5747
>
.
The attached PDF contains 𝑋̂. This is composed of 𝑋, which is represented as
the surrogate pair
{color:#cc7832}\uD835\uDC4B{color}
and the combining diacritic,
{color:#cc7832}\u0302{color}
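As a rough illustration of the composition described above (not PDFBox code), the following Java sketch shows how the three UTF-16 code units form two code points and one visible character. The class name `CombiningClusterSketch` and its helper methods are hypothetical and exist only for this example; a correct extraction would keep the base character (both surrogates) ahead of the combining mark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class CombiningClusterSketch {

    /** Groups each base code point with the combining marks that follow it. */
    static List<String> clusters(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads a full surrogate pair as one code point
            if (!isCombining(cp) && current.length() > 0) {
                out.add(current.toString());  // a new base character closes the previous cluster
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);     // 2 for supplementary code points, otherwise 1
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    /** True for Unicode general categories Mn, Me and Mc. */
    static boolean isCombining(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    public static void main(String[] args) {
        String xHat = "\uD835\uDC4B\u0302";
        System.out.println(xHat.length());                             // UTF-16 code units
        System.out.println(xHat.codePointCount(0, xHat.length()));     // code points
        System.out.println(Integer.toHexString(xHat.codePointAt(0)));  // base code point in hex
        System.out.println(clusters(xHat).size());                     // visible clusters
    }
}
```

Any reordering step that works on individual `char` values rather than on such clusters risks splitting the surrogate pair or moving the diacritic in front of its base.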
> Surrogate pairs with combining diacritics are incorrectly ordered on text
> extr
P Crossa created PDFBOX-5747:
Summary: Surrogate pairs with combining diacritics are incorrectly
ordered on text extraction
Key: PDFBOX-5747
URL: https://issues.apache.org/jira/browse/PDFBOX-5747
Project
didn't find any tag like
this in the PDF content.{quote}
Then please share the PDF for further analysis.
While you're right that in case of your document the text extraction result
would improve by _not_ trying to identify gaps, in general one needs this gap
detection.
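To make the trade-off concrete, here is a minimal, self-contained sketch of gap-based space detection. It is not the PDFBox implementation; the `Glyph` class, the `join` method and the way the threshold is computed are assumptions made for illustration. The idea is only that a space is inferred whenever the horizontal gap between two glyphs exceeds some fraction of the space width.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and threshold logic are assumptions, not PDFBox internals.
final class GapDetectionSketch {

    /** Hypothetical glyph record: text, left edge and advance width in the same units. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins glyphs of one line, inserting a space where the gap exceeds spaceWidth * tolerance. */
    static String join(List<Glyph> line, float spaceWidth, float tolerance) {
        StringBuilder sb = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');  // gap is wide enough to be treated as a word break
                }
            }
            sb.append(g.text);
            previous = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
            new Glyph("a", 0f, 5f),
            new Glyph("b", 5.5f, 5f),   // small gap, e.g. loose letter spacing
            new Glyph("c", 14f, 5f));   // larger gap, a real word break
        System.out.println(join(line, 3f, 0.5f));
        System.out.println(join(line, 3f, 0.1f));
    }
}
```

With a low tolerance, ordinary letter spacing is already read as a word break, which is the kind of unwanted space reported here; with no gap detection at all, documents that never draw explicit space glyphs would lose their word breaks. In PDFBox 2.0, `PDFTextStripper` exposes `setSpacingTolerance` and `setAverageCharTolerance` for tuning, though whether they help with a given file can only be determined by testing.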
> Wrong Text Extract
-2022-10-19-16-48-36-198.png!
I am trying to see if we have a similar document with the same behavior that we
can edit in order to remove the customer sensitive data. The document we are
testing at this moment is signed, and due to this I am unable to edit it and
remove the sensitive data.
didn't find any tag like this in
the PDF content.
> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
>
>
> Key: PDFBOX-5529
> URL: https://issues.apache.org/jir
[
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Carlos Alfonso Maya updated PDFBOX-5529:
Attachment: image-2022-10-19-16-48-36-198.png
> Wrong Text Extraction - Unwan
file has such tags or not. The easiest
option would be for you to share the file (or at least a page of it with that
behavior).
> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
>
>
>
layout, so you get a
terrible text extraction. I don't know why Adobe gets correct text. Maybe they
use a dictionary approach.
> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
>
>
>
answer that by staring at the code. We'll need some sort of a sample pdf
to debug that piece of code.
> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
>
>
> Key:
text from financial PDF
documents.
We have been using PDFBox for a long time, and we have detected a
problem with bad text extraction on PDFs from a customer.
Since we work with customer data, we cannot share the PDFs; besides, they are
signed and we cannot even edit them
Carlos Alfonso Maya created PDFBOX-5529:
---
Summary: Wrong Text Extraction - Unwanted Extra Spaces in the
middle of words
Key: PDFBOX-5529
URL: https://issues.apache.org/jira/browse/PDFBOX-5529
port of PDFBox
[available|https://github.com/TomRoush/PdfBox-Android]
> Text Extraction on Android
> --
>
> Key: PDFBOX-586
> URL: https://issues.apache.org/jira/browse/PDFBOX-586
> Project: PDFBox
> Is
I wouldn't. :D
On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr
wrote:
> On 15.06.2022 at 12:19, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>
> govdocs1/372/372582.pdf
> commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5
>
On 15.06.2022 at 12:19, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
govdocs1/372/372582.pdf
commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5
commoncrawl3/VN/VNCWMY6Y4C3XYWA65CQPPSNZSY6OQEEA
have lost text. But the first one is a
ehmkuehler <
andr...@lehmi.de> wrote:
On 06.05.22 at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction
regression
Yes, it'd be useful to have some updated results.
How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2
vs.
3.0.0-
, May 8, 2022 at 1:21 PM Andreas Lehmkuehler
wrote:
On 06.05.22 at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction
regression
Yes, it'd be useful to have some updated results.
How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs
>>> >> On 26.05.22 at 15:04, Tim Allison wrote:
>>>> >>> Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26.
>>>> The
>>>> >>> reports are here:
>>>> >>>
>>>> https://corpora.tika.apach
>>
>>> >> On 26.05.22 at 15:04, Tim Allison wrote:
>>> >>> Apologies for my delay. I ran trunk/3.x on May 12 against 2.0.26.
>>> The
>>> >>> reports are here:
>>> >>>
>>> https://corpora.tika.apache.org/base/report
>> >>> reports are here:
>> >>>
>> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
>> >>>
>> >>> Happy to rerun with a more recent version of trunk.
>> >>>
>> >>> Cheers,
on May 12 against 2.0.26. The
> >>> reports are here:
> >>>
> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
> >>>
> >>> Happy to rerun with a more recent version of trunk.
> >>>
> >>> Cheers,
://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
Happy to rerun with a more recent version of trunk.
Cheers,
Tim
On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote:
On 06.05.22 at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction
at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction regression
Yes, it'd be useful to have some updated results.
How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
3.0.0-alpha3?
tests for 3.x. I regret I haven't been following our
py to rerun with a more recent version of trunk.
> >
> > Cheers,
> >
> >Tim
> >
> > On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler
> wrote:
> >
> >> On 06.05.22 at 14:30, Tim Allison wrote:
> >>> All,
> >>>
/base/reports/reports_pdfbox_3x_20220512.tgz
Happy to rerun with a more recent version of trunk.
Cheers,
Tim
On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler wrote:
On 06.05.22 at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction regression
:
> On 06.05.22 at 14:30, Tim Allison wrote:
> > All,
> > Let me know when it makes sense to run the text extraction regression
> Yes, it'd be useful to have some updated results.
>
> How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
> 3.0.0-alpha3
On 06.05.22 at 14:30, Tim Allison wrote:
All,
Let me know when it makes sense to run the text extraction regression
Yes, it'd be useful to have some updated results.
How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
3.0.0-alpha3?
tests for 3.x. I regret I haven't been
All,
Let me know when it makes sense to run the text extraction regression
tests for 3.x. I regret I haven't been following our mailing list as
closely as I should be.
Cheers,
Tim
[
https://issues.apache.org/jira/browse/PDFBOX-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr closed PDFBOX-5406.
---
Resolution: Not A Bug
> Assumption of Identity Not Valid for Text Extract
also cases where Adobe Reader
returns trash. Some files have a /ToUnicode map and still return trash.
We don't have a "strict" setting because there's no simple solution. Use a
word dictionary to detect whether the output is trash, and then run OCR.
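The dictionary check suggested above could look roughly like the following sketch. It is only one possible heuristic, written against the JDK alone; the class name `TrashDetectorSketch`, the tokenization rule and the threshold are assumptions for illustration, and a real word list would have to come from elsewhere.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative heuristic only; names and thresholds are hypothetical.
final class TrashDetectorSketch {

    /** Returns the fraction of letter-only tokens that appear in the dictionary (0.0 if there are none). */
    static double knownWordRatio(String text, Set<String> dictionary) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int known = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;  // split() yields an empty first token when the text starts with a delimiter
            }
            total++;
            if (dictionary.contains(token)) {
                known++;
            }
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    /** Flags extracted text as probable trash when too few tokens are known words. */
    static boolean looksLikeTrash(String text, Set<String> dictionary, double minRatio) {
        return knownWordRatio(text, dictionary) < minRatio;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "text", "was", "extracted"));
        System.out.println(looksLikeTrash("The text was extracted.", dictionary, 0.5));
        System.out.println(looksLikeTrash("Xq# zzv kkp", dictionary, 0.5));
    }
}
```

Text flagged this way would then be routed to OCR instead of being trusted; the threshold needs tuning per language and document type.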
> Assumptio
Michael Tighe created PDFBOX-5406:
-
Summary: Assumption of Identity Not Valid for Text Extraction
Key: PDFBOX-5406
URL: https://issues.apache.org/jira/browse/PDFBOX-5406
Project: PDFBox
ing Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
>
[
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr reopened PDFBOX-5290:
-
> ClassCastException during Text Extract
was using an older jar
file of pdfbox
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
>
very much for the patience
and guidance.
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
>
PDFBox for extract text,
but that caused more issues than resolutions) so I am going to be getting this
sorted out soon, should be today or tomorrow.
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
>
h instance...
was (Author: eric292):
I will test this today, and let y'all know. I am skeptical because I don't see
how a fresh built instance with the 2.0.24 version in the pom.xml would
possibly get a different version on a newly created "build-image"
> ClassCastException during Text Extraction
> -
>
>
):
No they're the same. Please try a clean build / remove all old versions from
the classpath, i.e. look into the directories what's there. If it still
happens, please share the stack trace.
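One way to check which jar a class actually comes from at runtime is to ask the JVM for its code source. The sketch below uses only standard JDK calls; the class name `ClasspathCheck` is hypothetical, and in a real application one would pass a PDFBox class such as `PDFTextStripper.class` instead of the demo class used here.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper; uses only JDK APIs.
final class ClasspathCheck {

    /** Returns the jar or directory a class was loaded from, or a placeholder if the JVM does not report one. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source reported)";  // typical for classes loaded by the bootstrap loader
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // In a real application, pass the library class under suspicion instead.
        System.out.println(locationOf(ClasspathCheck.class));
    }
}
```

If the printed location points at an unexpected jar or an older version, that jar is earlier on the classpath than the intended one.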
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
>
ernal dependency" to PDFBox was
indeed 2.0.24. It was. Is it at all possible the app and the library are
different?
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issu
but not with 2.0.24.
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
> Issue Type: Bug
>
. Uhh hmm. Okay
thanks...Sorry I didn't try that first.
I assume when I do the ExtractText command line thingy, it's using the
PDFTextStripper object?
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
>
ExtractText - see
https://pdfbox.apache.org/2.0/commandline.html#extracttext
Do you get any error message?
> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jir
[
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Maruan Sahyoun updated PDFBOX-5290:
---
Attachment: newBroke.txt
> ClassCastException during Text Extract
Eric R Manzitti created PDFBOX-5290:
---
Summary: ClassCastException during Text Extraction
Key: PDFBOX-5290
URL: https://issues.apache.org/jira/browse/PDFBOX-5290
Project: PDFBox
Issue Type
dictionary encodings when limiting the code length to one
byte, which isn't correct. I've fixed that, so that only predefined encodings
are taken into account.
[~tilman] thanks for the pointer
> Text extraction lost
>
>
> Key: PDFBOX-5162
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1888647 ]
PDFBOX-5162: limit usage of one byte codes to predefined encodings
> Text extraction lost
>
>
> Key: PDFBOX-5162
> URL: https://issues.apache.org/jir
[
https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler reassigned PDFBOX-5162:
--
Assignee: Andreas Lehmkühler
> Text extraction l
[
https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5162:
Fix Version/s: (was: 3.0.0 PDFBox)
> Text extraction l
/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS
commoncrawl3/WS/WSWOPOYZHBVCGGBYSOGM7D4SYJ76SSFE
commoncrawl3/HG/HGL6OAQ335IIDCGFBGDU33C76VL3NEYU
> Text extraction lost
>
>
> Key: PDFBOX-5162
> URL: https://issues.apache.org/jira/browse/PDFBOX-5162
>
:
---
Same with
commoncrawl3/5T/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS
commoncrawl3/WS/WSWOPOYZHBVCGGBYSOGM7D4SYJ76SSFE
commoncrawl3/HG/HGL6OAQ335IIDCGFBGDU33C76VL3NEYU
was (Author: tilman):
Same with
commoncrawl3/5T/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS
> Text extraction l
/5TZOONHYZ4MTL6EW2ASQPBNUNC5XEIBS
> Text extraction lost
>
>
> Key: PDFBOX-5162
> URL: https://issues.apache.org/jira/browse/PDFBOX-5162
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
>Aff
to one byte codes by passing the origin length
{code}
This is from PDFBOX-4749. In this file, the ToUnicode stream has 2 bytes.
Calling {{toUnicodeCMap.toUnicode(code)}} works, but then the text extraction
tests fail.
Btw Adobe Reader is not able to do a text extraction.
> Text extraction l
[
https://issues.apache.org/jira/browse/PDFBOX-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5162:
Description: Text extraction worked in 2.0.23, no longer in 3.0. Unicode
values
Tilman Hausherr created PDFBOX-5162:
---
Summary: Text extraction lost
Key: PDFBOX-5162
URL: https://issues.apache.org/jira/browse/PDFBOX-5162
Project: PDFBox
Issue Type: Bug
newPos.addAll(pos.subList(char2glyph.get(start), char2glyph.get(end)));
}
}
return newPos;
}
}
{code}
I hope this is easier to adapt into an actual fix than working from scratch.
> Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width
> join, etc.) in a RTL context g
skilled in that field
provides a fix.
> Complex Unicode glyphs (surrogate pairs, combining diacritics, zero-width
> join, etc.) in a RTL context get reversed incorrectly on text extr
Gábor Stefanik created PDFBOX-5126:
--
Summary: Complex Unicode glyphs (surrogate pairs, combining
diacritics, zero-width join, etc.) in a RTL context get reversed incorrectly on
text extraction
Key: PDFBOX-5126
://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.23-SNAPSHOT/
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> K
Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
>Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
> Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
>Reporter: sungwon kim
>Assignee: Andreas
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886093 ]
PDFBOX-5090: add a test for an identity bfrange mapping
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886092 ]
PDFBOX-5090: add a test for an identity bfrange mapping
> Missing text extraction under certain conditions starting with apache pdfbox
>
another index issue with the
identity bfrange. Thanks for the check
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> K
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886079 ]
PDFBOX-5090: fix issue with bfrange for identity encoding
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886078 ]
PDFBOX-5090: fix issue with bfrange for identity encoding
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886071 ]
PDFBOX-5090: add more tests
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886059 ]
PDFBOX-5090: add more tests
> Missing text extraction under certain conditions starting with apache pdfbox
>
.
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> Key: PDFBOX-5090
> URL: https://issues.apache.org/jira/
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5090:
Attachment: PDFBOX-3442-DirectResources.pdf
> Missing text extraction under cert
strict mode which is used for
inline CMaps only. I've fixed another issue with bfranges which skipped the
last mapping of a range if the overflow detection wasn't triggered.
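For readers unfamiliar with bfrange entries, the sketch below shows the general idea under stated assumptions; it is not the PDFBox `CMapParser` code. A range maps consecutive source codes to consecutive destination values by incrementing the last destination byte. In the hypothetical `strict` mode an overflow of that last byte ends the range, while the relaxed mode carries into the preceding byte. The class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of bfrange expansion; not the PDFBox implementation.
final class BfRangeSketch {

    /** Expands one range: codes start..end map to dstStart, dstStart+1, ... (UTF-16BE bytes). */
    static Map<Integer, String> expand(int start, int end, byte[] dstStart, boolean strict) {
        Map<Integer, String> mappings = new LinkedHashMap<>();
        byte[] dst = dstStart.clone();
        for (int code = start; code <= end; code++) {
            // Emit the mapping first, so the last code of the range is never skipped.
            mappings.put(code, new String(dst, StandardCharsets.UTF_16BE));
            if (code < end && !increment(dst, dst.length - 1, strict)) {
                break;  // strict mode: the last byte would overflow, stop here
            }
        }
        return mappings;
    }

    /** Increments data at pos; on 0xFF either reports overflow (strict) or carries to the left. */
    static boolean increment(byte[] data, int pos, boolean strict) {
        if (pos > 0 && (data[pos] & 0xFF) == 0xFF) {
            if (strict) {
                return false;
            }
            data[pos] = 0;
            return increment(data, pos - 1, strict);
        }
        data[pos] = (byte) (data[pos] + 1);  // the first byte simply wraps around if it is 0xFF
        return true;
    }

    public static void main(String[] args) {
        System.out.println(expand(0x10, 0x12, new byte[] {0x00, 0x41}, true).values());
        System.out.println(expand(0x00, 0x02, new byte[] {0x00, (byte) 0xFE}, true).size());
        System.out.println(expand(0x00, 0x02, new byte[] {0x00, (byte) 0xFE}, false).size());
    }
}
```

Emitting each mapping before attempting the increment is what keeps the final code of a range from being dropped, which is the kind of off-by-one the comment above describes.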
> Missing text extraction under certain conditions starting with apache pdfbox
>
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler reassigned PDFBOX-5090:
--
Assignee: Andreas Lehmkühler
> Missing text extraction under cert
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886058 ]
PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886057 ]
PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1886056 ]
PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886055 ]
PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886054 ]
PDFBOX-5090: test strict mode with overflow detection
> Missing text extraction under certain conditions starting with apache pdfbox
>
le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886053 ]
PDFBOX-5090: strict mode with overflow detection is limited to CMaps within PDFs
> Missing text extraction under certain conditions starting with apache pdfbox
>
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler updated PDFBOX-5090:
---
Fix Version/s: 3.0.0 PDFBox
2.0.23
> Missing text extract
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17273385#comment-17273385
]
Tilman Hausherr commented on PDFBOX-5090:
-
Yes that makes sense!
> Missing text extract
a ToUnicode CMap. WDYT?
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> Key: PDFBOX-5090
> URL: https://issues.apac
only 4 so that the
last result is C5FF, see comment by [~mkl] in the related issue.
I'm wondering whether Adobe uses a more "relaxed" approach for built-in tables?
I also remember something about ranges being 2-dimensional sometimes?!
> Missing text extraction under certain conditions
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5090:
Attachment: PDFBOX-5090_reduced.pdf
> Missing text extraction under certain conditi
1868399 or 1868402.
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> Key: PDFBOX-5090
> URL: https://issues.apache.org/
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5090:
Labels: regression (was: )
> Missing text extraction under certain conditions start
7: 2019-09-20
2.0.18: 2019-12-23
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> Key: PDFBOX-5090
> URL: https://iss
.pdf] wants to point to the _bottom left of page 2_ (the line
above section II) and the _top of page 3_ (line 2) respectively.
> Missing text extraction under certain conditions starting with apache pdfbox
>
.
> Missing text extraction under certain conditions starting with apache pdfbox
> 2.0.18
> ---
>
> Key: PDFBOX-5090
> URL: https://issues.apache.org/jira/
h any condition", but you attached
images, not text. I tried with the 2.0.22 on "128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf"
and I did get text extraction, see attachment.
[^128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt]
From your images, it seems you mean a difference in text extraction. Where is
thi
[
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-5090:
Attachment: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt
> Missing text extraction under cert
to extract text with any condition.
It is suspected that the missing text extraction phenomenon is associated with
either the font type or the font size or text's width and height.
I have attached the text extraction results of version 2.0.17 and version
2.0.18 and the sample data used
sungwon kim created PDFBOX-5090:
---
Summary: Missing text extraction under certain conditions starting
with apache pdfbox 2.0.18
Key: PDFBOX-5090
URL: https://issues.apache.org/jira/browse/PDFBOX-5090