On 04.01.2018 at 20:57, Luca Loiodice wrote:
Thanks.
Any chance I can add the conversion as a post-processing step and avoid
having to build from source?
No... ExtractText returns nothing for these glyphs.
It's different if you work with TextPosition... then in theory, you
could try to replicate all the steps from the source code.
Because I get the code back as part of the extracted text ... so I was
wondering if I can load the font from the PDF
and use the code -> glyph name matrix to replace the code with the
character.
In that case I am not sure how I can load the data from the font ... but I
see the debugger is able to do it.
That is also available in source.
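A minimal sketch of the replacement idea in plain Java, working on character codes collected per glyph (for example via TextPosition) rather than on the finished text. Nothing here calls PDFBox. The class name, the method and both maps are hypothetical stand-ins: in a real program the code -> glyph name table would have to come from the font's encoding in the PDF, and the glyph name -> text table from whatever mapping you trust.

```java
import java.util.HashMap;
import java.util.Map;

public class CodeReplacer {
    /**
     * Translates character codes to text through two lookups:
     * code -> glyph name, then glyph name -> text.
     * Codes that cannot be resolved become U+FFFD (replacement character).
     */
    public static String replace(int[] codes,
                                 Map<Integer, String> codeToName,
                                 Map<String, String> nameToText) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String name = codeToName.get(code);
            String text = (name == null) ? null : nameToText.get(name);
            sb.append(text != null ? text : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical tables; a real program would fill these per font.
        Map<Integer, String> codeToName = new HashMap<>();
        codeToName.put(112, "G70");
        Map<String, String> nameToText = new HashMap<>();
        nameToText.put("G70", "p");

        System.out.println(replace(new int[] {112}, codeToName, nameToText));
    }
}
```

The tables have to be kept per font, since the same code can stand for different glyphs in different fonts of the same document.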
But if I were you, I wouldn't bother with such files. There are many
files where there is no unicode available. You'll have to use OCR for that.
Tilman
*Luca Loiodice |* Software Architect
On Thu, Jan 4, 2018 at 7:28 PM, Tilman Hausherr <[email protected]>
wrote:
On 04.01.2018 at 20:20, Luca Loiodice wrote:
I am trying to migrate a project from a commercial Windows PDF library to
PDFBox, but I see reduced accuracy when I extract text from arbitrary files.
For example, I have a PDF (enclosed) that does not have Unicode mappings
for certain glyphs, so when I try to extract the text using PDFBox
I get the following:
Attachments are swallowed, you'd need to upload to a sharehoster.
WARNING: No Unicode mapping for G70 (112) in font HAGLDF+MSTT31c5ed
Jan 04, 2018 10:24:02 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont
toUnicode
The Windows library returns the correct text for the glyphs with a missing
character mapping.
Is there a way for me to add some code to make PDFBox or my program
figure out what the text is in this case?
Yes, but you'd need to build from source because G70 is non-standard; the
change is described at the bottom of
https://issues.apache.org/jira/browse/PDFBOX-3962
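The warning above pairs the glyph name G70 with code 112, and 0x70 is 112, so the name looks like "G" followed by the character code in hex. Below is a minimal sketch of a fallback built on that observation, in plain Java with no PDFBox dependency. The class and method names are hypothetical, this is not the change from PDFBOX-3962, and treating the code as a Unicode code point is an assumption that only holds if the font's codes follow ASCII/Latin-1.

```java
import java.util.Optional;

public class GNameFallback {
    /**
     * Hypothetical helper: decodes names such as "G70" into the character
     * whose code point is the two-digit hex number after the "G".
     * Returns empty for names of any other shape and for control codes.
     */
    public static Optional<String> decode(String glyphName) {
        if (glyphName == null || !glyphName.matches("G[0-9A-Fa-f]{2}")) {
            return Optional.empty();
        }
        int code = Integer.parseInt(glyphName.substring(1), 16);
        // Assumption: the code is a Latin-1 code point; skip control characters.
        if (code < 0x20 || (code >= 0x7F && code < 0xA0)) {
            return Optional.empty();
        }
        return Optional.of(String.valueOf((char) code));
    }

    public static void main(String[] args) {
        System.out.println(decode("G70").orElse("?")); // prints "p"
    }
}
```

A name that does not match the pattern returns an empty result, so the caller can keep the existing behaviour for it.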
Tilman
Thanks for any help,
Luca
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]