Text extraction: locale handling?

Robert Neal Clayton Mon, 18 Jun 2018 10:13:41 -0700

Hello,

I’ve been getting started with Tika for the first time over the past few days. 
I’m running the latest (1.18) server jar and sending some test PDFs through it 
for text extraction via curl in a virtual machine.
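
For readers who want to reproduce this kind of request from a script, here is 
a rough Python sketch. It makes several assumptions that are not in the 
message: that the server listens on localhost:9998, that the PDF is sent with 
PUT to the /tika endpoint, and that the plain-text response is UTF-8. The 
names build_request and extract_text are hypothetical helpers.

```python
import urllib.request

# Assumed host and port; adjust to wherever the Tika server jar is running.
TIKA_URL = "http://localhost:9998/tika"


def build_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks the server for plain-text output."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf_path: str) -> str:
    """Send one PDF to the server and return the extracted text."""
    with open(pdf_path, "rb") as fh:
        req = build_request(fh.read())
    with urllib.request.urlopen(req) as resp:
        # Decode explicitly as UTF-8 (an assumption about the response)
        # rather than relying on the client machine's locale.
        return resp.read().decode("utf-8")
```

Decoding the response explicitly keeps the result independent of whatever 
locale the client machine happens to use.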


Consider the sample page here…

https://www.scribd.com/document/382021926/Extract 

This text was OCR’d by me with Tesseract 4.0 under an en_US.UTF-8 locale on 
FreeBSD 11.1-RELEASE.

Standard letter characters work fine with this, but if I extract the text on a 
machine that is not using the same English UTF-8 charset, I get the 
following, for example on the line containing the word “triangulating”:

We have also
made quite a few selections with an eye to pairing or triangulating?^`^tfor 
exam-
ple, we chose the famous closing section on writing from Plato?^`^ys Phaedrus,

…presumably because the default ASCII character set doesn’t include the 
typographic apostrophe or the em dash.
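
To illustrate the mismatch, here is a minimal Python sketch. It assumes the 
extracted text is UTF-8 and that the two characters in question are U+2019 
(right single quotation mark) and U+2014 (em dash). The sample string is an 
abbreviated stand-in for the extract, and decode_as_ascii is a hypothetical 
helper that mimics a consumer limited to ASCII.

```python
# Abbreviated stand-in for the sample line (assumed code points:
# U+2014 em dash, U+2019 right single quotation mark).
text = "triangulating\u2014for example, Plato\u2019s Phaedrus"

utf8_bytes = text.encode("utf-8")

# Each of the two characters takes three bytes in UTF-8, and every one of
# those bytes lies outside the 7-bit ASCII range.
apostrophe = "\u2019".encode("utf-8")
emdash = "\u2014".encode("utf-8")


def decode_as_ascii(data: bytes) -> str:
    """Decode as ASCII, substituting U+FFFD for each undecodable byte."""
    return data.decode("ascii", errors="replace")


garbled = decode_as_ascii(utf8_bytes)    # punctuation is lost
restored = utf8_bytes.decode("utf-8")    # punctuation survives

print(apostrophe, emdash)
print(garbled)
print(restored == text)
```

Under these assumptions the bytes themselves are intact. Only the 
interpretation on the receiving side differs, so decoding as UTF-8 recovers 
the original punctuation.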

Which makes me consider possibilities and conundrums:

How do people handle this in, say, bulk/automated extraction involving 
multiple languages?
