Re: Setting tesseract properties when using tika-server

David Meikle Sat, 15 Nov 2014 04:19:40 -0800

Hello Milos,

> On 30 Oct 2014, at 15:06, Milos Kovacevic <[email protected]> wrote:
> 
>> On Thu, 30 Oct 2014, Milos Kovacevic wrote:
>>> I am using tika-server-1.7-SNAPSHOT.jar which incorporates tesseract ocr
>>> engine. I am curious how can I set different tesseract parameters such
>>> as default language or output format (hOCR) in a separate request to
>>> tika server?
>> 
>> I believe they can only be set once on a server-wide basis at the moment
> 
> How can i do that?


You can set this using the TesseractOCRConfig class.  It has a property called
language, which takes a "+"-separated list of language models (the ones you
have installed with your Tesseract installation), named by their ISO 639-2
codes.  You then add the config to the ParseContext to override the default,
which uses the English model only.

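import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

ParseContext context = new ParseContext();
TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
// English, French and German models; each must be installed for Tesseract
ocrConfig.setLanguage("eng+fra+deu");
context.set(TesseractOCRConfig.class, ocrConfig);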

Then it is a case of passing this ParseContext to the parser, as the last
argument of parse(stream, handler, metadata, context).
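
If you want to choose the models per document, the "+"-separated string can be
built from a list of codes.  A minimal sketch, assuming a hypothetical helper
class TesseractLanguages (my own name, not part of Tika) and assuming the codes
you pass match models that are actually installed:

```java
import java.util.List;

// Hypothetical helper, not part of Tika: builds the "+"-separated
// language string that is passed to TesseractOCRConfig.setLanguage(...).
final class TesseractLanguages {

    private TesseractLanguages() {
        // static helper only, no instances
    }

    // Joins model names such as "eng", "fra", "deu" into "eng+fra+deu".
    // Assumption: every name corresponds to an installed Tesseract model.
    // This method does not check that; it only rejects names that are
    // null, empty, or contain '+' or whitespace.
    static String join(List<String> codes) {
        if (codes == null || codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            if (code == null || code.isEmpty() || code.contains("+")
                    || code.chars().anyMatch(Character::isWhitespace)) {
                throw new IllegalArgumentException("invalid language code: " + code);
            }
        }
        return String.join("+", codes);
    }
}
```

The result would then go straight into the config, for example
ocrConfig.setLanguage(TesseractLanguages.join(Arrays.asList("eng", "fra", "deu"))).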

>> 
>> Could you explain a use case for wanting to change it on a per-request
>> basis, to help us understand?
> 
> Well, I have a lot of files written in different languages and alphabets.
> OCR performance depends on that info. So when I have to send let's say
> English file I'll set the language to eng and if the file is Serbian I'll
> set it to be SER. Tesseract uses language files to improve recognition
> performance.

I am using this in production now and have done some work to make configuring
the OCR Parser easier.  I have not had time to contribute this back yet, but
will hopefully be able to do so whilst at ApacheCon EU.

Cheers,
Dave
