Hi Nick,

> On 15 Nov 2014, at 15:39, Nick Burch <[email protected]> wrote:
> 
> On Sat, 15 Nov 2014, David Meikle wrote:
>>> How can i do that?
>> 
>> You can set this using the TesseractOCRConfig class.  It has a property 
>> called language which can be set to a + separated list of supported language 
>> models (i.e. the ones you have installed with your Tesseract installation) 
>> using their ISO 639-2 codes.  You then add this into the ParseContext so you 
>> override the default use of the English model only.
>> 
>> ParseContext context = new ParseContext();
>> TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
>> ocrConfig.setLanguage("eng+fra+deu");
>> context.set(TesseractOCRConfig.class, ocrConfig);
> 
> The OP is using the Tika Server though. I guess we'd need to allow for an 
> extra header in the server to get this set on the context used in the 
> server's parsing?

We could do something like this to allow users to set the language per request 
- I am using the parser wrapped via its own server API, so all I am doing is 
capturing a request parameter and then setting the context to override a 
patched TesseractOCRConfig that loads from an external properties file akin to 
the PDFConfig file.  I will add that in at least.
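
As a rough illustration of the per-request override described above, here is a 
minimal sketch in plain Java.  It is not the patch itself and not part of Tika: 
the class name OcrLanguageResolver, the "language" property key and the set of 
installed models are all assumptions made for the example.  It reads a default 
from a properties file and lets an optional request parameter override it, 
keeping only the models that are actually installed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, not part of Tika: picks the OCR language string for
// one request from a properties-file default and an optional parameter.
final class OcrLanguageResolver {

    // Assumed property key; the name is illustrative only.
    static final String LANGUAGE_KEY = "language";

    private final String defaultLanguage;
    private final Set<String> installed;

    OcrLanguageResolver(Properties props, Set<String> installed) {
        // Fall back to the English model when the file sets no language.
        this.defaultLanguage = props.getProperty(LANGUAGE_KEY, "eng");
        this.installed = installed;
    }

    // Loads the external properties file (e.g. a line "language=eng+fra").
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the "+"-separated language list to use for this request.
    // Codes that are not installed are dropped; if nothing usable remains,
    // the configured default is returned instead.
    String resolve(String requested) {
        if (requested == null || requested.trim().isEmpty()) {
            return defaultLanguage;
        }
        Set<String> accepted = new LinkedHashSet<>();
        for (String code : requested.trim().split("\\+")) {
            if (installed.contains(code)) {
                accepted.add(code);
            }
        }
        return accepted.isEmpty() ? defaultLanguage : String.join("+", accepted);
    }
}
```

The string returned by resolve() would then be passed to 
ocrConfig.setLanguage(...) before the config is set on the ParseContext, as in 
the snippet quoted earlier.  Dropping unknown codes silently is just one 
option; rejecting the request outright would be equally reasonable.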

I personally don’t like custom headers that modify behaviour, although you do 
see it in POST requests at times.  Same difference really between this and an 
optional parameter.  Maybe the config file will be enough, as having added the 
above, I don’t see much difference between a call with a single language and 
one with all languages configured.

>> I am using this in production now and have done some work to make 
>> configuring the OCR Parser easier.  Not had time to contribute this back, 
>> will hopefully be able to do this whilst at ApacheCon EU.
> 
> I'll be there too, but slightly stressed with the number of talks I'm giving, 
> but I can hopefully offer a quick hand at some point :)

I had noticed you are doing quite a few sessions! See you soon.

Cheers,
Dave
