On Sat, 15 Nov 2014, David Meikle wrote:
How can I do that?

You can set this using the TesseractOCRConfig class. It has a language property, which takes a "+"-separated list of language models (the ones installed with your Tesseract installation), named by their ISO 639-2 codes. You then add the config to the ParseContext to override the default, which uses the English model only.

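import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

ParseContext context = new ParseContext();
TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
ocrConfig.setLanguage("eng+fra+deu"); // English, French and German models
context.set(TesseractOCRConfig.class, ocrConfig);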
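If the language list comes from configuration rather than a literal, a small helper can build the "+"-separated string. This is only a sketch: OcrLanguages and join are hypothetical names, not part of Tika, and the check assumes plain three-letter codes such as "eng", so it would reject other model names.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: builds the "+"-separated value
// that is passed to TesseractOCRConfig.setLanguage(...).
final class OcrLanguages {
    // Assumption: only plain three-letter lowercase codes are accepted.
    private static final Pattern CODE = Pattern.compile("[a-z]{3}");

    static String join(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            if (!CODE.matcher(code).matches()) {
                throw new IllegalArgumentException("not a three-letter code: " + code);
            }
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("eng", "fra", "deu")));
    }
}
```

The result would then be handed to ocrConfig.setLanguage(...) as in the snippet above.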

The OP is using the Tika Server though. I guess we'd need to allow for an extra header in the server to get this set on the context used in the server's parsing?
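As a rough sketch of the header idea, using plain Java only: the server would read an optional request header and fall back to the English model when it is absent. The header name "X-OCR-Language" and the class OcrHeaderSketch are made up for illustration and are not an existing Tika Server feature; the returned string is what would be passed to setLanguage on the config placed in the server's ParseContext.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustration only: neither this class nor the header name exists in Tika.
final class OcrHeaderSketch {
    static final String OCR_LANGUAGE_HEADER = "X-OCR-Language"; // hypothetical header name
    static final String DEFAULT_LANGUAGE = "eng";               // default English model

    // Returns the language string to use for one request.
    static String languageFrom(Map<String, String> requestHeaders) {
        // HTTP header names are case-insensitive, so look them up that way.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.putAll(requestHeaders);
        String value = headers.get(OCR_LANGUAGE_HEADER);
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_LANGUAGE;
        }
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(languageFrom(Map.of("x-ocr-language", "eng+fra")));
        System.out.println(languageFrom(Map.of()));
    }
}
```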

I am using this in production now and have done some work to make configuring the OCR Parser easier. I have not had time to contribute this back yet, but will hopefully be able to do so whilst at ApacheCon EU.

I'll be there too, though slightly stressed by the number of talks I'm giving. I can hopefully offer a quick hand at some point :)

Nick
