On Sat, 15 Nov 2014, David Meikle wrote:
How can I do that?
You can set this using the TesseractOCRConfig class. It has a property
called language which can be set to a + separated list of supported
language models (i.e. the ones you have installed with your Tesseract
installation) using their ISO 639-2 codes. You then add this to the
ParseContext to override the default, which uses only the English model.
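import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

ParseContext context = new ParseContext();
TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
// "+"-separated list; each model must be installed with Tesseract
ocrConfig.setLanguage("eng+fra+deu");
context.set(TesseractOCRConfig.class, ocrConfig);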
The OP is using the Tika Server though. I guess we'd need to allow for an
extra header in the server to get this set on the context used in the
server's parsing?
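To make the header idea a bit more concrete, here is a rough sketch of how
a server might turn a request header into the language string before
handing it to setLanguage(). This is only an illustration, not existing
Tika Server code: the header name X-Tika-OCRLanguage, the class
OcrLanguageHeader, the fallback to "eng" and the validation pattern are
all assumptions made for the example. It also looks the header up by exact
name, whereas real HTTP headers are case-insensitive.

```java
import java.util.Map;
import java.util.regex.Pattern;

public class OcrLanguageHeader {
    // Hypothetical header name, chosen for this sketch only.
    public static final String HEADER = "X-Tika-OCRLanguage";

    // Assumed fallback, matching the English-only default described above.
    public static final String DEFAULT_LANGUAGE = "eng";

    // Assumed shape: one or more codes (letters or underscores) joined by '+'.
    private static final Pattern LANGUAGE_LIST =
            Pattern.compile("[A-Za-z_]+(\\+[A-Za-z_]+)*");

    // Returns the language list from the headers, or the default if absent.
    public static String resolve(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return DEFAULT_LANGUAGE;
        }
        String trimmed = value.trim();
        if (!LANGUAGE_LIST.matcher(trimmed).matches()) {
            throw new IllegalArgumentException(
                    "Invalid OCR language list: " + value);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Header present: the value is passed through after validation.
        System.out.println(resolve(Map.of(HEADER, "eng+fra+deu")));
        // Header absent: falls back to the default.
        System.out.println(resolve(Map.of()));
    }
}
```

The resolved string could then be passed to ocrConfig.setLanguage() on a
per-request TesseractOCRConfig before it is set on the ParseContext.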
I am using this in production now and have done some work to make
configuring the OCR Parser easier. Not had time to contribute this
back, will hopefully be able to do this whilst at ApacheCon EU.
I'll be there too, though slightly stressed with the number of talks I'm
giving, but I can hopefully offer a quick hand at some point :)
Nick