Re: Setting tesseract properties when using tika-server

Nick Burch Sat, 15 Nov 2014 06:41:38 -0800

On Sat, 15 Nov 2014, David Meikle wrote:

How can I do that?
You can set this using the TesseractOCRConfig class. It has a property called language which can be set to a + separated list of supported language models (i.e. the ones you have installed with your Tesseract installation) using their ISO 639-2 codes. You then add this into the ParseContext so you override the default use of the English model only.
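import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

// Override the default (English only) with English, French and German models.
ParseContext context = new ParseContext();
TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
ocrConfig.setLanguage("eng+fra+deu");
context.set(TesseractOCRConfig.class, ocrConfig);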

The OP is using the Tika Server though. I guess we'd need to allow for an extra header in the server to get this set on the context used in the server's parsing?
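
To make the header idea a bit more concrete, here is a rough sketch of what both ends might look like. It uses only the JDK (java.net.http, Java 11+) and deliberately leaves Tika out. The header name X-Tika-OCRLanguage, the class OcrHeaderSketch and both helper methods are assumptions made up for illustration, not something the server is confirmed to support in this thread; the URL assumes a tika-server running locally on its usual port 9998.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;

// Sketch only: the header name is an assumption, not a confirmed tika-server API.
class OcrHeaderSketch {
    // Hypothetical header carrying the "+"-separated Tesseract language list.
    static final String OCR_LANGUAGE_HEADER = "X-Tika-OCRLanguage";
    // Default described in the thread: English model only.
    static final String DEFAULT_LANGUAGE = "eng";

    // Client side: build (but do not send) a PUT request that carries the language list.
    static HttpRequest buildRequest(URI endpoint, byte[] document, String languages) {
        return HttpRequest.newBuilder(endpoint)
                .header(OCR_LANGUAGE_HEADER, languages)
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Server side: read the language list from the request headers,
    // falling back to the default when the header is missing or blank.
    // Note: this plain Map lookup is case-sensitive, unlike real HTTP headers.
    static String languageFromHeaders(Map<String, String> headers) {
        String value = headers.get(OCR_LANGUAGE_HEADER);
        return (value == null || value.isBlank()) ? DEFAULT_LANGUAGE : value.trim();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
                URI.create("http://localhost:9998/tika"), new byte[0], "eng+fra+deu");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue(OCR_LANGUAGE_HEADER).orElse(""));
    }
}
```

On the server, the value returned by languageFromHeaders would then presumably be passed to TesseractOCRConfig.setLanguage and the config placed on the ParseContext for that request, as in the snippet quoted above.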

I am using this in production now and have done some work to make configuring the OCR Parser easier. Not had time to contribute this back, will hopefully be able to do this whilst at ApacheCon EU.

I'll be there too, though slightly stressed by the number of talks I'm giving. I can hopefully offer a quick hand at some point :)


Nick
