On Sat, 15 Nov 2014, David Meikle wrote:
The OP is using the Tika Server though. I guess we'd need to allow for
an extra header in the server to get this set on the context used in
the server's parsing?
We could do something like this to allow users to set the language per
request - I am using the parser wrapped via its own server API, so all I
am doing is capturing a request parameter and then setting the context
to override a patched TesseractOCRConfig that loads from an external
properties file akin to the PDFConfig file. I will add that in at
least.
I personally don’t like custom headers that modify behaviour, although
you do see it in POST requests at times. Same difference really between
this and an optional parameter. Maybe the config file will be enough:
having added the above, I don’t see much difference between a call with
a single language and one with all languages configured.
Maybe we could say that the default Tika URL won't include Tesseract. We
then provide another one that does bring it in, and offers parameters to
hint which languages to try for on that request?
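
As a rough sketch of the parameter handling only (the class and method
names here are hypothetical, nothing like this exists in Tika today),
the endpoint could turn an optional, comma-separated parameter into the
"eng+fra" style string that Tesseract accepts for multiple languages,
and fall back to whatever the config file says when the parameter is
missing or unusable. The result would then be set on the
TesseractOCRConfig placed in the parse context for that request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: turns an optional request
// parameter into a Tesseract-style language string such as "eng+fra".
public class OcrLanguageHint {
    // Assumed shape of a language code: letters and underscores,
    // e.g. "eng" or "chi_sim". Anything else is ignored.
    private static final Pattern CODE = Pattern.compile("[A-Za-z_]+");

    public static String resolve(String requestParam, String configuredDefault) {
        // No hint on the request: keep the language from the config file.
        if (requestParam == null || requestParam.trim().isEmpty()) {
            return configuredDefault;
        }
        List<String> codes = new ArrayList<>();
        for (String part : requestParam.split(",")) {
            String code = part.trim();
            if (CODE.matcher(code).matches()) {
                codes.add(code);
            }
        }
        // If nothing valid was supplied, fall back to the configured default.
        return codes.isEmpty() ? configuredDefault : String.join("+", codes);
    }
}
```

Dropping invalid codes rather than rejecting the request is just one
option; returning a 400 would be equally reasonable.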
Nick