Hello,
I would welcome the community's input on the following question.
We are using Tika embedded in Solr server.
I would like to know whether it is possible to pass a specific config file
to TesseractOCR, as run by the Solr extractor, without recompiling any
source code.
Instead of the default TesseractOCRConfig.properties packaged inside the
Tika JAR, we need to use our own file, which overrides some parameters.
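To make the "override some parameters" idea concrete, here is a minimal sketch using only java.util.Properties, independent of Tika. The class name OcrPropertiesSketch and the sample values are hypothetical, and I am not claiming this is how TesseractOCRConfig loads its settings internally; it only shows an external file overriding a few keys while the remaining ones fall back to packaged defaults.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of Tika or Solr.
final class OcrPropertiesSketch {

    // Loads the packaged defaults first, then layers the external file on top.
    // Keys missing from the override stream fall back to the defaults.
    static Properties layer(InputStream defaults, InputStream overrides) {
        try {
            Properties base = new Properties();
            base.load(defaults);
            Properties merged = new Properties(base); // 'base' is the fallback
            merged.load(overrides);
            return merged;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Turns inline text into a stream so the sketch runs without any file.
    static InputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration only.
        Properties merged = layer(
                stream("language=eng\ntimeout=120\n"),  // stands in for the packaged file
                stream("language=fra\n"));              // stands in for our own file
        System.out.println(merged.getProperty("language"));
        System.out.println(merged.getProperty("timeout"));
    }
}
```

In a real setup the second stream would come from a FileInputStream on our own properties file rather than from inline text.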
For the moment, we have modified the Tika source code and replaced the body
of the TesseractOCRConfig default constructor,
from:
init(this.getClass().getResourceAsStream("TesseractOCRConfig.properties"));
to:
init(new FileInputStream("/opt/datafari/tomcat/conf/datafari-tika-ocr.properties"));
Now, we would like to have a cleaner solution to the problem.
I had a look at the Tika source code, and TesseractOCRConfig also has a
constructor that takes a parameter:
public TesseractOCRConfig(InputStream is) {
    init(is);
}
With this method of TesseractOCRParser:
public Set<MediaType> getSupportedTypes(ParseContext context) {
    // If Tesseract is installed, offer our supported image types
    TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
    if (hasTesseract(config))
        return SUPPORTED_TYPES;
    // Otherwise don't advertise anything, so the other image parsers
    // can be selected instead
    return Collections.emptySet();
}
it looks like it is possible to pass an instance of TesseractOCRConfig by
means of a ParseContext.
If an instance is present in the context, the code uses that one; otherwise
it uses a default instance.
The TesseractOCRConfig instance passed in could be created with the
constructor that takes a parameter, given an input stream reading from our
own file:
/opt/datafari/tomcat/conf/datafari-tika-ocr.properties
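To illustrate the lookup-with-fallback behaviour I am relying on, here is a small self-contained sketch. Context and OcrConfig are simplified, hypothetical stand-ins written for this mail, not Tika's ParseContext and TesseractOCRConfig; the sketch only mirrors the shape of the context.get(TesseractOCRConfig.class, DEFAULT_CONFIG) call quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these types come from Tika.
final class ContextLookupSketch {

    // Stand-in for an OCR configuration object.
    static final class OcrConfig {
        final String language;

        OcrConfig(String language) {
            this.language = language;
        }
    }

    // Stand-in for a parse context: objects stored under their class.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        // Returns the stored object, or the given default if none was set.
        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("eng");

    // Mirrors the parser side: use the caller's config if present.
    static String languageFor(Context context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG).language;
    }

    public static void main(String[] args) {
        Context empty = new Context();
        Context custom = new Context();
        custom.set(OcrConfig.class, new OcrConfig("fra"));

        System.out.println(languageFor(empty));   // falls back to the default
        System.out.println(languageFor(custom));  // uses the supplied config
    }
}
```

If the real classes behave the same way, the open point is only how to get such a context through the Solr extractor.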
So, do you know whether it is possible to call Tika from Solr while passing
a specific context?
And if it is doable, do you have any hints on how to do it?
FYI: we are using Tika for our open source enterprise search engine
"Datafari".
Thanks,
--
Best regards,
*Giovanni Usai*
www.francelabs.com <http://www.francelabs.com/>
CEEI Nice Premium
1 Bd. Maître Maurice Slama
06200 Nice FRANCE
Ph: +33 (0)9 72 43 72 85