It's often much easier to approach this by running Tika separately, outside of Solr.
Here's a blog on both the reasoning and sample code:
Among other things, you have a lot more control over how Tika operates.
On Tue, Mar 6, 2018 at 12:36 AM, lala <labisha...@gmail.com> wrote:
> I am working with Solr 7, indexing multilingual files in a folder
> using DIH (FileListEntityProcessor for the base entity and
> TikaEntityProcessor for the child entity in the configuration file).
> My problem is this: I want to extract text from images inside PDF
> files. This works fine with the /update/extract request handler, where I set
> the "parseContext.config" attribute to an XML file, let's say "context.xml",
> in which I set the property "extractInlineImages" for the entry
> [PDFParserConfig] to true. But I have no idea how to set
> parseContext.config in the DIH configuration.
> I tried these approaches; none of them worked:
> - setting the tikaConfig attribute in the DIH config file to my "context.xml",
> which obviously won't work since the Tika config format is different.
> - setting the parseContext.config attribute on my "\dataImport"
> requestHandler, which didn't work either.
> I googled a lot with no result. I would really appreciate any help here!
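
For anyone following along, the "context.xml" described in the quoted message would look roughly like the fragment below. This is only a sketch, assuming the parseContext.config format documented in the Solr Reference Guide for the extracting request handler. The file name is the poster's own example, and the fragment applies to /update/extract, not to DIH.

```xml
<!-- context.xml: referenced from /update/extract via parseContext.config -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- ask Tika's PDF parser to process images embedded in the PDF -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

When Tika is run separately, the equivalent setting is made in code: call setExtractInlineImages(true) on a PDFParserConfig object and place that object in the ParseContext handed to the parser.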