It's often much easier to approach this by running Tika separately, outside of Solr. Here's a blog post covering both the reasoning and sample code:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

Among other things, you have a lot more control over how Tika operates.

Best,
Erick

On Tue, Mar 6, 2018 at 12:36 AM, lala <labisha...@gmail.com> wrote:
> Hi,
>
> I am working with Solr 7, indexing multilingual files in a folder
> using DIH (FileListEntityProcessor for the base entity and
> TikaEntityProcessor for the child entity in the configuration file).
>
> My problem lies here: I want to extract text from images inside PDF
> files. That works fine with the /update/extract request handler, where I set
> the "parseContext.config" attribute to an XML file, let's say "context.xml",
> in which I set the property "extractInlineImages" for the entry
> [PDFParserConfig] to true. But I have no idea how to set
> parseContext.config in the DIH configuration.
>
> I tried these approaches; none of them worked:
>
> - Set the tikaConfig attribute in the DIH config file to my "context.xml".
>   This obviously won't work, since the Tika config format is different.
> - Set the parseContext.config attribute on my "\dataImport"
>   requestHandler. This didn't work either.
>
> I googled a lot with no result. I would really appreciate any help here!
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
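
For illustration, the run-Tika-separately approach suggested in the reply might look roughly like the sketch below. It is not the code from the linked post. It assumes Tika 1.x and SolrJ 7.x on the classpath, a schema with "id" and "content" fields, and a Solr core URL passed on the command line; the class name TikaSolrJIndexer and both field names are hypothetical. Getting text out of the extracted images additionally assumes an OCR engine such as Tesseract is installed where Tika can find it.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not a production indexer.
public class TikaSolrJIndexer {

    // Programmatic equivalent of setting extractInlineImages=true for PDFParserConfig.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser lets embedded resources (the inline images) be parsed too.
        context.set(Parser.class, parser);
        return context;
    }

    // Parse one file with Tika and wrap the extracted text in a Solr document.
    static SolrInputDocument toSolrDoc(Path file, Parser parser) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 removes the write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, buildContext(parser));
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.toAbsolutePath().toString()); // assumed uniqueKey field
        doc.addField("content", handler.toString());          // assumed text field
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String solrUrl = args[0];          // e.g. the URL of your core (assumption)
        Path folder = Paths.get(args[1]);  // folder holding the files to index

        List<Path> files;
        try (Stream<Path> listing = Files.list(folder)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        Parser parser = new AutoDetectParser();
        try (SolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            for (Path file : files) {
                solr.add(toSolrDoc(file, parser));
            }
            solr.commit();
        }
    }
}
```

Because the ParseContext is built in your own code, there is no need to find a DIH equivalent of parseContext.config; any other Tika setting can be applied the same way before calling parse.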