Re: Solr dih extract text from inline images in pdf

Erick Erickson Tue, 06 Mar 2018 07:23:44 -0800

It's often much easier to approach this by running Tika separately.
Here's a blog on both the reasoning and sample code:


https://lucidworks.com/2012/02/14/indexing-with-solrj/

Among other things, you have a lot more control over how Tika operates.
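
As a rough sketch of what that control looks like (this is not the code from the blog post, just a minimal illustration): with Tika in your own Java program you can put a PDFParserConfig with extractInlineImages enabled directly into the ParseContext, then send the extracted text to Solr with SolrJ. The class name InlineImageIndexer, the core URL http://localhost:8983/solr/mycore, and the "text" field are assumptions made up for the example; adjust them to your setup. It also assumes Tika 1.x and SolrJ 7 on the classpath, and that Tesseract is installed if you want the inline images to be OCR'd.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not a production indexer.
public class InlineImageIndexer {

    // Build a ParseContext that asks the PDF parser to expose inline images.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser lets Tika recurse into embedded content
        // (the extracted images), where an OCR parser can pick them up.
        context.set(Parser.class, parser);
        return context;
    }

    // Parse one file and return its text; -1 disables the write limit.
    static String extractText(Path file, Parser parser) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata(), buildContext(parser));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        Parser parser = new AutoDetectParser();

        // Assumed core URL and field names; change them to match your schema.
        try (SolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.toString());
            doc.addField("text", extractText(file, parser));
            solr.add(doc);
            solr.commit();
        }
    }
}
```

For a whole folder you would loop over the files and batch the add() calls instead of committing per document.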

Best,
Erick

On Tue, Mar 6, 2018 at 12:36 AM, lala <[email protected]> wrote:
> Hi,
>
> I am working with Solr 7, indexing multilingual files in a folder
> using DIH (FileListEntityProcessor for the base entity and
> TikaEntityProcessor for the child entity in the configuration file).
>
> My problem is this: I want to extract text from images inside PDF
> files. That works fine with the /update/extract request handler, where I set
> the "parseContext.config" attribute to an XML file, let's say "context.xml",
> in which I set the property "extractInlineImages" for the entry
> [PDFParserConfig] to true. But I have no idea how to set
> parseContext.config in the DIH configuration.
>
> I tried these approaches; neither worked:
>
>     - set the tikaConfig attribute in the DIH config file to my "context.xml";
> obviously this won't work, since the Tika config format is different.
>     - set the parseContext.config attribute on my "\dataImport"
> requestHandler; this didn't work either.
>
> I googled a lot with no result. I would really appreciate any help here!
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
