Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull
On 07/03/2018 13:29, lala wrote: Thanks Charlie... It's just confusing for me. In the DIH configuration file, for the inner entity that takes "TikaEntityProcessor" as its processor, I can easily specify a tikaConfig attribute pointing to an XML file located inside the config folder in the core, and where in

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Erick Erickson
You're missing Charlie's point, and if you read the blog I pointed you to that point is reiterated. DIH does the Tika processing on the Solr node that is _also_ indexing documents and satisfying queries. Parsing a semi-structured document (PDF in this case) consumes CPU cycles and memory, all _wit

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
I don't know what the problem is: when I post the message, the XML inside it is not displayed correctly. It should contain ["<"param name="extractInlineImages" type="bool">true] AND ["<"param name="sortByPosition" type="bool">true]... -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f47
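
The bracketed fragments above appear to be <param> elements from a Tika configuration file that the mail archive stripped. A minimal sketch of what such a file might look like, assuming Tika's usual tika-config.xml layout for configuring PDFParser; only the two parameter names and values come from the message, and the surrounding wrapper elements are an assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: wrapper structure is assumed, not taken from the original post -->
<properties>
  <parsers>
    <!-- Use the default parsers for everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Configure the PDF parser with the two parameters quoted in the message -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

A file like this would be the one referenced by the tikaConfig attribute on the TikaEntityProcessor entity mentioned elsewhere in the thread.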

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
Thanks Charlie... It's just confusing for me. In the DIH configuration file, for the inner entity that takes "TikaEntityProcessor" as its processor, I can easily specify a tikaConfig attribute pointing to an XML file located inside the config folder in the core, and where in this file I should be able to overr

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread Charlie Hull
On 07/03/2018 09:32, lala wrote: Thanks for your reply, Erick. Actually, I am using SolrJ to index files, among other operations with Solr, but to index a large number of different kinds of files, I'm sending a DIH request to Solr using the SolrJ API: FileListEntityProcessor with TikaEntityProcessor... W

Re: Solr dih extract text from inline images in pdf

2018-03-07 Thread lala
Thanks for your reply, Erick. Actually, I am using SolrJ to index files, among other operations with Solr, but to index a large number of different kinds of files, I'm sending a DIH request to Solr using the SolrJ API: FileListEntityProcessor with TikaEntityProcessor... Why not benefit from this technolog

Re: Solr dih extract text from inline images in pdf

2018-03-06 Thread Erick Erickson
It's often much easier to approach this by running Tika separately. Here's a blog on both the reasoning and sample code: https://lucidworks.com/2012/02/14/indexing-with-solrj/ Among other things, you have a lot more control over how Tika operates. Best, Erick On Tue, Mar 6, 2018 at 12:36 AM, la

Solr dih extract text from inline images in pdf

2018-03-06 Thread lala
Hi, I am working with Solr 7, indexing multilingual files in a folder using DIH (FileListEntityProcessor for the basic entity and TikaEntityProcessor for the child entity in the configuration file). My problem is this: I want to extract text from images inside PDF files, that works fine