On 07/03/2018 13:29, lala wrote:
Thanks Charlie...
It's just confusing for me. In the DIH configuration file, on the inner entity
that uses "TikaEntityProcessor" as its processor, I can easily specify a
tikaConfig attribute pointing to an XML file located inside the core's config
folder, and where in
You're missing Charlie's point, and if you read the blog I pointed you
to that point is reiterated.
DIH does the Tika processing on the Solr node that is _also_ indexing
documents and satisfying queries. Parsing a semi-structured document
(PDF in this case) consumes CPU cycles and memory, all _wit
I don't know what the problem is: when I post the message, the XML inside it
is not rendered correctly. It should contain
<param name="extractInlineImages" type="bool">true</param> and
<param name="sortByPosition" type="bool">true</param>...
Thanks Charlie...
It's just confusing for me. In the DIH configuration file, on the inner entity
that uses "TikaEntityProcessor" as its processor, I can easily specify a
tikaConfig attribute pointing to an XML file located inside the core's config
folder, and where in this file I should be able to overr
On 07/03/2018 09:32, lala wrote:
Thanks for your reply Erick,
Actually I am using SolrJ to index files, among other operations with Solr,
but to index a large number of different kinds of files I'm sending a DIH
request to Solr through the SolrJ API: FileListEntityProcessor with
TikaEntityProcessor...
W
Thanks for your reply Erick,
Actually I am using SolrJ to index files, among other operations with Solr,
but to index a large number of different kinds of files I'm sending a DIH
request to Solr through the SolrJ API: FileListEntityProcessor with
TikaEntityProcessor...
Why not benefit from this technolog
It's often much easier to approach this by running Tika separately.
Here's a blog on both the reasoning and sample code:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
Among other things, you have a lot more control over how Tika operates.
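
A minimal sketch of what "running Tika separately" can look like on the client side, using only the Java standard library. The class name ClientSideIndexer, the indexAll method, and the field names "id" and "text" are hypothetical and chosen for illustration. The extractor argument is a stand-in for a Tika parse call and the sender argument is a stand-in for a SolrJ add call; neither library is actually invoked here, so treat this as an outline of the control flow rather than the code from the blog.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;

/** Client-side indexing loop: text extraction runs here, not on the Solr node. */
public class ClientSideIndexer {

    /**
     * Extracts text from each file with the supplied extractor and hands the
     * resulting documents to the sender in batches of at most batchSize.
     * Returns the total number of documents handed to the sender.
     */
    public static int indexAll(List<String> files,
                               Function<String, String> extractor,
                               Consumer<List<Map<String, Object>>> sender,
                               int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<Map<String, Object>> batch = new ArrayList<>();
        int sent = 0;
        for (String file : files) {
            // Assumption: in a real client this is where Tika would parse the file.
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", file);
            doc.put("text", extractor.apply(file));
            batch.add(doc);
            if (batch.size() == batchSize) {
                // Assumption: in a real client this is where SolrJ would send the batch.
                sender.accept(new ArrayList<>(batch));
                sent += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, partially filled batch.
            sender.accept(new ArrayList<>(batch));
            sent += batch.size();
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.pdf");
        int sent = indexAll(
                files,
                file -> "text of " + file,  // stand-in extractor
                batch -> System.out.println("sending " + batch.size() + " docs"),  // stand-in sender
                2);
        System.out.println("sent " + sent);
    }
}
```

Because extraction and sending are passed in as functions, the parsing work stays in the client process, and the parser settings can be changed without touching the Solr core's configuration.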
Best,
Erick
On Tue, Mar 6, 2018 at 12:36 AM, la
Hi,
I am working with Solr 7, indexing multilingual files in a folder
using DIH (FileListEntityProcessor for the base entity and
TikaEntityProcessor for the child entity in the configuration file).
My problem lies here: I want to extract text from images inside PDF
files, that works fine