On Fri, Dec 2, 2016 at 4:36 PM, Chris Rogers
<chris.rog...@bodleian.ox.ac.uk> wrote:
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with SolrCloud 
> (solr 6.3.0, zookeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the Zookeeper node (a different 
> server from the solr nodes in my setup).
>
> With this in mind, where is the baseDir attribute in the 
> FileListEntityProcessor config relative to? I’m seeing the config in the Solr 
> GUI, and I’ve tried setting it as an absolute path on my Zookeeper server, 
> but this doesn’t seem to work… any ideas how this should be set up?
>
> My DIH config is below:
>
> <dataConfig>
>   <dataSource type="FileDataSource"/>
>   <document>
>     <!-- this outer processor generates a list of files satisfying the
>          conditions specified in the attributes -->
>     <entity name="f" processor="FileListEntityProcessor"
>             fileName=".*xml"
>             newerThan="'NOW-5YEARS'"
>             recursive="true"
>             rootEntity="false"
>             dataSource="null"
>             baseDir="/home/bodl-zoo-svc/files/">
>
>       <!-- this processor extracts content using XPath from each file found -->
>
>       <entity name="tei" processor="XPathEntityProcessor"
>               forEach="/TEI" url="${f.fileAbsolutePath}"
>               transformer="RegexTransformer">
>         <field column="manuscript_title" name="manuscript_title"
>                xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>         <field column="repository" name="repository"
>                xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>         <field column="id" name="id"
>                xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>       </entity>
>
>     </entity>
>
>   </document>
> </dataConfig>
>
>
> This same script worked as expected on a single solr node (i.e. not in 
> SolrCloud mode).
>
> Thanks,
> Chris
>

Hey Chris

We hit the same problem moving from non-cloud to cloud. We had a
collection that loaded its DIH config from various XML files listing
the DB queries to run, so we wrote a simple DataSource plugin to
load the config from Zookeeper instead of local disk. That avoids
having to distribute those config files around the cluster.

https://issues.apache.org/jira/browse/SOLR-8557

Cheers

Tom
