Maarten,

Here is a sample set-up that lets you build your caches in parallel and then
index off the caches in a subsequent step.  See below for the solrconfig.xml
snippet and the text of the 4 data-config.xml files.  In this example a cache
is built for the parent also, although that is not strictly necessary.  It's
cleaner, though, to just cache everything so that the final step works
against caches only.

Here's how it works.  First, begin a full import for each of the cache builders 
by issuing these commands all at once.  Each of these builds a cache:
/solrcore/dih-parent?command=full-import
/solrcore/dih-child1?command=full-import
/solrcore/dih-child2?command=full-import

You then need to poll each of these handlers' status screens (e.g.
/solrcore/dih-parent?command=status) and wait until they all finish.  Once
done, issue this command, which reads back the caches and indexes the data
to your solr core:
/solrcore/dih-master?command=full-import

The tricky thing here is automating it all.  You'll need something that
issues the commands, polls the responses, and so on; see the sketch after
this paragraph.  For my case, I ended up writing a very hacky program that
runs 12 cache-building handlers at once, starting a new one whenever one
finishes, until all 50 or so are complete.  It then runs the master DIH
handlers.  (An additional complexity for our situation, not shown here, is
that I'm using the DIH cache partitioning feature to make multiple
partitions; I then have multiple master handlers that each index a slice of
the data at the same time, making the "master" step finish faster on a
multi-processor machine.)
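
For illustration only, here is a minimal orchestration sketch in Python (not
my actual program; the host, port, core name, and handler list are
assumptions you'd adjust).  It kicks off up to 12 cache builders at a time,
polls each handler's status until it reports idle, then runs the master
import.  It assumes the status response, with wt=json, exposes a top-level
"status" field that reads "busy" while an import is running and "idle" once
it completes:

# orchestrate.py -- hypothetical helper, sketch only
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOLR = "http://localhost:8983/solr/solrcore"  # assumed host/port/core name
CACHE_HANDLERS = ["dih-parent", "dih-child1", "dih-child2"]  # ...up to ~50
MAX_CONCURRENT = 12

def get_json(url):
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def run_import(handler):
    # full-import returns immediately; DIH keeps running in the background
    get_json(SOLR + "/" + handler + "?command=full-import&wt=json")
    time.sleep(1)  # give DIH a moment to flip its status to "busy"
    while get_json(SOLR + "/" + handler + "?command=status&wt=json")["status"] != "idle":
        time.sleep(5)

# build all the caches, at most MAX_CONCURRENT running at once
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    list(pool.map(run_import, CACHE_HANDLERS))

# the caches are complete; now read them back and index to the core
run_import("dih-master")

A production version should also detect a failed import from the status
response rather than polling forever.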

Another thing that is confusing with all this is that to build the caches,
you send all the cache params as request parameters, included in
solrconfig.xml here as defaults.  But for the master indexing, these same
settings are attributes on the entity in data-config.xml (compare the
/dih-parent handler below with dataconfig-master.xml: persistCacheName, for
instance, appears in both forms).  Should this feature ever get committed,
it would be better if this changed to allow all the configuration to occur
in data-config.xml, for both building caches and reading caches.

One last thing: you might want to open a JIRA issue about JDBCDataSource not
honoring the JDBC Driver parameter that you're trying to pass through
(https://issues.apache.org/jira/browse/SOLR).  If you don't have an account,
you'll need to create one to open a new issue.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<!-- 4 handlers declared in solrconfig.xml -->
<requestHandler name="/dih-parent" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-parent.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">PARENT</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <!-- ID is Oracle's "number" type, which the JDBC driver brings in as a
         BigDecimal.  The field always contains an Integer, so we can
         optimize for that case.
         See org.apache.solr.handler.dataimport.DIHCacheTypes
    -->
    <str name="persistCacheFieldNames">ID,                 SOME_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <!-- all bdb-je caches being built at the same time share this 100mb cache -->
    <str name="berkleyInternalCacheSize">100000000</str> 
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child1" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child1.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD1</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_ONE_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-child2" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child2.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD2</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID,          CHILD_TWO_DATA</str> 
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>
<requestHandler name="/dih-master" 
class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">dataconfig-master.xml</str>
  <str name="clean">true</str>
  <str name="commit">true</str>
  <str name="optimize">false</str>
</lst>
</requestHandler>


<!-- dataconfig-parent.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="PARENT">
    <entity name="PARENT" dataSource="zzz" query="SELECT ID, SOME_DATA FROM 
PARENT" />
  </document>
</dataConfig>

<!-- dataconfig-child1.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD1">
    <entity name="CHILD1" dataSource="zzz" query="SELECT PARENT_ID, 
CHILD_ONE_DATA FROM CHILD1" />
  </document>
</dataConfig>

<!-- dataconfig-child2.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD2">
    <entity name="CHILD2" dataSource="zzz" query="SELECT PARENT_ID, 
CHILD_TWO_DATA FROM CHILD2" />
  </document>
</dataConfig>

<!-- dataconfig-master.xml -->
<dataConfig>
  <document name="MASTER">
    <entity name="PARENT"
      processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
      cacheKey="ID"      
      persistCacheBaseDir="/path/to/caches"
      persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
      persistCacheName="PARENT"
      berkleyInternalCacheSize="100000000" <!-- all bdb-je caches share this 
100mb cache -->
      berkleyInternalShared="true"
    >
      <entity
        name="CHILD1"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD1"
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"        
      />
      <entity
        name="CHILD2"
        processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
        cacheKey="PARENT_ID"
        cacheLookup="PARENT.ID"        
        persistCacheBaseDir="/path/to/caches"
        persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
        persistCacheName="CHILD2"        
        berkleyInternalCacheSize="100000000"
        berkleyInternalShared="true"
      />
    </entity>
  </document>
</dataConfig>




-----Original Message-----
From: mroosendaal [mailto:mroosend...@yahoo.com] 
Sent: Friday, November 16, 2012 8:19 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH nested entities don't work

Hi,

You are correct about not wanting to index everything every day; however,
for this PoC I need a 'bootstrap' mechanism which basically does what Endeca
does.

The 'defaultRowPrefetch' in the solrconfig.xml does not seem to take effect;
I'll have a closer look.

Regarding the long run time, it appeared that one of the views I was reading
was also by far the biggest, with over 4 million entries.  Other views
should take much less time.

With regard to the parallel processing, I have the 2 classes you mention
and have packaged them.  The documentation in the patch was not clear on how
exactly to do that.  My assumption is that
* for every entity you have to define a DIH handler in solrconfig.xml and
refer to a specific data-config-<entity>.xml
* you define 1 import handler for the join in solrconfig.xml
* what isn't clear is what a data-config-<entity>.xml should look like (for
example, I see no reference in the documentation to a cacheName)
* nor how the data-config-join.xml should look
