Maarten,

Here is a sample set-up that lets you build your caches in parallel and then index off the caches in a subsequent step. See below for the solrconfig.xml snippet and the text of the 4 data-config.xml files. In this example it builds a cache for the parent also, which is not strictly necessary, but I guess it's cleaner-looking to cache everything so that the final step works against caches only.

Here's how it works. First, begin a full import for each of the cache builders by issuing these commands all at once. Each of these builds a cache:

/solrcore/dih-parent?command=full-import
/solrcore/dih-child1?command=full-import
/solrcore/dih-child2?command=full-import

You then need to poll each handler's status screen and wait until they all finish. Once done, issue this command, which reads back the caches and indexes the data to your Solr core:

/solrcore/dih-master?command=full-import
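To make the cycle concrete, here is a rough sketch of it in Java. This is only an illustration under assumptions of my own: Solr listening at localhost:8983, a core named "solrcore", and a DIH status response that contains the word "idle" once an import has finished (full-import itself returns immediately, which is why you have to poll):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DihCacheOrchestrator {
    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String BASE = "http://localhost:8983/solr/solrcore"; // assumed core URL

    // Issue a GET against the core and return the raw response body.
    static String get(String path) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + path)).build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String[] builders = {"/dih-parent", "/dih-child1", "/dih-child2"};
        for (String h : builders) {               // kick off all cache builds at once
            get(h + "?command=full-import");
        }
        for (String h : builders) {               // poll until every build reports idle
            while (!get(h + "?command=status").contains("idle")) {
                Thread.sleep(5000);
            }
        }
        get("/dih-master?command=full-import");   // then read the caches back and index
    }
}

The same thing can of course be done with curl and a shell loop if you prefer.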
The tricky thing here is automating it all. You'll need something that issues the commands and then polls the responses, etc. For my case, I ended up writing a very hacky program that runs 12 cache-building handlers at once, starting a new one whenever one finishes, until all 50 or so are complete. It then runs the master DIH handlers. (An additional complexity for our situation, not shown here, is that I'm using the DIH cache partitioning feature to build multiple partitions; I then have multiple master handlers that each index a slice of the data at the same time, which makes the "master" step finish faster on a multi-processor machine.)
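A bounded thread pool is one simple way to get that "N at a time" behavior; my actual program is hackier than this. The sketch below reuses the hypothetical get() helper from the previous example (same package), and the handler list and pool size are again placeholders:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedCacheBuilds {
    // Run the given cache-building handlers, at most maxConcurrent at a time.
    static void run(List<String> handlers, int maxConcurrent) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        for (String h : handlers) {
            pool.submit(() -> {
                try {
                    DihCacheOrchestrator.get(h + "?command=full-import");
                    // full-import returns immediately, so hold this pool slot until
                    // the build reports idle; that is what bounds the number of
                    // simultaneous imports.
                    while (!DihCacheOrchestrator.get(h + "?command=status").contains("idle")) {
                        Thread.sleep(5000);
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);  // wait for all builds to complete
    }
}

Something like run(allFiftyHandlers, 12), followed by the master import(s), approximates what my program does.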
Another thing that is very confusing with all this is that to build the caches, you send all the cache params as request parameters (included in solrconfig.xml here), but for the master indexing, these same settings are parameters on the entity in data-config.xml. Should this feature ever get committed, it would perhaps be better to allow all the configuration to occur in data-config.xml, both for building caches and for reading them.

One last thing: you might want to open a JIRA issue about JdbcDataSource not honoring the JDBC driver parameter that you're trying to pass through. https://issues.apache.org/jira/browse/SOLR If you don't have an account, you need to create one to open a new issue.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<!-- 4 handlers declared in solrconfig.xml -->

<requestHandler name="/dih-parent" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-parent.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">PARENT</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <!-- ID is Oracle's "number" type, which the JDBC driver brings in as a
         BigDecimal. The field always contains an Integer, so we can optimize
         for that case. See org.apache.solr.handler.dataimport.DIHCacheTypes -->
    <str name="persistCacheFieldNames">ID, SOME_DATA</str>
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <!-- all bdb-je caches being built at the same time share this 100mb cache -->
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>

<requestHandler name="/dih-child1" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child1.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD1</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID, CHILD_ONE_DATA</str>
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>

<requestHandler name="/dih-child2" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-child2.xml</str>
    <str name="clean">true</str>
    <str name="persistCacheBaseDir">/path/to/caches</str>
    <str name="persistCacheName">CHILD2</str>
    <str name="persistCacheImpl">org.apache.solr.handler.dataimport.BerkleyBackedCache</str>
    <str name="persistCacheFieldNames">PARENT_ID, CHILD_TWO_DATA</str>
    <str name="persistCacheFieldTypes">BIGDECIMAL_INTEGER, STRING</str>
    <str name="cacheKey">PARENT_ID</str>
    <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
    <str name="berkleyInternalCacheSize">100000000</str>
    <str name="berkleyInternalShared">true</str>
  </lst>
</requestHandler>

<requestHandler name="/dih-master" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dataconfig-master.xml</str>
    <str name="clean">true</str>
    <str name="commit">true</str>
    <str name="optimize">false</str>
  </lst>
</requestHandler>

<!-- dataconfig-parent.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="PARENT">
    <entity name="PARENT" dataSource="zzz" query="SELECT ID, SOME_DATA FROM PARENT" />
  </document>
</dataConfig>

<!-- dataconfig-child1.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD1">
    <entity name="CHILD1" dataSource="zzz" query="SELECT PARENT_ID, CHILD_ONE_DATA FROM CHILD1" />
  </document>
</dataConfig>

<!-- dataconfig-child2.xml -->
<dataConfig>
  <dataSource name="zzz" driver="xxx" url="ccc" />
  <document name="CHILD2">
    <entity name="CHILD2" dataSource="zzz" query="SELECT PARENT_ID, CHILD_TWO_DATA FROM CHILD2" />
  </document>
</dataConfig>

<!-- dataconfig-master.xml -->
<dataConfig>
  <document name="MASTER">
    <!-- all bdb-je caches share this 100mb cache -->
    <entity name="PARENT"
            processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
            cacheKey="ID"
            persistCacheBaseDir="/path/to/caches"
            persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
            persistCacheName="PARENT"
            berkleyInternalCacheSize="100000000"
            berkleyInternalShared="true">
      <entity name="CHILD1"
              processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
              cacheKey="PARENT_ID"
              cacheLookup="PARENT.ID"
              persistCacheBaseDir="/path/to/caches"
              persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
              persistCacheName="CHILD1"
              berkleyInternalCacheSize="100000000"
              berkleyInternalShared="true" />
      <entity name="CHILD2"
              processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
              cacheKey="PARENT_ID"
              cacheLookup="PARENT.ID"
              persistCacheBaseDir="/path/to/caches"
              persistCacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
              persistCacheName="CHILD2"
              berkleyInternalCacheSize="100000000"
              berkleyInternalShared="true" />
    </entity>
  </document>
</dataConfig>
-----Original Message-----
From: mroosendaal [mailto:mroosend...@yahoo.com]
Sent: Friday, November 16, 2012 8:19 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH nested entities don't work

Hi,

You are correct about not wanting to index everything every day; however, for this PoC I need a 'bootstrap' mechanism which basically does what Endeca does.

The 'defaultRowPrefetch' in the solrconfig.xml does not seem to take; I'll have a closer look. As for the long run time, it appeared that one of the views I was reading was also by far the biggest, with over 4 million entries. Other views should take much less time.

With regard to the parallel processing, I have the 2 classes you mention and have packaged them. The documentation in the patch was not clear on how exactly to do that. My assumption is that:
* for every entity you have to define a DIH in the solrconfig and refer to a specific data-config-<entity>.xml
* you define 1 import handler for the join in the solrconfig
* what isn't clear is how a data-config-<entity>.xml should look (for example, I see no reference in the documentation to a cacheName)
* nor how the data-config-join.xml should look