On 3/26/2015 5:19 PM, Julian Perry wrote:
> I have an index with, say, 10 fields.
>
> I load that index directly from Oracle - data-config.xml using
> JDBC. I can load 10 million rows very quickly. This direct
> way of loading from Oracle straight into SOLR is fantastic -
> really efficient and saves writing loads of import/export code
> (e.g. via a CSV file).
>
> Of those 10 fields - two of them (set to multiValued) come from
> a separate table and there are anything from 1 to 10 rows per
> row from the main table.
>
> I can use a nested entity to extract the child rows for each of
> the 10m rows in the main table - but then SOLR generates 10m
> separate SQL calls - and the load time goes from a few minutes
> to several days.
>
> On smaller tables - just a few thousand rows - I can use a
> second nested entity with a JDBC call - but not for very large
> tables.
>
> Could I load the data in two steps:
> 1) load the main 10m rows
> 2) load into the existing index by adding the data from a
>    second SQL call into fields for each existing row (i.e.
>    an UPDATE instead of an INSERT).
>
> I don't know what syntax/option might achieve that. There
> is incremental loading - but I think that replaces whole rows
> rather than updating individual fields. Or maybe it does
> do both?
If those child tables do not have a large number of entries, you can
configure caching on the inner entities so that the information doesn't
need to be requested from the database server for every row of the main
table. If there are a large number of entries, then that may not be
possible due to memory constraints.

https://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor

If that's not practical, then the only real option you have is to drop
back to one entity, and build a single SELECT statement (using JOIN and
some form of CONCAT) that gathers all the information from all the
tables at the same time, combining the multiple values into one SQL
result field with some kind of delimiter. Then you can use the
RegexTransformer's "splitBy" functionality to turn the concatenated
data back into multiple values for your multi-valued field. Database
servers tend to be REALLY good at JOIN operations, so the database
would be doing the heavy lifting.

https://wiki.apache.org/solr/DataImportHandler#RegexTransformer

Solr does have an equivalent concept to SQL's UPDATE (atomic updates),
but there are enough caveats to using it that it may not be a good
option:

https://wiki.apache.org/solr/Atomic_Updates

Thanks,
Shawn
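For illustration, a cached inner entity in data-config.xml might look something like the sketch below. The entity, table, and column names (main_table, child_table, item_id, tag) are hypothetical placeholders, not taken from Julian's schema:

```xml
<!-- Sketch only: table and column names are hypothetical.
     CachedSqlEntityProcessor runs the inner query ONCE, caches the
     whole result set in memory keyed by cacheKey, and answers each
     per-row lookup from that cache instead of hitting the database. -->
<entity name="item" query="SELECT id, name FROM main_table">
  <entity name="tags"
          processor="CachedSqlEntityProcessor"
          query="SELECT item_id, tag FROM child_table"
          cacheKey="item_id"
          cacheLookup="item.id">
    <field column="tag" name="tag"/>
  </entity>
</entity>
```

This is what makes the memory constraint real: the entire child table has to fit in the JVM heap for the cache to work.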
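A sketch of the single-SELECT approach, assuming Oracle 11gR2 or later where LISTAGG is available as the CONCAT-style aggregate (again, all table and column names are hypothetical):

```xml
<!-- Sketch only: names are hypothetical. LISTAGG collapses the child
     rows into one delimited string per main-table row, so Solr issues
     a single query; splitBy is a regex, hence the escaped pipe. Pick
     a delimiter that cannot appear in the data. -->
<entity name="item" transformer="RegexTransformer"
        query="SELECT m.id, m.name,
                      LISTAGG(c.tag, '|')
                        WITHIN GROUP (ORDER BY c.tag) AS tags
               FROM main_table m
               LEFT JOIN child_table c ON c.item_id = m.id
               GROUP BY m.id, m.name">
  <field column="tags" splitBy="\|"/>
</entity>
```

With this shape the 10m-row load stays a single streamed result set, and the splitting back into multiple values happens cheaply inside Solr.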
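For completeness, the two-step UPDATE-style load Julian asked about would use atomic updates, roughly as sketched below (hypothetical id and field name). Among the caveats on the wiki page: every field in the schema must be stored, since Solr rebuilds the whole document behind the scenes:

```json
{
  "id": "123",
  "tags": { "set": ["value1", "value2"] }
}
```

Sent to the update handler, the "set" modifier replaces just the tags field on the existing document - but at 10m documents this is 10m individual updates, so it is unlikely to beat the single-SELECT approach.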