delta-import and cache (a story in conflict)

Keith Naas Tue, 14 May 2013 12:24:21 -0700

Thanks for all the great work on Solr. We have used it for over a year and have 
been very satisfied with it.


However, we have noticed that some of the recent changes have hurt import 
caching.  We are using Solr 4.2.0.

We use full and delta imports.  We only use a delta import query on the root 
entity (our object model does not safely support updates to the nested 
entities).

Here is a snippet of the xml.

<entity name="product" pk="ID" query="..." deltaImportQuery="..."
        deltaQuery="..." deletedPkQuery="...">
    <field column="ID" name="id" />
    <field column="NAME" name="name" />
    ...
    <entity name="productSize"
            query="..."
            processor="CachedSqlEntityProcessor"
            cacheKey="PRODUCT_ID" cacheLookup="product.ID">
        <entity name="productSizeAttributes"
                query="..."
                processor="CachedSqlEntityProcessor"
                cacheKey="SIZE_ID" cacheLookup="productSize.SIZE_ID"
                transformer="LogTransformer"
                logLevel="info"
                logTemplate="The size for product ${product.ID} is ${productSizeAttributes}">
            <field column="SIZE_ID" name="size" />
            <field column="SIZE_NAME" name="sizeName" />
            <field column="SIZE_CODE" name="sizeCode" />
        </entity>
    </entity>
</entity>

We have noticed that delta imports that used to take 30 seconds now run 
indefinitely and eventually cause an OutOfMemory condition on a huge multi GB 
Heap.  Here is the Stack Trace.

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.util.AbstractCollection.toString(AbstractCollection.java:422)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.common.SolrInputField.toString(SolrInputField.java:215)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at java.util.AbstractCollection.toString(AbstractCollection.java:422)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.common.SolrInputDocument.toString(SolrInputDocument.java:192)
        at java.lang.String.valueOf(String.java:2826)
        at java.lang.StringBuilder.append(StringBuilder.java:115)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:524)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
        at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:353)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:219)
        at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:451)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:489)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)

DocBuilder.buildDocument line 354 in Solr 4.2.0:   SolrException.log(LOG, 
"Exception while processing: " + epw.getEntity().getName() + " document : " + 
doc, e);

The call to doc.toString() appends every SolrInputField to the string.  Why 
are the SolrInputFields so big?

It is hard to say, because the original exception is not logged.  After 
debugging for a few days, it appears that during a delta-import the cache is 
destroyed prematurely.

DocBuilder.buildDocument is called for each row returned by the deltaQuery.  In 
its finally block, buildDocument calls destroy on every 
EntityProcessorWrapper.  This eventually calls destroy on EntityProcessorBase, 
which destroys cacheSupport and then sets it to null.  On every subsequent 
buildDocument call, EntityProcessorBase.init() is eventually executed.  It 
checks the isFirstInit flag (which is now false) and skips re-initializing the 
cache (which likely should never have been destroyed except on the last row 
returned by the deltaQuery).

Finally, when the rows for the nested entities are fetched, it skips the cache 
behavior, re-executes the SQL, and loads every single row from the nested 
entities as new fields in each document.

Thus, if a query returned 100000 productSize records, every product after the 
first would end up with all 100000 productSizes attached to it.
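
To make the sequence above easier to follow, here is a small, self-contained 
Java sketch of it.  This is a toy model and not Solr's code: 
CacheLifecycleSketch, ToyCachedProcessor, rowsFor and childCountsPerParent are 
hypothetical names, and the only things borrowed from the description are the 
isFirstInit flag, the init/destroy pair, and the cache being set to null.  It 
assumes three parent rows with two child rows each.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the lifecycle described above; NOT Solr's actual code. */
final class CacheLifecycleSketch {

    /** Hypothetical stand-in for a cached child-entity processor. */
    static final class ToyCachedProcessor {
        private final List<String[]> table;       // child rows: {parentId, value}
        private Map<String, List<String>> cache;  // plays the role of cacheSupport
        private boolean isFirstInit = true;

        ToyCachedProcessor(List<String[]> table) {
            this.table = table;
        }

        /** Builds the cache on the first call only, as the post describes. */
        void init() {
            if (isFirstInit) {
                cache = new HashMap<>();
                for (String[] row : table) {
                    cache.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
                }
                isFirstInit = false;
            }
        }

        /** Keyed lookup when the cache exists; otherwise every row comes back. */
        List<String> rowsFor(String parentId) {
            if (cache != null) {
                return cache.getOrDefault(parentId, new ArrayList<>());
            }
            List<String> all = new ArrayList<>();
            for (String[] row : table) {
                all.add(row[1]);
            }
            return all;
        }

        /** Drops the cache, mirroring "sets cacheSupport to null". */
        void destroy() {
            cache = null;
        }
    }

    /** Returns how many child rows end up attached to each parent row. */
    static List<Integer> childCountsPerParent(boolean destroyAfterEachRow) {
        String[] parents = {"1", "2", "3"};
        List<String[]> table = new ArrayList<>();
        for (String parent : parents) {
            table.add(new String[] {parent, parent + "-S"});
            table.add(new String[] {parent, parent + "-M"});
        }
        ToyCachedProcessor processor = new ToyCachedProcessor(table);
        List<Integer> counts = new ArrayList<>();
        for (String parentId : parents) {
            processor.init();
            try {
                counts.add(processor.rowsFor(parentId).size());
            } finally {
                if (destroyAfterEachRow) {
                    processor.destroy();
                }
            }
        }
        processor.destroy(); // always clean up once the last row is done
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("destroy after each row: " + childCountsPerParent(true));
        System.out.println("destroy after last row: " + childCountsPerParent(false));
    }
}
```

In this model, destroying after each row prints [2, 6, 6]: the first parent 
gets its own two rows, and every later parent gets all six.  Destroying only 
after the last row prints [2, 2, 2].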

This behavior makes delta-imports unusable when caching is utilized in any 
release after this functionality was changed.

We have also noticed that caching does not seem to be honored when the SQL 
statement contains resolvable tokens ${}.  However, we can work around that 
for those two queries by disabling caching.  I cannot disable caching on the 
other 20 queries; imports would take hours.

Has anyone else seen this?

Keith Naas
