I've been troubleshooting an issue loading documents through DIH's
URLDataSource and XPathEntityProcessor, where we want to use the
$hasMore feature to request a follow-up URL.
I've been tinkering with this using a very simple example with two XML
files:
solr.xml:
<add>
<doc>
<field name="id">SOLR1000</field>
</doc>
<doc>
<field name="id">**HASMORE**</field>
</doc>
</add>
solr2.xml:
<add>
<doc>
<field name="id">SOLR2k</field>
</doc>
</add>
My DIH config is:
<?xml version="1.0"?>
<dataConfig>
<dataSource type="URLDataSource" baseUrl="file:///Users/erikhatcher/dev/solr/example/exampledocs/"
readTimeout="180000" connectionTimeout="60000"/>
<script>
<![CDATA[
function checkForMore(row, context) {
print("### checkForMore: " + row);
if (row.get('id') == '**HASMORE**') {
print("#### hasMore ####");
row.put('$hasMore', 'true');
row.put('$nextUrl', 'file:///Users/erikhatcher/dev/solr/example/exampledocs/solr2.xml');
row.put('$skipRow', 'true');
} else {
row.put('$hasMore', 'false');
}
return row;
}
]]>
</script>
<document name="docs">
<entity name="doc"
processor="XPathEntityProcessor"
url="solr.xml"
forEach="/add/doc"
stream="true"
transformer="DateFormatTransformer,TemplateTransformer,script:checkForMore"
onError="abort">
<field column="id" xpath="/add/doc/field[@name='id']"/>
</entity>
</document>
</dataConfig>
Without the else clause in checkForMore that sets $hasMore to false, an
infinite loop occurs and solr2.xml is requested repeatedly. This is
because once $hasMore is set on a row,
XPathEntityProcessor#readUsefulVars stores it in entity scope and it
never gets unset. Is this intentional? Shouldn't $hasMore be reset once
the next URL has been requested?
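
To make the looping behavior concrete, here is a minimal JavaScript sketch of what I believe is happening. It is a simplified model, not Solr's actual code: fetchAll, pages, resetFlagPerFetch, and maxFetches are hypothetical names of my own, and the only thing taken from DIH is the idea that $hasMore/$nextUrl are copied from a row into entity scope only when the row carries them.

```javascript
// Simplified model of the sticky entity-scope flag (NOT Solr's implementation).
// Each "page" is just the list of ids found in one XML file.
const pages = {
  'solr.xml': ['SOLR1000', '**HASMORE**'],
  'solr2.xml': ['SOLR2k'],
};

// Transformer as in my config, WITHOUT the else clause.
function withoutElse(row) {
  if (row.get('id') === '**HASMORE**') {
    row.set('$hasMore', 'true');
    row.set('$nextUrl', 'solr2.xml');
  }
  return row;
}

// Same transformer WITH the else clause that clears the flag.
function withElse(row) {
  withoutElse(row);
  if (row.get('id') !== '**HASMORE**') {
    row.set('$hasMore', 'false');
  }
  return row;
}

// Hypothetical fetch loop. maxFetches only exists to stop the runaway case.
function fetchAll(firstUrl, transform, resetFlagPerFetch, maxFetches) {
  const fetched = [];
  let url = firstUrl;
  let hasMore = false; // entity scope: survives from row to row
  let nextUrl = null;
  while (url !== null && fetched.length < maxFetches) {
    fetched.push(url);
    if (resetFlagPerFetch) {
      // The reset I am asking about: clear the flag before reading new rows.
      hasMore = false;
      nextUrl = null;
    }
    for (const id of pages[url]) {
      const row = transform(new Map([['id', id]]));
      // Copy the variables into entity scope only if the row defines them.
      if (row.has('$hasMore')) hasMore = row.get('$hasMore') === 'true';
      if (row.has('$nextUrl')) nextUrl = row.get('$nextUrl');
    }
    url = hasMore ? nextUrl : null;
  }
  return fetched;
}

console.log(fetchAll('solr.xml', withoutElse, false, 5)); // solr2.xml repeats until the cap
console.log(fetchAll('solr.xml', withElse, false, 5));    // solr.xml, then solr2.xml once
console.log(fetchAll('solr.xml', withoutElse, true, 5));  // solr.xml, then solr2.xml once
```

In this model, resetting the flag at the start of each fetch gives the same result as the else clause, which is the behavior I would have expected by default.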
On a related note, it would seem useful to allow $hasMore/$skipRow/
$nextUrl to be controlled from the XML data rather than solely from a
transformer. But $-prefixed fields are ignored by DIH, right?
I'm still looking for that holy grail of a good example leveraging
$hasMore/$nextUrl! :)
Thanks,
Erik