For what it's worth, it's also really easy to implement your own EntityProcessor. Extend EntityProcessorBase, then implement the getNext method to return a Map<String, Object> representing the row you want indexed. I did exactly this so I could reuse my Hibernate domain models to query for the data instead of SQL.
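A rough sketch of that pattern is below. It assumes the Solr 1.4-era DIH API, where EntityProcessorBase exposes a protected rowIterator field and a getNext() helper and the method you override is nextRow(); the Hibernate query is stubbed out here with hard-coded values, so the class name and field names are just placeholders:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.EntityProcessorBase;

    public class HibernateEntityProcessor extends EntityProcessorBase {

      @Override
      public void init(Context context) {
        super.init(context);
        // Attributes declared on the <entity> in data-config.xml are
        // available here via context.getEntityAttribute(...) if needed.
      }

      @Override
      public Map<String, Object> nextRow() {
        if (rowIterator == null) {
          // First call: materialize all rows. In practice these values would
          // come from a Hibernate query over the domain model rather than
          // being hard-coded.
          List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
          Map<String, Object> row = new HashMap<String, Object>();
          row.put("id", 1);
          row.put("title", "example document");
          rows.add(row);
          rowIterator = rows.iterator();
        }
        // Inherited helper: returns the next row from rowIterator, or null
        // when it is exhausted, which tells DIH this entity is finished.
        return getNext();
      }
    }

Each returned Map is one document row; keys are matched against the fields defined for the entity in data-config.xml.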
Brendan

On Apr 8, 2010, at 9:17 AM, Shawn Heisey wrote:

> On 4/7/2010 9:26 PM, bbarani wrote:
>> Hi,
>>
>> I am currently using DIH to index the data from a database. I am just
>> trying to figure out if there are any other open source tools I can use
>> just for indexing, and use SOLR for querying.
>>
>> I also thought of writing custom code for retrieving the data from the
>> database and using SOLRJ to add the data as documents into Lucene. One
>> doubt here: if I use custom code to retrieve the data and SOLRJ to commit
>> it, will the schema file still be used? I mean the field types /
>> analyzers / tokenizers etc. present in the schema file? Or do I need to
>> massage each value (to fit the corresponding data type) in my SOLRJ
>> program?
>>
>
> This response is more of an answer to your earlier message, where you asked
> about batch importing, than to this exact question, but this is where the
> discussion is, so I'm answering here. You could continue to use DIH and
> specify the batches externally. I actually wrote most of this in reply to
> another email just a few minutes ago.
>
> You can pass variables into DIH to specify the range of documents you want
> to work on and handle the batching externally. Start with a full-import or
> a delete/optimize to clear out the index, then do multiple delta-imports.
>
> Here's what I'm using as the queries in my latest iteration. The
> deltaImportQuery is identical to the regular query used for full-import.
> The deltaQuery is just something related that returns quickly; the
> information it returns is thrown away when the delta-import runs.
>
> query="SELECT * FROM ${dataimporter.request.dataTable} WHERE did >
>   ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid}
>   AND (did % ${dataimporter.request.numShards}) IN
>   (${dataimporter.request.modVal})"
>
> deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
>
> deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable} WHERE did >
>   ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid}
>   AND (did % ${dataimporter.request.numShards}) IN
>   (${dataimporter.request.modVal})"
>
> Then here is my URL template:
>
> http://HOST:PORT/solr/CORE/dataimport?command=COMMAND&dataTable=DATATABLE&numShards=NUMSHARDS&modVal=MODVAL&minDid=MINDID&maxDid=MAXDID
>
> And the Perl data structure that holds the replacements for the uppercase
> parts:
>
> $urlBits = {
>     HOST      => $cfg{'shards/inc.host1'},
>     PORT      => $cfg{'shards/inc.port'},
>     MODVAL    => $cfg{'shards/inc.modVal'},
>     CORE      => "live",
>     COMMAND   => "delta-import&commit=true&optimize=false",
>     DATATABLE => $cfg{dataTable},
>     NUMSHARDS => $cfg{numShards},
>     MINDID    => $cfg{maxDid},
>     MAXDID    => $dbMaxDid,
> };
>
> Good luck with your setup!
>
> Shawn
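For context, those query/deltaQuery/deltaImportQuery attributes live on an <entity> element in the DIH data-config.xml. A minimal sketch of how they might fit together follows; the dataSource settings, entity name, and pk below are placeholders, not Shawn's actual configuration:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost/dbname"
                  user="dbuser" password="dbpass"/>
      <document>
        <entity name="doc" pk="did"
                query="SELECT * FROM ${dataimporter.request.dataTable}
                       WHERE did &gt; ${dataimporter.request.minDid}
                         AND did &lt;= ${dataimporter.request.maxDid}
                         AND (did % ${dataimporter.request.numShards})
                             IN (${dataimporter.request.modVal})"
                deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
                deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable}
                       WHERE did &gt; ${dataimporter.request.minDid}
                         AND did &lt;= ${dataimporter.request.maxDid}
                         AND (did % ${dataimporter.request.numShards})
                             IN (${dataimporter.request.modVal})"/>
        <!-- With SELECT *, columns whose names match schema fields are mapped
             automatically; explicit <field column=".." name=".."/> elements
             are only needed when names differ. -->
      </document>
    </dataConfig>

The ${dataimporter.request.*} placeholders are filled from the request parameters in the dataimport URL, which is what lets the batching be driven externally as in the URL template above.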