On 4/7/2010 9:26 PM, bbarani wrote:
Hi,

I am currently using DIH to index the data from a database. I am just
trying to figure out whether there are any other open source tools I
could use just for indexing purposes, using SOLR only for querying.

I also thought of writing custom code to retrieve the data from the
database and using SOLRJ to add the data as documents into Lucene. One
doubt here: if I use custom code to retrieve the data and SOLRJ to
commit it, will the schema file still be used? I mean the field types /
analyzers / tokenizers etc. present in the schema file? Or do I need to
convert each value (to fit the corresponding data type) in my SOLRJ
program?


This is more an answer to your earlier message about batch importing than to this exact question, but since this is where the discussion is, I'm answering here. You could continue to use DIH and specify the batches externally; I actually wrote most of this in reply to another email just a few minutes ago.

You can pass variables into the DIH to specify the range of documents that you want to work on, and handle the batching externally. Start with a full-import or a delete/optimize to clear out the index and then do multiple delta-imports.
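
If you go the delete/optimize route, a delete-by-query against the update handler clears the index. Here's a minimal Perl sketch, assuming LWP::UserAgent; the host, port, and core are placeholders for your own:

use LWP::UserAgent;

# Delete everything, then commit and optimize in the same request.
my $response = LWP::UserAgent->new->post(
    'http://HOST:PORT/solr/CORE/update?commit=true&optimize=true',
    Content_Type => 'text/xml',
    Content      => '<delete><query>*:*</query></delete>',
);
die "clear failed: " . $response->status_line . "\n"
    unless $response->is_success;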

Here's what I'm using as the queries in my latest iteration. The deltaImportQuery is identical to the regular query used for full-import. The deltaQuery is just something related that returns quickly; the information it returns is thrown away when DIH does a delta-import.

query="SELECT * FROM ${dataimporter.request.dataTable} WHERE did > ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})"

deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"

deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable} WHERE did > ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid} AND (did % ${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">

Then here is my URL template:

http://HOST:PORT/solr/CORE/dataimport?command=COMMAND&dataTable=DATATABLE&numShards=NUMSHARDS&modVal=MODVAL&minDid=MINDID&maxDid=MAXDID

And the Perl data structure that holds the replacements for the uppercase parts:

$urlBits = {
  HOST => $cfg{'shards/inc.host1'},
  PORT => $cfg{'shards/inc.port'},
  MODVAL => $cfg{'shards/inc.modVal'},
  CORE => "live",
  COMMAND => "delta-import&commit=true&optimize=false",
  DATATABLE => $cfg{dataTable},
  NUMSHARDS => $cfg{numShards},
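  # Assumption: $cfg{maxDid} holds the highest did indexed by the
  # previous run, and $dbMaxDid is the current MAX(did) from the
  # database, so each delta-import picks up only the rows added
  # since last time.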
  MINDID => $cfg{maxDid},
  MAXDID => $dbMaxDid,
};
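
To tie it together, here is a minimal sketch of how that structure can drive the request, assuming LWP::UserAgent; my real script does more error handling, so treat this as illustrative:

use LWP::UserAgent;

# Fill in the uppercase placeholders from $urlBits above.
my $url = "http://HOST:PORT/solr/CORE/dataimport?command=COMMAND"
        . "&dataTable=DATATABLE&numShards=NUMSHARDS"
        . "&modVal=MODVAL&minDid=MINDID&maxDid=MAXDID";
while (my ($token, $value) = each %$urlBits) {
    $url =~ s/$token/$value/;
}

# Kick off the import; die if Solr doesn't respond with success.
my $response = LWP::UserAgent->new->get($url);
die "dataimport request failed: " . $response->status_line . "\n"
    unless $response->is_success;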

Good luck with your setup!

Shawn
