On 4/7/2010 9:26 PM, bbarani wrote:
Hi,
I am currently using DIH to index data from a database. I am trying
to figure out whether there are any other open source tools I can use just
for indexing, while using Solr for querying.
I also thought of writing custom code to retrieve the data from the
database and using SolrJ to add the data as documents to Lucene. One doubt
here: if I use custom code to retrieve the data and SolrJ to commit it,
will the schema file still be used? I mean the field types / analyzers /
tokenizers present in the schema file, or do I need to massage each value
(to fit the corresponding data type) in my SolrJ program?
This response is more of an answer to your earlier message, where you
asked about batch importing, than to this exact question, but this is where
the discussion is, so I'm answering here. You could continue to use DIH
and specify the batches externally. I actually wrote most of this just a
few minutes ago in reply to another email.
You can pass variables into the DIH to specify the range of documents
that you want to work on, and handle the batching externally. Start
with a full-import or a delete/optimize to clear out the index and then
do multiple delta-imports.
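If you go the delete/optimize route, the standard update handler can do
the clearing before the batches start. Something like the following
(same HOST/PORT/CORE placeholders as the URL template further down; the
XML would need to be URL-encoded in a real request):
http://HOST:PORT/solr/CORE/update?stream.body=<delete><query>*:*</query></delete>
http://HOST:PORT/solr/CORE/update?stream.body=<optimize/>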
Here's what I'm using as the queries in my latest iteration. The
deltaImportQuery is identical to the regular query used for
full-import. The deltaQuery is just something related that returns
quickly; the information it returns is thrown away when the
delta-import runs.
query="SELECT * FROM ${dataimporter.request.dataTable} WHERE did >
${dataimporter.request.minDid} AND did <=
${dataimporter.request.maxDid} AND (did %
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})"
deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable} WHERE
did > ${dataimporter.request.minDid} AND did <=
${dataimporter.request.maxDid} AND (did %
${dataimporter.request.numShards}) IN (${dataimporter.request.modVal})">
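The modulus clause is what splits the data across shards. To pick some
made-up numbers for illustration: with numShards=6 and modVal=0,3, the
query selects only documents whose did leaves a remainder of 0 or 3 when
divided by 6, so each shard pulls just its own slice straight from the
database.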
Then here is my URL template:
http://HOST:PORT/solr/CORE/dataimport?command=COMMAND&dataTable=DATATABLE&numShards=NUMSHARDS&modVal=MODVAL&minDid=MINDID&maxDid=MAXDID
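Filled in with made-up values (the core name and command string are the
real ones from the structure below, the host, table and did range are
just for illustration), a single batch request looks like:
http://idxhost:8983/solr/live/dataimport?command=delta-import&commit=true&optimize=false&dataTable=docs&numShards=6&modVal=0,3&minDid=1000000&maxDid=1250000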
And the Perl data structure that holds the replacements for the
uppercase parts:
$urlBits = {
    HOST      => $cfg{'shards/inc.host1'},
    PORT      => $cfg{'shards/inc.port'},
    MODVAL    => $cfg{'shards/inc.modVal'},
    CORE      => "live",
    COMMAND   => "delta-import&commit=true&optimize=false",
    DATATABLE => $cfg{dataTable},
    NUMSHARDS => $cfg{numShards},
    MINDID    => $cfg{maxDid},   # last run's upper bound becomes this run's lower bound
    MAXDID    => $dbMaxDid,      # current MAX(did) in the database
};
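The glue between the template and that structure is just string
substitution followed by an HTTP GET. A minimal sketch (not my exact
script) using LWP::UserAgent, with $urlTemplate holding the template
string shown above:

use LWP::UserAgent;

# fill in each uppercase placeholder with its value
my $url = $urlTemplate;
$url =~ s/\Q$_\E/$urlBits->{$_}/g for keys %$urlBits;

# fire the request and make sure Solr accepted it
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
die "dataimport request failed: " . $response->status_line
    unless $response->is_success;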
Good luck with your setup!
Shawn