For what it's worth, it's also really easy to implement your own EntityProcessor. Extend EntityProcessorBase, then implement the getNext method to return a Map<String, Object> representing the row you want indexed. I did exactly this so I could reuse my Hibernate domain models to query for the data instead of SQL.
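A rough sketch of that pattern is below. It assumes the Solr 1.4-era DIH API, where EntityProcessorBase exposes a protected rowIterator field and a getNext() helper and the method you override is nextRow(); the Hibernate query is stubbed out here with hard-coded values, so the class name and field names are just placeholders:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.EntityProcessorBase;

    public class HibernateEntityProcessor extends EntityProcessorBase {

      @Override
      public void init(Context context) {
        super.init(context);
        // Attributes declared on the <entity> in data-config.xml are
        // available here via context.getEntityAttribute(...) if needed.
      }

      @Override
      public Map<String, Object> nextRow() {
        if (rowIterator == null) {
          // First call: materialize all rows. In practice these values would
          // come from a Hibernate query over the domain model rather than
          // being hard-coded.
          List<Map<String, Object>> rows = new ArrayList<Map<String, Object>>();
          Map<String, Object> row = new HashMap<String, Object>();
          row.put("id", 1);
          row.put("title", "example document");
          rows.add(row);
          rowIterator = rows.iterator();
        }
        // Inherited helper: returns the next row from rowIterator, or null
        // when it is exhausted, which tells DIH this entity is finished.
        return getNext();
      }
    }

Each returned Map is one document row; keys are matched against the fields defined for the entity in data-config.xml.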
Brendan

On Apr 8, 2010, at 9:17 AM, Shawn Heisey wrote:

> On 4/7/2010 9:26 PM, bbarani wrote:
>> Hi,
>>
>> I am currently using DIH to index the data from a database. I am just
>> trying to figure out if there are any other open source tools I can use
>> just for indexing, and use SOLR for querying.
>>
>> I also thought of writing custom code for retrieving the data from the
>> database and using SOLRJ to add the data as documents into Lucene. One
>> doubt here: if I use custom code to retrieve the data and SOLRJ to commit
>> it, will the schema file still be used? I mean the field types /
>> analyzers / tokenizers etc. present in the schema file? Or do I need to
>> massage each value (to fit the corresponding data type) in my SOLRJ
>> program?
>>
>
> This response is more of an answer to your earlier message, where you asked
> about batch importing, than to this exact question, but this is where the
> discussion is, so I'm answering here. You could continue to use DIH and
> specify the batches externally. I actually wrote most of this in reply to
> another email just a few minutes ago.
>
> You can pass variables into DIH to specify the range of documents you want
> to work on and handle the batching externally. Start with a full-import or
> a delete/optimize to clear out the index, then do multiple delta-imports.
>
> Here's what I'm using as the queries in my latest iteration. The
> deltaImportQuery is identical to the regular query used for full-import.
> The deltaQuery is just something related that returns quickly; the
> information it returns is thrown away when the delta-import runs.
>
> query="SELECT * FROM ${dataimporter.request.dataTable} WHERE did >
>   ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid}
>   AND (did % ${dataimporter.request.numShards}) IN
>   (${dataimporter.request.modVal})"
>
> deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
>
> deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable} WHERE did >
>   ${dataimporter.request.minDid} AND did <= ${dataimporter.request.maxDid}
>   AND (did % ${dataimporter.request.numShards}) IN
>   (${dataimporter.request.modVal})"
>
> Then here is my URL template:
>
> http://HOST:PORT/solr/CORE/dataimport?command=COMMAND&dataTable=DATATABLE&numShards=NUMSHARDS&modVal=MODVAL&minDid=MINDID&maxDid=MAXDID
>
> And the Perl data structure that holds the replacements for the uppercase
> parts:
>
> $urlBits = {
>     HOST      => $cfg{'shards/inc.host1'},
>     PORT      => $cfg{'shards/inc.port'},
>     MODVAL    => $cfg{'shards/inc.modVal'},
>     CORE      => "live",
>     COMMAND   => "delta-import&commit=true&optimize=false",
>     DATATABLE => $cfg{dataTable},
>     NUMSHARDS => $cfg{numShards},
>     MINDID    => $cfg{maxDid},
>     MAXDID    => $dbMaxDid,
> };
>
> Good luck with your setup!
>
> Shawn
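For context, those query/deltaQuery/deltaImportQuery attributes live on an <entity> element in the DIH data-config.xml. A minimal sketch of how they might fit together follows; the dataSource settings, entity name, and pk below are placeholders, not Shawn's actual configuration:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://dbhost/dbname"
                  user="dbuser" password="dbpass"/>
      <document>
        <entity name="doc" pk="did"
                query="SELECT * FROM ${dataimporter.request.dataTable}
                       WHERE did &gt; ${dataimporter.request.minDid}
                         AND did &lt;= ${dataimporter.request.maxDid}
                         AND (did % ${dataimporter.request.numShards})
                             IN (${dataimporter.request.modVal})"
                deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataTable}"
                deltaImportQuery="SELECT * FROM ${dataimporter.request.dataTable}
                       WHERE did &gt; ${dataimporter.request.minDid}
                         AND did &lt;= ${dataimporter.request.maxDid}
                         AND (did % ${dataimporter.request.numShards})
                             IN (${dataimporter.request.modVal})"/>
        <!-- With SELECT *, columns whose names match schema fields are mapped
             automatically; explicit <field column=".." name=".."/> elements
             are only needed when names differ. -->
      </document>
    </dataConfig>

The ${dataimporter.request.*} placeholders are filled from the request parameters in the dataimport URL, which is what lets the batching be driven externally as in the URL template above.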