You could store the last indexed ID in the DB.  Implement the delta
import as a stored procedure that saves the last imported ID in the DB.
On subsequent delta imports, use the deltaQuery to get that ID from the
DB and use it in the deltaImportQuery
See
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3
c9f8b39cb3b7c6d4594293ea29ccf438b0174c...@icq-mail.icq.il.office.aol.com
%3e


Ephraim Ofir


-----Original Message-----
From: Shashikant Kore [mailto:shashik...@gmail.com] 
Sent: Thursday, September 23, 2010 8:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Concurrent DB updates and delta import misses few records

Thanks for the pointer, Shawn.  It, definitely, is useful.

I am wondering if you could retrieve minDid from the solr rather than
storing it externally. Max id from Solr index and max id from DB should
define the lower and upper thresholds, respectively, of the delta range.
Am
I missing something?

--shashi

On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey <s...@elyograg.org> wrote:

>  On 9/22/2010 1:39 AM, Shashikant Kore wrote:
>
>> Hi,
>>
>> I'm using DIH to index records from a database. After every update on
>> (MySQL) DB, Solr DIH is invoked for delta import.  In my tests, I
have
>> observed that if db updates and DIH import is happening concurrently,
>> import
>> misses few records.
>>
>> Here is how it happens.
>>
>> The table has a column 'lastUpdated' which has default value of
current
>> timestamp. Many records are added to database in a single transaction
that
>> takes several seconds. For example, if 10,000 rows are being
inserted, the
>> rows may get timestamp values from '2010-09-20 18:21:20' to
'2010-09-20
>> 18:21:26'. These rows become visible only after transaction is
committed.
>> That happens at, say, '2010-09-20 18:21:30'.
>>
>> If Solr is import gets triggered at '18:20:29', it will use a
timestamp of
>> last import for delta query. This import will not see the records
added in
>> the aforementioned transaction as transaction was not committed at
that
>> instant. After this import, the dataimport.properties will have last
index
>> time as '18:20:29'.  The next import will not able to get all the
rows of
>> previously referred trasaction as some of the rows have timestamp
earlier
>> than '18:20:29'.
>>
>> While I am testing extreme conditions, there is a possibility of
missing
>> out
>> on some data.
>>
>> I could not find any solution in Solr framework to handle this. The
table
>> has an auto increment key, all updates are deletes followed by
inserts.
>> So,
>> having last_indexed_id would have helped, where last_indexed_id is
the max
>> value of id fetched in that import. The query would then become
"Select id
>> where id>last_indexed_id.' I suppose, Solr does not have any
provision
>> like
>> this.
>>
>> Two options I could think of are:
>> (a) Ensure at application level that there are no concurrent DB
updates
>> and
>> DIH import requests going concurrently.
>> (b) Use exclusive locking during DB update
>>
>> What is the best way to address this problem?
>>
>
> Shashi,
>
> I was not solving the same problem, but perhaps you can adapt my
solution
> to yours.  My main problem was that I don't have a modified date in my
> database, and due to the size of the table, it is impractical to add
one.
>  Instead, I chose to track the database primary key (a simple
autoincrement)
> outside of Solr and pass min/max values into DIH for it to use in the
SELECT
> statement.  You can see a simplified version of my entity here, with a
URL
> showing how to send the parameters in via the dataimport GET:
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
>
> The update script that runs every two minutes gets MAX(did) from the
> database, retrieves the minDid from a file on an NFS share, and runs a
> delta-import with those two values.  When the import is reported
successful,
> it writes the maxDid value to the minDid file on the network share for
the
> next run.  If the import fails, it sends an alarm and doesn't update
the
> minDid.
>
> Shawn
>
>

Reply via email to