Hi,

I'm using DIH to index records from a database. After every update on
(MySQL) DB, Solr DIH is invoked for delta import.  In my tests, I have
observed that if db updates and DIH import is happening concurrently, import
misses few records.

Here is how it happens.

The table has a column 'lastUpdated' which has default value of current
timestamp. Many records are added to database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If Solr is import gets triggered at '18:20:29', it will use a timestamp of
last import for delta query. This import will not see the records added in
the aforementioned transaction as transaction was not committed at that
instant. After this import, the dataimport.properties will have last index
time as '18:20:29'.  The next import will not able to get all the rows of
previously referred trasaction as some of the rows have timestamp earlier
than '18:20:29'.

While I am testing extreme conditions, there is a possibility of missing out
on some data.

I could not find any solution in Solr framework to handle this. The table
has an auto increment key, all updates are deletes followed by inserts. So,
having last_indexed_id would have helped, where last_indexed_id is the max
value of id fetched in that import. The query would then become "Select id
where id>last_indexed_id.' I suppose, Solr does not have any provision like
this.

Two options I could think of are:
(a) Ensure at application level that there are no concurrent DB updates and
DIH import requests going concurrently.
(b) Use exclusive locking during DB update

What is the best way to address this problem?

Thank you,

--shashi

Reply via email to