Hi, I'm using DIH to index records from a database. After every update on (MySQL) DB, Solr DIH is invoked for delta import. In my tests, I have observed that if db updates and DIH import is happening concurrently, import misses few records.
Here is how it happens. The table has a column 'lastUpdated' which has default value of current timestamp. Many records are added to database in a single transaction that takes several seconds. For example, if 10,000 rows are being inserted, the rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20 18:21:26'. These rows become visible only after transaction is committed. That happens at, say, '2010-09-20 18:21:30'. If Solr is import gets triggered at '18:20:29', it will use a timestamp of last import for delta query. This import will not see the records added in the aforementioned transaction as transaction was not committed at that instant. After this import, the dataimport.properties will have last index time as '18:20:29'. The next import will not able to get all the rows of previously referred trasaction as some of the rows have timestamp earlier than '18:20:29'. While I am testing extreme conditions, there is a possibility of missing out on some data. I could not find any solution in Solr framework to handle this. The table has an auto increment key, all updates are deletes followed by inserts. So, having last_indexed_id would have helped, where last_indexed_id is the max value of id fetched in that import. The query would then become "Select id where id>last_indexed_id.' I suppose, Solr does not have any provision like this. Two options I could think of are: (a) Ensure at application level that there are no concurrent DB updates and DIH import requests going concurrently. (b) Use exclusive locking during DB update What is the best way to address this problem? Thank you, --shashi