Re: Concurrent DB updates and delta import misses few records

2010-09-27 Thread Shawn Heisey
 You could get it from Solr, yes.  That didn't even occur to me because 
when I was designing my scripts, I didn't yet have a fully integrated 
Solr index. :)  With hindsight, I still wouldn't get it from Solr.  I 
would lose some flexibility and ease of administration.


It's certainly possible to store all build-related tracking information 
in the database.  The build system for our old search product did it 
that way.  I decided to go with simple text files in an NFS-mounted 
directory for the rewrite.  It's easier for me to administer, just ssh 
to a server and examine or modify simple one-line text files.  On the 
script side, the files get read into a Perl hash.  With the old system, 
I found it cumbersome to go through the database interfaces.  The only 
thing that's still in the database is the delete table, because it is 
populated by triggers on the metadata table.
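The one-line-file approach described above can be sketched in Python (the actual scripts are Perl; the directory layout and file names, such as a one-line minDid file, are my assumptions for illustration):

```python
import os

def read_tracking(dir_path):
    """Read each one-line tracking file in dir_path into a dict,
    keyed by filename -- analogous to reading the files into a Perl hash.
    Example result: {'minDid': '1048576'}."""
    tracking = {}
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if os.path.isfile(path):
            with open(path) as f:
                tracking[name] = f.readline().strip()
    return tracking

def write_tracking(dir_path, name, value):
    # Overwrite the one-line file so the next run picks up the new value.
    with open(os.path.join(dir_path, name), "w") as f:
        f.write(str(value) + "\n")
```

The administrative appeal is exactly what the message describes: any of these values can be inspected or changed over ssh with cat and echo, no database client needed.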





On 9/23/2010 12:48 AM, Shashikant Kore wrote:

Thanks for the pointer, Shawn.  It definitely is useful.

I am wondering if you could retrieve minDid from Solr rather than
storing it externally. The max id from the Solr index and the max id from
the DB should define the lower and upper thresholds, respectively, of the
delta range. Am I missing something?




RE: Concurrent DB updates and delta import misses few records

2010-09-26 Thread Ephraim Ofir
You could store the last indexed ID in the DB.  Implement the delta
import as a stored procedure that saves the last imported ID in the DB.
On subsequent delta imports, use the deltaQuery to get that ID from the
DB and use it in the deltaImportQuery.
See
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b0174c...@icq-mail.icq.il.office.aol.com%3e
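A rough illustration of this suggestion, using sqlite3 as a stand-in for the real database and plain queries in place of a stored procedure; the table and column names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Hypothetical schema: a documents table plus a one-row tracking table
# that plays the role of the stored procedure's saved "last imported ID".
cur.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
cur.execute("CREATE TABLE dih_tracking (last_id INTEGER)")
cur.execute("INSERT INTO dih_tracking VALUES (0)")
cur.executemany("INSERT INTO docs (body) VALUES (?)", [("a",), ("b",), ("c",)])

# deltaQuery equivalent: fetch the last indexed ID from the DB.
last_id = cur.execute("SELECT last_id FROM dih_tracking").fetchone()[0]

# deltaImportQuery equivalent: select only rows beyond that ID.
rows = cur.execute(
    "SELECT id, body FROM docs WHERE id > ?", (last_id,)).fetchall()

# After a successful import, persist the new high-water mark.
if rows:
    cur.execute("UPDATE dih_tracking SET last_id = ?", (rows[-1][0],))
conn.commit()
```

Because the high-water mark is an ID rather than a timestamp, rows committed late by a long transaction are still picked up, which sidesteps the timestamp race described in this thread.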


Ephraim Ofir


-----Original Message-----
From: Shashikant Kore [mailto:shashik...@gmail.com] 
Sent: Thursday, September 23, 2010 8:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Concurrent DB updates and delta import misses few records





Re: Concurrent DB updates and delta import misses few records

2010-09-23 Thread Shashikant Kore
Thanks for the pointer, Shawn.  It definitely is useful.

I am wondering if you could retrieve minDid from Solr rather than
storing it externally. The max id from the Solr index and the max id from
the DB should define the lower and upper thresholds, respectively, of the
delta range. Am I missing something?

--shashi





Re: Concurrent DB updates and delta import misses few records

2010-09-22 Thread Shawn Heisey

 On 9/22/2010 1:39 AM, Shashikant Kore wrote:

Hi,

I'm using DIH to index records from a database. After every update on the
(MySQL) DB, Solr DIH is invoked for a delta import.  In my tests, I have
observed that if DB updates and a DIH import happen concurrently, the import
misses a few records.

Here is how it happens.

The table has a column 'lastUpdated' which has a default value of the current
timestamp. Many records are added to the database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after the transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If the Solr import gets triggered at '18:21:29', it will use the timestamp of
the last import for the delta query. This import will not see the records added
in the aforementioned transaction, as the transaction was not committed at that
instant. After this import, dataimport.properties will have a last index
time of '18:21:29'.  The next import will not be able to get all the rows of the
previously referred transaction, as some of the rows have timestamps earlier
than '18:21:29'.

While I am testing extreme conditions, there is a possibility of missing out
on some data.

I could not find any solution in the Solr framework to handle this. The table
has an auto-increment key, and all updates are deletes followed by inserts. So,
having a last_indexed_id would have helped, where last_indexed_id is the max
value of id fetched in that import. The query would then become 'SELECT id
WHERE id > last_indexed_id'. I suppose Solr does not have any provision like
this.

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests
do not run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?


Shashi,

I was not solving the same problem, but perhaps you can adapt my 
solution to yours.  My main problem was that I don't have a modified 
date in my database, and due to the size of the table, it is impractical 
to add one.  Instead, I chose to track the database primary key (a 
simple autoincrement) outside of Solr and pass min/max values into DIH 
for it to use in the SELECT statement.  You can see a simplified version 
of my entity here, with a URL showing how to send the parameters in via 
the dataimport GET:


http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html

The update script that runs every two minutes gets MAX(did) from the 
database, retrieves the minDid from a file on an NFS share, and runs a 
delta-import with those two values.  When the import is reported 
successful, it writes the maxDid value to the minDid file on the network 
share for the next run.  If the import fails, it sends an alarm and 
doesn't update the minDid.


Shawn