Hi Gora,

We're also indexing 4 million+ records from an MS-SQL database; our
index size is about 25GB.

I managed to solve both the performance and the recovery issue by
"segmenting" the indexing process and using the
CachedSqlEntityProcessor.

Basically, I populate a temp table with a subset of the primary keys
(I use a modulus of the productId to achieve this) and inner join
from that table in both the primary query and all the child queries,
as sketched below. As a result, when a segment fails (usually also
due to connectivity being interrupted), only that one segment has to
be re-done. Imports are managed by a custom-built service running on
the Solr box. It's smart enough to pick up stalled imports when
polling the dataimport status and restart that segment.
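
A minimal sketch of the idea in T-SQL (table and column names are
hypothetical; I've also shown a plain staging table rather than a
#temp table, since a #temp table is only visible to the session that
created it):

  -- One segment = every productId with a given remainder.
  -- Segment 3 of 10 here; the import service picks the remainder.
  CREATE TABLE segment_keys (productId INT PRIMARY KEY);

  INSERT INTO segment_keys (productId)
  SELECT productId
  FROM   products
  WHERE  productId % 10 = 3;

  -- The primary query joins against the segment's keys...
  SELECT p.productId, p.name, p.unitPrice
  FROM   products p
         INNER JOIN segment_keys s ON s.productId = p.productId;

  -- ...and so does every child query.
  SELECT c.productId, c.categoryName
  FROM   categories c
         INNER JOIN segment_keys s ON s.productId = c.productId;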

With the indexing segmented, the data sets become small enough for
the CachedSqlEntityProcessor to load them entirely into RAM (the box
has 8GB). Doing this reduced indexing time from 27 hours to 2.5
hours! (Due to currency changes we need a full re-index every day.)
I suspect that round-trip latency kills import speed whenever child
queries are involved; databases are also generally much better at
one query returning 300,000 rows than at 100,000 queries returning
2-4 rows each.
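
For reference, a rough data-config.xml sketch of the cached child
entity (entity and column names are invented; the where= syntax is
the one documented for the 1.4 DataImportHandler):

  <entity name="product"
          query="SELECT p.productId, p.name FROM products p
                 INNER JOIN segment_keys s ON s.productId = p.productId">
    <!-- CachedSqlEntityProcessor runs its query once, holds all
         rows in RAM, and answers the per-parent lookups from that
         cache instead of hitting the database for every product. -->
    <entity name="category"
            processor="CachedSqlEntityProcessor"
            query="SELECT c.productId, c.categoryName FROM categories c
                   INNER JOIN segment_keys s ON s.productId = c.productId"
            where="productId=product.productId"/>
  </entity>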

The 4GB (actually 3.2GB) limit only applies to the 32-bit versions
of Windows/SQL Server. That being said, SQL Server is not much of a
RAM hog: beyond its basic querying needs, memory is only used to
cache indexes and query plans. SQL Server is pretty happy with 4GB,
but if you can upgrade the OS, another 2GB for the disk cache will
help a lot.

Regards,
Willem 

PS: Are you using the jTDS driver (http://jtds.sourceforge.net/)? I
find it faster and more stable than the Microsoft one.
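
Our dataSource definition looks roughly like this (host, database
and credentials are placeholders):

  <dataSource type="JdbcDataSource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://dbhost:1433/products_db"
              user="solr_import" password="..."/>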



-----Original Message-----
From: Gora Mohanty [mailto:g...@srijan.in] 
Sent: 10 July 2010 03:31 PM
To: solr-user@lucene.apache.org
Subject: Database connections during data import

Hi,

  We are indexing a large amount of data into Solr from an MS-SQL
database (don't ask!). There are approximately 4 million records,
and a total database size of the order of 20GB. There is also a need
for incremental updates, but these are only a few percent of the
total.

  After some trial and error, things are working great. Indexing is
a little slower than we had originally hoped, but this is probably
to be expected, given that:
  * There are a fair number of queries per record indexed into Solr
  * Only one database server is in use at the moment, and this
    could well be a bottleneck (please see below).
  * The index has many fields, and we are also storing everything
    in this phase, so that we can recover data directly from the
    Solr index.
  * Transformers are used pretty liberally
  * Finally, we are no longer so concerned about the indexing speed
    of a single Solr instance, as thanks to the possibility of
    merging indexes, we can simply throw more hardware at the
    problem.
(Incidentally, a big thank-you to everyone who has contributed to
 Solr. The above work was way easier than we had feared.)

As a complete indexing run takes about 20h, the process sometimes
gets interrupted due to a loss of the database connection. I can
tell from the Solr Tomcat logs that a loss of connection is the
problem, but it is difficult to tell whether it is the database
dropping connections (the database server is at 60-70% CPU
utilisation, but close to being maxed out at 4GB of RAM, and I am
told that MS-SQL/the OS cannot handle more), or a network glitch.
What happens is that the logs report a reconnection, but the number
of processed records reported by the DataImportHandler
at /solr/dataimport?command=full-import stops incrementing, even
several hours after the reconnection. Is there any way to recover
from a reconnection, and continue DataImportHandler indexing at the
point where the process left off?

Regards,
Gora

P.S. Incidentally, would there be any interest in a
     GDataRequestHandler for Solr queries, and a
     GDataResponseWriter? We wrote one in the interest of trying
     to adhere to a de facto standard, and can consider
     contributing these after further testing and cleanup.
