Hi,

  We are indexing a large amount of data into Solr from an MS-SQL
database (don't ask!). There are approximately 4 million records,
and a total database size of the order of 20GB. There is also a need
for incremental updates, but these are only a few percent of the
total.

  After some trial and error, things are working great. Indexing is
a little slower than we originally expected, but this is probably
unsurprising, given that:
  * There are a fair number of queries per record indexed into Solr
  * Only one database server is in use at the moment, and this
    could well be a bottleneck (please see below).
  * The index has many fields, and we are also storing everything
    in this phase, so that we can recover data directly from the
    Solr index.
  * Transformers are used pretty liberally
  * Finally, we are no longer so concerned about the indexing speed
    of a single Solr instance, since the possibility of merging
    indexes means that we can simply throw more hardware at the
    problem.
(Incidentally, a big thank-you to everyone who has contributed to
 Solr. The above work was way easier than we had feared.)

As a complete indexing run takes about 20h, the process sometimes
gets interrupted by a loss of the database connection. I can tell
from the Solr Tomcat logs that a lost connection is the problem, but
it is difficult to tell whether the database is dropping connections
(the database server is at 60-70% CPU utilisation, but close to
maxing out its 4GB of RAM, and I am told that MS-SQL/the OS cannot
handle more RAM), or whether it is a network glitch. What happens is
that the logs report a reconnection, but the number of processed
records reported by the DataImportHandler
at /solr/dataimport?command=full-import stops incrementing, even
several hours after the reconnection. Is there any way to recover
from a reconnection, and continue DataImportHandler indexing at the
point where the process left off?
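
For reference, the stall described above can be detected
automatically by polling the DataImportHandler status rather than
watching the counter by hand. The following is only a rough sketch
under stated assumptions: the host and port in STATUS_URL are
hypothetical, and the "Total Documents Processed" field name is
assumed from typical command=status output, so it may need adjusting
for a given Solr version.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical host/port; only the /solr/dataimport path is from the setup above.
STATUS_URL = "http://localhost:8080/solr/dataimport?command=status"


def processed_count(status_xml):
    """Return the processed-document count from a status response, or None.

    Assumes the count is reported in a <str> element whose name attribute
    is "Total Documents Processed".
    """
    root = ET.fromstring(status_xml)
    for node in root.iter("str"):
        if node.get("name") == "Total Documents Processed":
            return int(node.text)
    return None


def is_stalled(counts, window):
    """True if the last `window` samples show no increase in the count."""
    if len(counts) < window:
        return False
    recent = counts[-window:]
    return recent[-1] <= recent[0]


def watch(url=STATUS_URL, interval_s=300, window=6):
    """Poll the status URL until the count stops increasing, then return it."""
    counts = []
    while True:
        with urllib.request.urlopen(url) as resp:
            count = processed_count(resp.read())
        if count is not None:
            counts.append(count)
        if is_stalled(counts, window):
            print("import appears stalled at", counts[-1], "documents")
            return counts[-1]
        time.sleep(interval_s)
```

Calling watch() with the defaults would flag a stall after six
samples (25 minutes) without progress. This only detects the
condition; it does not resume the import.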

Regards,
Gora

P.S. Incidentally, would there be any interest in a
     GDataRequestHandler for Solr queries, and a
     GDataResponseWriter? We wrote these in the interests
     of trying to adhere to a de-facto standard, and can consider
     contributing them after further testing and cleanup.
