Hi,

We are indexing a large amount of data into Solr from an MS-SQL database (don't ask!). There are approximately 4 million records, and a total database size of the order of 20GB. There is also a need for incremental updates, but these are only a few percent of the total.

After some trial and error, things are working great. Indexing is a little slow relative to our original expectations, but this is probably to be expected, given that:

* There are a fair number of queries per record indexed into Solr.
* Only one database server is in use at the moment, and this could well be a bottleneck (please see below).
* The index has many fields, and we are also storing everything in this phase, so that we can recover data directly from the Solr index.
* Transformers are used pretty liberally.
* Finally, we are no longer so concerned about the indexing speed of a single Solr instance, as thanks to the possibility of merging indexes, we can simply throw more hardware at the problem.

(Incidentally, a big thank-you to everyone who has contributed to Solr. The above work was way easier than we had feared.)

As a complete indexing run takes about 20h, the process sometimes gets interrupted due to a loss of the database connection. I can tell from the Solr Tomcat logs that a loss of connection is the problem, but it is difficult to tell whether the cause is the database dropping connections or a network glitch. (The database server is at 60-70% CPU utilisation, but is close to maxing out its 4GB of RAM, and I am told that MS-SQL/the OS cannot handle more RAM.)

What happens is that the logs report a reconnection, but the number of processed records reported by the DataImportHandler at /solr/dataimport?command=full-import stops incrementing, even several hours after the reconnection.

Is there any way to recover from a reconnection, and continue DataImportHandler indexing at the point where the process left off?

Regards,
Gora

P.S. Incidentally, would there be any interest in a GDataRequestHandler for Solr queries, and a GDataResponseWriter? We wrote these in the interests of trying to adhere to a de facto standard, and can consider contributing them after further testing and cleanup.
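
P.P.S. For anyone wanting to reproduce the symptom, here is a rough sketch of how the stall described above could be detected automatically, by polling the DataImportHandler status and checking whether the processed count has stopped moving. It does not recover the import, it only notices the stall. The host, port, polling interval, and idle threshold are assumptions, as is the status response layout (a top-level `status` string and a `Total Documents Processed` entry under `statusMessages`); please check these against what your own instance returns.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

# Assumed URL: adjust host, port, and path to the actual deployment.
STATUS_URL = "http://localhost:8080/solr/dataimport?command=status"


def parse_status(xml_text):
    """Return (status, processed) from a DataImportHandler status response.

    Assumes a top-level <str name="status"> and a <lst name="statusMessages">
    containing <str name="Total Documents Processed">. The count is None
    when that entry is absent.
    """
    root = ET.fromstring(xml_text)
    status = root.findtext("./str[@name='status']")
    processed = root.findtext(
        "./lst[@name='statusMessages']/str[@name='Total Documents Processed']"
    )
    return status, (int(processed) if processed is not None else None)


def is_stalled(samples, max_idle_seconds):
    """Report whether the processed count has been flat for too long.

    samples is a chronological list of (timestamp, processed) pairs.
    """
    if not samples:
        return False
    last_time, last_count = samples[-1]
    run_start = last_time
    # Walk backwards to find where the current unchanged run began.
    for ts, count in reversed(samples):
        if count != last_count:
            break
        run_start = ts
    return last_time - run_start >= max_idle_seconds


def watch(poll_seconds=60, max_idle_seconds=900):
    """Poll the status URL and print a warning when a busy import stalls."""
    samples = []
    while True:
        with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
            status, processed = parse_status(resp.read().decode("utf-8"))
        if status != "busy":
            print("import is not running, status:", status)
            return
        samples.append((time.time(), processed))
        if is_stalled(samples, max_idle_seconds):
            print("processed count stuck at", processed)
            return
        time.sleep(poll_seconds)
```

Calling `watch()` from a cron job or a terminal alongside the import would at least flag the stall within minutes, rather than hours later.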