Hi Gora,

We are also indexing 4 million+ records from an MS-SQL database - the index size is about 25GB.

I managed to solve both the performance and recovery issues by "segmenting" the indexing process, combined with the CachedSqlEntityProcessor. Basically, I populate a temp table with a subset of the primary keys (I use a modulus of the productId to achieve this) and inner join from that table in both the primary query and all the child queries. As a result, when a segment fails (usually also due to connectivity being interrupted), only that one segment has to be re-done.

Imports are managed by a custom-built service running on the Solr box. It is smart enough to pick up stalled imports when polling /dataimport, and to restart that segment.

With the indexing segmented, each data set becomes small enough for CachedSqlEntityProcessor to load it all into RAM (the box has 8GB). Doing this reduced indexing time from 27 hours to 2.5 hours! (Due to currency changes we need a full re-index every day.)
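In case it is useful, here is a stripped-down sketch of the idea. All table, column, and segment names below (products, prices, seg_keys, the 16 segments) are invented for illustration; ours are different.

  <!-- data-config.xml: run one full-import per segment. The temp table
       is (re)populated before each segment with something like:
         INSERT INTO seg_keys (productId)
         SELECT productId FROM products WHERE productId % 16 = 3
  -->
  <document>
    <entity name="product"
            query="SELECT p.* FROM products p
                   INNER JOIN seg_keys k ON k.productId = p.productId">
      <!-- the child rows for the whole segment are fetched in a single
           query and cached in RAM by CachedSqlEntityProcessor -->
      <entity name="price"
              processor="CachedSqlEntityProcessor"
              query="SELECT pr.productId, pr.price FROM prices pr
                     INNER JOIN seg_keys k ON k.productId = pr.productId"
              where="productId=product.productId"/>
    </entity>
  </document>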
I suspect that latency kills import speed whenever there are child queries involved. Databases are also generally much better at one query returning 300,000 rows than at 100,000 queries returning 2-4 rows each.

The 4GB (actually 3.2GB) limit only applies to the 32-bit versions of Windows/SQL Server. That being said, SQL Server is not much of a RAM hog: beyond its basic querying needs, memory is only used to cache indexes and query plans. SQL Server is pretty happy with 4GB, but if you can upgrade the OS, another 2GB for the disk cache will help a lot.

Regards,
Willem

PS: Are you using the JTDS driver? (http://jtds.sourceforge.net/) I find it faster and more stable than the MS one.
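The dataSource entry for JTDS looks something like the following (the host, port, database name, and credentials are placeholders):

  <dataSource type="JdbcDataSource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://dbhost:1433/yourdb"
              user="solr" password="..."/>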
-----Original Message-----
From: Gora Mohanty [mailto:g...@srijan.in]
Sent: 10 July 2010 03:31 PM
To: solr-user@lucene.apache.org
Subject: Database connections during data import

Hi,

We are indexing a large amount of data into Solr from an MS-SQL database (don't ask!). There are approximately 4 million records, and a total database size of the order of 20GB. There is also a need for incremental updates, but these are only a few % of the total.

After some trial-and-error, things are working great. Indexing is a little slow compared to our original expectations, but this is probably to be expected, given that:
* There are a fair number of queries per record indexed into Solr.
* Only one database server is in use at the moment, and this could well be a bottleneck (please see below).
* The index has many fields, and we are also storing everything in this phase, so that we can recover data directly from the Solr index.
* Transformers are used pretty liberally.
* Finally, we are no longer so concerned about the indexing speed of a single Solr instance: thanks to the possibility of merging indexes, we can simply throw more hardware at the problem.

(Incidentally, a big thank-you to everyone who has contributed to Solr. The above work was way easier than we had feared.)

As a complete indexing run takes about 20h, the process sometimes gets interrupted by a loss of the database connection. I can tell that a loss of connection is the problem from the Solr Tomcat logs, but it is difficult to tell whether it is the database dropping connections (the database server is at 60-70% CPU utilisation, but close to being maxed out at 4GB of RAM, and I am told that MS-SQL/the OS cannot handle more), or a network glitch. What happens is that the logs report a reconnection, but the number of processed records reported by the DataImportHandler at /solr/dataimport?command=full-import stops incrementing, even several hours after the reconnection.

Is there any way to recover from a reconnection, and continue DataImportHandler indexing at the point where the process left off?

Regards,
Gora

P.S. Incidentally, would there be any interest in a GDataRequestHandler for Solr queries, and a GDataResponseWriter? We wrote one in the interest of trying to adhere to a de-facto standard, and can consider contributing these after further testing and cleanup.