On Mon, 12 Jul 2010 09:20:05 +0200 "Willem Van Riet" <willem.vanr...@sa.24.com> wrote:
> Hi Gora
>
> Also indexing 4mil + records from a MS-SQL database - index size
> is about 25Gb.

Thanks for some great pointers. More detailed responses below.

> I managed to solve both the performance and recovery issue by
> "segmenting" the indexing process along with the
> CachedSqlEntityProcessor.
> Basically I populate a temp table with a subset of primary keys
> (I use a modulus of the productId to achieve this) and inner join
> from that table on both the primary query and all the child
> queries.
[...]

Thanks for that pointer. I had read about the
CachedSqlEntityProcessor, but my eyes must have been glazing over at
that point. That sounds like a great possibility, especially your
point on breaking up the data into chunks small enough to fit into
physical RAM.

We came up with something of a brute-force solution. We discovered
that indexing on each of several cores on a single multi-core Solr
instance was comparably fast to indexing on separate Solr instances.
So, we have broken up our hardware into 15 cores on five Solr
instances (three/instance seems to peg the CPU on each Solr server at
~80%), and two MS-SQL database servers, and seem to be down to about
6 hours for indexing (scaling almost exactly by the number of cores).
Tomorrow, we plan to bring online another five Solr instances, and a
third database server, in order to halve that time. Beyond that, we
are probably going to something like Amazon.

> The 4GB (actually 3.2GB) limit only applies to the 32bit version
> of Windows/SQL Server. That being said SQL server is not much of
> a RAM hog. After its basic querying needs memory is only used to
> cache indexes and query plans. SQL is pretty happy with 4GB but
> if you can upgrade the OS another 2GB for the disk cache will
> help a lot.
[...]

Yes, it turns out that I was (somewhat) unwarrantedly bad-mouthing
Microsoft.
The database server stands up quite well in terms of CPU usage,
though 3-4 Solr DIH instances hitting the DB seem to get up to the
RAM limit almost at once. Unfortunately, upgrading the OS is not an
option at the moment, but the database server is hardly the
bottleneck now.

> PS: You are using the JTDS driver? (http://jtds.sourceforge.net/)
> I find it faster and more stable than the MS one.

Oh, I saw that driver, but did not know that it was better than the
MS one. Thanks for the tip.

Regards,
Gora
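
For reference, the segmentation idea quoted above (a modulus of the
productId selecting a subset of primary keys for each pass) might
look roughly like the sketch below. It is only an illustration in
Python, not the actual setup described in the thread: the function
name segment_keys and the in-memory lists are hypothetical, and in
practice each subset would be written to a temp table and
inner-joined from the primary and child queries.

```python
def segment_keys(product_ids, num_segments):
    """Group primary keys into buckets by productId modulo num_segments.

    Hypothetical helper: each bucket stands in for the subset of keys
    that would populate the temp table for one indexing pass.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    segments = [[] for _ in range(num_segments)]
    for pid in product_ids:
        # Keys with the same remainder land in the same pass.
        segments[pid % num_segments].append(pid)
    return segments


if __name__ == "__main__":
    # Ten example keys split into three passes.
    for index, keys in enumerate(segment_keys(range(10), 3)):
        print(index, keys)
```

Choosing num_segments so that one bucket's rows fit in physical RAM
is the point of the exercise; a failed pass can then be rerun on its
own rather than restarting the whole import.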