On Mon, 12 Jul 2010 09:20:05 +0200
"Willem Van Riet" <willem.vanr...@sa.24.com> wrote:

> Hi Gora
> 
> Also indexing 4mil + records from a MS-SQL database - index size
> is about 25Gb.

Thanks for some great pointers. More detailed responses below.

> I managed to solve both the performance and recovery issue by
> "segmenting" the indexing process along with the
> CachedSqlEntityProcessor. 

> Basically I populate a temp table with a subset of primary keys
> (I use a modulus of the productId to achieve this) and inner join
> from that table on both the primary query and all the child
> queries.
[...]

Thanks for that pointer. I had read about the
CachedSqlEntityProcessor, but my eyes must have been glazing over
at that point. That sounds like a great possibility, especially
your point on breaking up the data into chunks small enough to fit
into physical RAM.
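
To make the segmenting idea concrete for myself, here is a minimal
sketch in Python of splitting keys by a modulus of productId. It is
only an illustration under assumptions, not Willem's actual setup: the
function names and the default segment count are hypothetical, and the
real version would populate a temp table in the database rather than
build lists in memory.

```python
def segment_keys(product_ids, num_segments):
    """Group primary keys into buckets by productId % num_segments.

    Hypothetical helper: each bucket stands in for one temp-table load.
    """
    segments = [[] for _ in range(num_segments)]
    for pid in product_ids:
        segments[pid % num_segments].append(pid)
    return segments


def segment_filter_sql(segment, num_segments, key_column="productId"):
    """Build the WHERE fragment that selects a single segment.

    Hypothetical helper: the fragment would be used to fill the temp
    table that the primary and child queries inner join against.
    """
    return f"{key_column} % {num_segments} = {segment}"


if __name__ == "__main__":
    for i, keys in enumerate(segment_keys(range(10), 3)):
        print(segment_filter_sql(i, 3), "->", keys)
```

Each segment can then be indexed in its own run, so a failure only
costs one segment, and the segment count can be chosen so that the
cached child rows fit into physical RAM.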

We came up with something of a brute-force solution. We discovered
that indexing on each of several cores of a single multi-core Solr
instance was about as fast as indexing on separate Solr
instances. So, we have broken up our hardware into 15 cores on five
Solr instances (three/instance seems to peg the CPU on each Solr
server at ~80%), and two MS-SQL database servers, and seem to be
down to about 6 hours for indexing (scaling almost exactly by the
number of cores). Tomorrow, we plan to bring online another five
Solr instances, and a third database server, in order to halve that
time. Beyond that, we will probably move to something like Amazon.
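
The halving estimate is just linear scaling in the number of cores.
As a rough sketch in Python (the function name is hypothetical, and
it assumes the scaling stays linear, which we have only observed up
to 15 cores):

```python
def estimated_hours(current_hours, current_cores, new_cores):
    """Estimate indexing time, assuming it scales inversely with cores."""
    return current_hours * current_cores / new_cores


if __name__ == "__main__":
    # 15 cores take about 6 hours today; five more instances at
    # three cores each would bring the total to 30 cores.
    print(estimated_hours(6, 15, 30))
```

In practice the database servers or the network may become the limit
before the Solr cores do, so this is an optimistic bound.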

> The 4GB (actually 3.2GB) limit only applies to the 32bit version
> of Windows/SQL Server. That being said SQL server is not much of
> a RAM hog. After its basic querying needs memory is only used to
> cache indexes and query plans. SQL is pretty happy with 4GB but
> if you can upgrade the OS another 2GB for the disk cache will
> help a lot. 
[...]

Yes, it turns out that I was (somewhat) unwarrantedly bad-mouthing
Microsoft. The database server stands up quite well in terms of CPU
usage, though 3-4 Solr DIH instances hitting the DB seem to get up
to the RAM limit almost at once. Unfortunately, upgrading the OS is
not an option at the moment, but the database server is hardly the
bottleneck now.

> PS: You are using the JTDS driver? (http://jtds.sourceforge.net/)
> I find it faster and more stable than the MS one.

Oh, I saw that driver, but did not know that it was better than the
MS one. Thanks for the tip.

Regards,
Gora