Thanks for taking the time to respond to this. I have also decided to go down the route of cron-scheduled Perl LWP pings to the DIH plus deltaQueries. This seems to be in line with what the business requires and suits the index size.
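For anyone finding this thread later, here is a rough sketch of what such a cron-driven ping might look like. It is written in Python rather than Perl purely for illustration. The handler URL is an assumption, as is the expectation that the DIH status response (requested with wt=json) carries a top-level "status" field reading "idle" when no import is running; check both against your own setup. The script asks for the status first and skips the run if an import is still in progress, so concurrent delta-imports are never requested.

```python
import json
import urllib.parse
import urllib.request

# Assumed location of the DataImportHandler; adjust to your deployment.
DIH_URL = "http://localhost:8983/solr/dataimport"


def dih_url(command, base=DIH_URL, **params):
    """Build a DIH request URL for a command such as 'status' or 'delta-import'."""
    query = {"command": command, "wt": "json"}
    query.update(params)
    return base + "?" + urllib.parse.urlencode(query)


def is_idle(status_body):
    """Return True if a DIH status response (JSON text) reports 'idle'."""
    return json.loads(status_body).get("status") == "idle"


def http_get(url):
    """Plain HTTP GET returning the response body as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def run_delta_import(fetch=http_get):
    """Trigger a delta-import unless one is already running.

    `fetch` is any callable that takes a URL and returns the body as text.
    Returns True if an import was requested, False if the run was skipped.
    """
    if not is_idle(fetch(dih_url("status"))):
        return False  # still busy; the next cron run will try again
    fetch(dih_url("delta-import"))
    return True
```

A small wrapper script that calls run_delta_import() can then be scheduled from cron, for example with a `*/2 * * * *` entry for a two-minute interval.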
Thanks again

On Oct 7, 2010, at 7:46 AM, Shawn Heisey wrote:

> On 10/6/2010 10:49 AM, Allistair Crossley wrote:
>> Hi,
>>
>> I was interested in gaining some insight into how you guys schedule updates
>> for your Solr index (I have a single index).
>>
>> Right now during development I have added deltaQuery specifications to data
>> import entities to control the number of rows being queried on re-indexes.
>>
>> However in terms of *when* to reindex we have a lot going on in the system -
>> there are 4 sub-systems: custom application data, a CMS, a forum and a blog.
>> It's all being indexed and at any given time there will be users and
>> administrators all updating various parts of the sub-systems.
>>
>> For the time being during development I have been issuing reindexes to the
>> data import handler on each CRUD on any given sub-system. This has been
>> working fine to be honest. It does need to be as immediate as possible - a
>> scheduled update won't work for us. Even every 10 minutes is probably not
>> fast enough.
>>
>> So I wonder what others do. Is anyone else in a similar situation?
>>
>> And what happens if 4 users generate 4 different requests to the data import
>> handler to update for different types of data? The DIH will be running
>> already let's say for request 1, then request 2 comes in - is it rejected?
>> Or is it queued?
>>
>> I need it to be queued and serviced because the request 1 re-index may have
>> already run its queries but missed the data added by the user for request 2.
>> Same then goes for the requests 3 and 4.
>
> I can't say whether the DIH will properly handle concurrent requests or not.
> I figure it's always best to assume that things like this won't work and find
> an elegant way to design around it.
>
> I wrote my build system in perl (using LWP and LWP::Simple), and assumed that
> the DIH would not let me run concurrent delta-imports. We settled on every
> two minutes for our update frequency, and use cron for scheduling.
> Two of my servers (VMs, actually) are a heartbeat cluster running HAProxy for
> load balancing, which I implemented purely for redundancy, not for
> scalability. Whichever host in the heartbeat cluster is online is the one
> that runs the cronjobs.
>
> I have the following processes and schedules:
>
> idxUpdate: Runs every two minutes. This script imports new data, based on an
> autoincrement primary key in the database, the field is DID. From the
> database perspective, changed data looks like new data - it gets its DID
> updated but another unique field (TAG_ID) stays the same. Solr uses TAG_ID
> as its uniqueKey. Updates go into an incremental shard that is relatively
> small - usually less than 1GB and 500,000 documents. At the top of the hour,
> the update includes a call to optimize.
>
> idxDelete: Runs every ten minutes starting at xx:01. This script gets the
> list of newly deleted documents by DID. Then, 1024 of them at a time, it
> queries every shard for this list and issues a delete if they are found.
> After the entire list is complete, it issues a commit to any shard that was
> actually changed. This increases the lifespan of indexSearchers and Solr
> caches. At the top of each hour, it reads the entire list of deletes instead
> of new ones, and trims the delete list to the last 48 hours.
>
> idxRrdUpdate: Runs once an hour. This simply records the current MAX(DID)
> from the database into an RRD database. I keep it in both a counter and a
> gauge. One day I will track other statistical data about my system and make
> it all into pretty graphs.
>
> idxDistribute: Runs once a day. This uses the historical data in the RRD
> database to decide which incremental data is older than one week. Once it
> has that information (a DID range), it distributes those records to each of
> the six static index shards and deletes them from the incremental shard. If
> that process is successful, it updates the stored minimum DID value for the
> incremental.
> Each day, one of the static indexes (currently 13GB and 7.6 million records)
> is optimized.
>
> You might wonder how we deal with the fact that when a record is changed, the
> old one might remain in the index for as long as 11 minutes before the delete
> process finally removes it. We assume that the incremental index, being less
> than 10% of the size of the static indexes, will always respond faster.
> Since the updated copy of the record will always be in the incremental, it
> should respond first to the distributed query and therefore be the one that
> is included in the results. That assumption seems to be correct so far.
>
> Shawn
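
The idxDelete step described above (batches of 1024 DIDs, a delete only where a shard actually holds matches, and a commit only on shards that changed) might be sketched roughly as follows. This is not Shawn's Perl code. It is a Python illustration in which the three callables count_matches, delete_by_query and commit are hypothetical stand-ins for the HTTP requests a real script would send to each shard, and the query syntax is only one plausible way to match a list of DIDs.

```python
def chunks(items, size=1024):
    """Yield successive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def did_query(dids):
    """Build a query string matching any of the given DID values."""
    return "DID:(" + " OR ".join(str(d) for d in dids) + ")"


def delete_from_shards(deleted_dids, shards, count_matches, delete_by_query,
                       commit, batch_size=1024):
    """Remove deleted documents from every shard that contains them.

    count_matches(shard, query) -> number of matching documents (hypothetical)
    delete_by_query(shard, query) -> issues the delete (hypothetical)
    commit(shard) -> commits the shard (hypothetical)
    Returns the set of shards that were changed and committed.
    """
    changed = set()
    for batch in chunks(list(deleted_dids), batch_size):
        query = did_query(batch)
        for shard in shards:
            # Only delete where something matches, so untouched shards
            # keep their searchers and caches.
            if count_matches(shard, query) > 0:
                delete_by_query(shard, query)
                changed.add(shard)
    # Commit once per changed shard, after the whole list is processed.
    for shard in shards:
        if shard in changed:
            commit(shard)
    return changed
```

Deferring the commit until the full list has been processed means each changed shard opens a new searcher at most once per run, which is the cache-lifespan benefit mentioned in the message.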