Re: Strategy for re-indexing

Allistair Crossley Fri, 08 Oct 2010 09:49:32 -0700

Thanks for taking the time to respond to this. I have also decided to go down 
the route of cron-scheduled Perl LWP pings to DIH + deltaQueries. This seems to 
be in line with what the business requires and suits the index size.


Thanks again

On Oct 7, 2010, at 7:46 AM, Shawn Heisey wrote:

> On 10/6/2010 10:49 AM, Allistair Crossley wrote:
>> Hi,
>> 
>> I was interested in gaining some insight into how you guys schedule updates 
>> for your Solr index (I have a single index).
>> 
>> Right now during development I have added deltaQuery specifications to data 
>> import entities to control the number of rows being queried on re-indexes.
>> 
>> However in terms of *when* to reindex we have a lot going on in the system - 
>> there are 4 sub-systems: custom application data, a CMS, a forum and a blog. 
>> It's all being indexed and at any given time there will be users and 
>> administrators all updating various parts of the sub-systems.
>> 
>> For the time being during development I have been issuing reindexes to the 
>> data import handler on each CRUD on any given sub-system. This has been 
>> working fine to be honest. It does need to be as immediate as possible - a 
>> scheduled update won't work for us. Even every 10 minutes is probably not 
>> fast enough.
>> 
>> So I wonder what others do. Is anyone else in a similar situation?
>> 
>> And what happens if 4 users generate 4 different requests to the data import 
>> handler to update for different types of data?  The DIH will be running 
>> already let's say for request 1, then request 2 comes in - is it rejected? 
>> Or is it queued?
>> 
>> I need it to be queued and serviced because the request 1 re-index may have 
>> already run its queries but missed the data added by the user for request 2. 
>> Same then goes for the requests 3 and 4.
> 
> I can't say whether the DIH will properly handle concurrent requests or not.  
> I figure it's always best to assume that things like this won't work and find 
> an elegant way to design around it.
> 
> I wrote my build system in perl (using LWP and LWP::Simple), and assumed that 
> the DIH would not let me run concurrent delta-imports.  We settled on every 
> two minutes for our update frequency, and use cron for scheduling.  Two of my 
> servers (VMs, actually) are a heartbeat cluster running HAProxy for load 
> balancing, which I implemented purely for redundancy, not for scalability.  
> Whichever host in the heartbeat cluster is online is the one that runs the 
> cronjobs.
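
A minimal sketch of the "assume concurrent imports won't work" approach described above: ask DIH for its status first, and only request a delta-import when it reports idle. It is written in Python rather than the Perl used in the thread. The handler URL is hypothetical, and the `status` field with values `idle`/`busy` is assumed to be what your Solr version returns for `command=status`.

```python
from urllib.parse import urlencode

# Hypothetical handler URL; adjust host, port, and core to your setup.
DIH_URL = "http://localhost:8983/solr/dataimport"


def dih_url(command, **params):
    """Build a DIH request URL for the given command."""
    query = {"command": command, "wt": "json"}
    query.update(params)
    return DIH_URL + "?" + urlencode(query)


def is_idle(status_response):
    """Return True when a parsed status response reports no running import."""
    return status_response.get("status") == "idle"


def next_request(status_response):
    """Return the delta-import URL if DIH is idle, else None (skip this run)."""
    if is_idle(status_response):
        return dih_url("delta-import")
    return None
```

A cron job would fetch `dih_url("status")`, parse the JSON, and pass it to `next_request`. A skipped run is harmless because the next scheduled run picks up the same changes.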
> 
> I have the following processes and schedules:
> 
> idxUpdate: Runs every two minutes.  This script imports new data, based on an 
> autoincrement primary key in the database (the DID field).  From the 
> database perspective, changed data looks like new data - it gets its DID 
> updated but another unique field (TAG_ID) stays the same.  Solr uses TAG_ID 
> as its uniqueKey.  Updates go into an incremental shard that is relatively 
> small - usually less than 1GB and 500,000 documents.  At the top of the hour, 
> the update includes a call to optimize.
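
One way the "optimize at the top of the hour" rule could look, as a sketch only. It assumes the update is sent as a DIH delta-import and that the optimize is requested through DIH's `optimize` request parameter. The thread does not say how the optimize call is actually issued, so treat both as assumptions.

```python
from datetime import datetime
from urllib.parse import urlencode


def update_query(now):
    """Build the query string for one scheduled update run.

    Optimize is requested only when the run falls on minute 0 of the hour.
    """
    params = {
        "command": "delta-import",  # assumption: update runs as a delta-import
        "commit": "true",
        "optimize": "true" if now.minute == 0 else "false",
    }
    return urlencode(params)
```

With a two-minute cron schedule aligned to even minutes, exactly one run per hour lands on minute 0.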
> 
> idxDelete: Runs every ten minutes starting at xx:01.  This script gets the 
> list of newly deleted documents by DID.  Then, 1024 of them at a time, it 
> queries every shard for this list and issues a delete if they are found.  
> After the entire list is complete, it issues a commit to any shard that was 
> actually changed.  This increases the lifespan of indexSearchers and Solr 
> caches.  At the top of each hour, it processes the entire delete list instead 
> of only the new entries, and trims the delete list to the last 48 hours.
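
The batching step above can be sketched roughly as follows. The helper names are hypothetical, and the query shape (`DID:(1 OR 2 OR ...)`) is an assumption about how one might look up a batch of IDs, not necessarily what the original script sends.

```python
def batches(ids, size=1024):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def did_query(batch, field="DID"):
    """Build a Solr query matching any document whose field is in the batch."""
    return "%s:(%s)" % (field, " OR ".join(str(d) for d in batch))
```

Each query string would be sent to every shard, followed by a delete on the shards that return hits and a single commit per changed shard at the end. Solr's default `maxBooleanClauses` limit is 1024, which may or may not be the reason for that batch size.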
> 
> idxRrdUpdate: Runs once an hour. This simply records the current MAX(DID) 
> from the database into an RRD database.  I keep it in both a counter and a 
> gauge.  One day I will track other statistical data about my system and make 
> it all into pretty graphs.
> 
> idxDistribute: Runs once a day.  This uses the historical data in the RRD 
> database to decide which incremental data is older than one week.  Once it 
> has that information (a DID range), it distributes those records to each of 
> the six static index shards and deletes them from the incremental shard.  If 
> that process is successful, it updates the stored minimum DID value for the 
> incremental.  Each day, one of the static indexes (currently 13GB and 7.6 
> million records) is optimized.
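
A sketch of how the "older than one week" cutoff might be derived from the recorded MAX(DID) history. The `samples` list of `(timestamp, max_did)` pairs is a hypothetical stand-in for whatever is read back from the RRD database. Because DID is an autoincrement key, any record with a DID at or below the returned value was already present when that sample was taken.

```python
from datetime import datetime, timedelta


def cutoff_did(samples, now, max_age=timedelta(days=7)):
    """Return the MAX(DID) of the newest sample that is at least max_age old.

    Returns None when no sample is old enough yet.
    """
    threshold = now - max_age
    old = [(ts, did) for ts, did in samples if ts <= threshold]
    if not old:
        return None
    # Pick the most recent of the sufficiently old samples.
    return max(old, key=lambda sample: sample[0])[1]
```

Records between the stored minimum DID of the incremental shard and this cutoff would be the ones to move to the static shards.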
> 
> You might wonder how we deal with the fact that when a record is changed, the 
> old one might remain in the index for as long as 11 minutes before the delete 
> process finally removes it.  We assume that the incremental index, being less 
> than 10% of the size of the static indexes, will always respond faster.  
> Since the updated copy of the record will always be in the incremental, it 
> should respond first to the distributed query and therefore be the one that 
> is included in the results.  That assumption seems to be correct so far.
> 
> Shawn
>
