Re: True master-master fail-over without data gaps (choosing CA in CAP)

Otis Gospodnetic Wed, 09 Mar 2011 09:47:08 -0800

Hi,

 
----- Original Message ----
> From: Walter Underwood <wun...@wunderwood.org>

> On Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
> 
> > You mean it's not possible to have 2 masters that are in nearly real-time
> > sync?
> > How about with DRBD?  I know people use DRBD to keep 2 Hadoop NNs (their
> > edit logs) in sync to avoid the current NN SPOF, for example, so I'm
> > thinking this could be doable with Solr masters, too, no?
> 
> If you add fault-tolerant, you run into the CAP Theorem. Consistency,
> availability, partition: choose two. You cannot have it all.

Right, so I'll take Consistency and Availability, and I'll put my 2 masters in 
the same rack (which has redundant switches, power supply, etc.) and thus 
minimize/avoid partitioning.
Assuming the above actually works, I think my Q remains:

How do you set up 2 Solr masters so they are in near real-time sync?  DRBD?

But here is maybe a simpler scenario that more people may be considering:

Imagine 2 masters on 2 different servers in 1 rack, pointing to the same index 
on the shared storage (SAN) that also happens to live in the same rack.
The 2 Solr masters are behind 1 LB VIP that the indexer talks to.
The VIP is configured so that all requests always get routed to the primary 
master (because only 1 master can be modifying an index at a time), except when 
this primary is down, in which case the requests are sent to the secondary 
master.
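
To make the routing rule above concrete, here is a minimal Java sketch of the active/passive decision.  It is only an illustration: the class name, the host names, and the health-check predicate are all hypothetical and do not correspond to any real load balancer's API or configuration.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the VIP's active/passive rule; not a real LB API.
final class ActivePassiveVip {
    private final String primary;
    private final String secondary;
    // Assumed probe (for example an HTTP ping); returns true if the host looks alive.
    private final Predicate<String> healthCheck;

    ActivePassiveVip(String primary, String secondary, Predicate<String> healthCheck) {
        this.primary = primary;
        this.secondary = secondary;
        this.healthCheck = healthCheck;
    }

    /** Every indexing request goes to the primary unless its health check fails. */
    String route() {
        return healthCheck.test(primary) ? primary : secondary;
    }
}
```

Note that the rule only sees the health check.  It knows nothing about who holds the index lock, which is where the questions below come from.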

So in this case my Q is around automation of this, around Lucene index locks, 
around the need for manual intervention, and such.
Concretely, if you have these 2 master instances, the primary master has the
Lucene index lock in the index dir.  When the secondary master needs to take
over (i.e., when it starts receiving documents via LB), it needs to be able to
write to that same index.  But what if that lock is still around?  One could use
the Native lock to make the lock disappear if the primary master's JVM exited
unexpectedly, and in that case everything *should* work and be completely
transparent, right?  That is, the secondary will start getting new docs, it will
use its IndexWriter to write to that same shared index, which won't be locked
for writes because the lock is gone, and everyone will be happy.  Did I miss
something important here?
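
To illustrate the mechanism I have in mind: the native lock relies on an OS-level file lock obtained through java.nio, which the OS drops when the holding process dies.  Below is a rough, stdlib-only sketch of that idea.  OsWriteLock is a hypothetical helper I made up for this mail, not Lucene's or Solr's actual lock code; only the lock file name "write.lock" is taken from Lucene.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Lucene or Solr.
final class OsWriteLock implements AutoCloseable {
    // Lock files held by this JVM. Checked before opening a channel because on
    // some platforms closing ANY channel on a file drops every lock the JVM
    // holds on that file.
    private static final Set<Path> HELD = ConcurrentHashMap.newKeySet();

    private final Path lockFile;
    private final FileChannel channel;
    private final FileLock lock;

    private OsWriteLock(Path lockFile, FileChannel channel, FileLock lock) {
        this.lockFile = lockFile;
        this.channel = channel;
        this.lock = lock;
    }

    /** Returns the held lock, or null if this JVM or another process holds it. */
    static OsWriteLock tryAcquire(Path indexDir) throws IOException {
        Path lockFile = indexDir.resolve("write.lock").toAbsolutePath().normalize();
        if (!HELD.add(lockFile)) {
            return null; // already held inside this JVM
        }
        FileChannel ch = null;
        FileLock l = null;
        try {
            ch = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            l = ch.tryLock(); // non-blocking; null if another process holds it
        } catch (OverlappingFileLockException e) {
            // Held elsewhere in this JVM, outside this helper; treat as busy.
        } finally {
            if (l == null) {
                HELD.remove(lockFile);
                if (ch != null) {
                    ch.close();
                }
            }
        }
        return l == null ? null : new OsWriteLock(lockFile, ch, l);
    }

    @Override
    public void close() throws IOException {
        try (FileChannel c = channel) {
            lock.release();
        } finally {
            HELD.remove(lockFile);
        }
    }
}
```

The secondary would call something like OsWriteLock.tryAcquire(indexDir) before opening its writer and would only proceed on a non-null result.  One caveat for the SAN case: whether a lock taken on one host is actually seen by the other host depends on the shared file system, so that would need to be verified for the specific storage.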

Assuming the above is correct, what if the lock is *not* gone because the
primary master's JVM is actually not dead, although maybe unresponsive, so the
LB thinks the primary master is dead?  Then the LB will route indexing requests
to the secondary master, which will attempt to write to the index, but be denied
because of the lock.  So a human needs to jump in, remove the lock, and manually
reindex failed docs if the upstream component doesn't buffer docs that failed to
get indexed and doesn't retry indexing them automatically.  Is this correct or
is there a way to avoid humans here?
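
For the "buffer and retry" part, this is roughly what I picture the upstream component doing.  It is a minimal Java sketch under my own assumptions: RetryingIndexer and DocSender are hypothetical names, documents are plain strings, and send() stands in for whatever client call posts one document to the VIP.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical upstream buffer; not an existing Solr client class.
final class RetryingIndexer {
    /** Stand-in for the call that sends one document to the VIP. */
    interface DocSender {
        void send(String doc) throws Exception;
    }

    private final DocSender sender;
    private final Deque<String> pending = new ArrayDeque<>();

    RetryingIndexer(DocSender sender) {
        this.sender = sender;
    }

    /** Queues the document, then tries to drain the queue in order. */
    void index(String doc) {
        pending.addLast(doc);
        flush();
    }

    /**
     * Sends queued documents oldest-first and stops at the first failure,
     * keeping that document and everything behind it for a later attempt.
     * Returns the number of documents still pending.
     */
    int flush() {
        while (!pending.isEmpty()) {
            String next = pending.peekFirst();
            try {
                sender.send(next);
            } catch (Exception e) {
                break; // e.g. the secondary was denied the index lock
            }
            pending.removeFirst();
        }
        return pending.size();
    }
}
```

Something would still have to call flush() periodically while the lock is stuck.  A document can also be sent twice if a send succeeded but the response was lost, which should be harmless as long as docs carry a uniqueKey and a re-add simply overwrites the earlier copy.  The buffer here is in memory only, so it does not survive a crash of the indexer itself.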

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
