Thanks, Upayavira and Shawn, for sharing what is happening internally.  Your 
points (cluster state explosion, segment per commit, Solr/Lucene split) are 
well taken.

Wishful thinking aside, my gut instinct is that such a scheme would cause 
Solr's stellar indexing speed to drop dramatically, down to MongoDB's level 
(indexing speed is not Mongo's strong point) ...   :-)
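
For concreteness, here is a minimal Java sketch of the two-level chunk scheme 
I described below.  The class names (Chunk, ClusterState) and all details are 
made up for illustration; this is not MongoDB's actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical illustration only -- not MongoDB's real data structures.
    class Chunk {
        final long hashMin, hashMax;   // half-open hash range [hashMin, hashMax)
        String shard;                  // the shard currently holding this chunk

        Chunk(long hashMin, long hashMax, String shard) {
            this.hashMin = hashMin;
            this.hashMax = hashMax;
            this.shard = shard;
        }

        boolean covers(long h) {
            return h >= hashMin && h < hashMax;
        }
    }

    class ClusterState {
        private final List<Chunk> chunks = new ArrayList<>();

        void add(Chunk c) { chunks.add(c); }

        // Routing a document consults only chunk metadata in the cluster state.
        String shardFor(long docHash) {
            for (Chunk c : chunks) {
                if (c.covers(docHash)) return c.shard;
            }
            throw new IllegalStateException("no chunk covers hash " + docHash);
        }

        // Rebalancing works on whole chunks: update the metadata, bulk-copy
        // the chunk's data, done.  No per-document I/O is needed.
        void moveChunk(Chunk c, String newShard) { c.shard = newShard; }
    }

The point being that routing and rebalancing consult only chunk metadata; no 
per-document I/O happens until a whole chunk is bulk-copied.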

-----Original Message-----
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: Tuesday, June 30, 2015 2:46 PM
To: solr-user@lucene.apache.org
Subject: Re: optimize status



On Tue, Jun 30, 2015, at 04:42 PM, Shawn Heisey wrote:
> On 6/29/2015 2:48 PM, Reitzel, Charles wrote:
> > I take your point about shards and segments being different things.  I 
> > understand that the hash ranges per segment are not kept in ZK.   I guess I 
> > wish they were.
> >
> > In this regard, I liked that MongoDB uses a two-level sharding scheme.
> > Each shard manages a list of "chunks", each of which has its own hash
> > range kept in the cluster state.  If data needs to be balanced across
> > nodes, it works at the chunk level; no record/doc-level I/O is
> > necessary.  It is much more targeted, and only the data that needs to
> > move is touched.  Solr does most things better than Mongo, imo, but
> > this is one area where Mongo got it right.
> 
> Segment detail would not only lead to a data explosion in the 
> clusterstate, it would be crossing abstraction boundaries, and would 
> potentially require updating the clusterstate just because a single 
> document was inserted into the index.  That one tiny update could (and 
> probably would) create a new segment on one shard.  Due to the way 
> SolrCloud replicates data during normal operation, every replica for a 
> given shard might have a different set of segments, which means 
> segments would need to be tracked at the replica level, not the shard level.
> 
> Also, Solr cannot control which hash ranges end up in each segment. 
> Solr only knows about the index as a whole ... implementation details 
> like segments are left entirely up to Lucene, and although I admit to 
> not knowing Lucene internals very well, I don't think Lucene offers 
> any way to control that either.  You mention that MongoDB dictates 
> which hash ranges end up in each chunk.  That implies that MongoDB can 
> control each chunk.  If we move the analogy to Solr, it breaks down 
> because Solr cannot control segments.  Although Solr does have several 
> configuration knobs that affect how segments are created, those 
> configurations are simply passed through to Lucene; Solr itself does 
> not use that information.
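
As a rough sketch of that pass-through, assuming Lucene's standard 
IndexWriterConfig and TieredMergePolicy API (and Lucene 8+ for 
ByteBuffersDirectory): the segment-related knobs Solr exposes end up on a 
Lucene merge policy roughly like this, and nothing in that API lets the 
caller choose which hash ranges a segment will hold.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class PassThroughKnobs {
        public static void main(String[] args) throws Exception {
            // Settings like Solr's segmentsPerTier / maxMergedSegmentMB are
            // simply forwarded to a Lucene MergePolicy.
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setSegmentsPerTier(10.0);        // target segments per size tier
            mp.setMaxMergedSegmentMB(5000.0);   // cap on merged segment size

            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            iwc.setMergePolicy(mp);
            iwc.setRAMBufferSizeMB(100.0);      // cf. Solr's ramBufferSizeMB

            // Nowhere above (or anywhere else in the API) can we say "segment
            // N holds hash range X..Y" -- segment contents are Lucene's business.
            IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), iwc);
            writer.close();
        }
    }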

To put it more specifically: when a (hard) commit happens, all of the 
documents in that commit are written into a new segment, so which segment a 
document lands in has nothing to do with its hash range. A segment can never 
be edited. When there are too many, segments are merged into a new one and 
the originals are deleted. So there is no way for Solr/Lucene to insert a 
document into anything other than a brand-new segment.
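
Here is a quick sketch of that behavior against Lucene directly (assuming 
Lucene 8+ for ByteBuffersDirectory, and that the default merge policy has 
not merged anything for so few segments): each commit flushes the buffered 
documents into a fresh segment, visible as one more leaf in a reader.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class SegmentPerCommit {
        public static void main(String[] args) throws Exception {
            ByteBuffersDirectory dir = new ByteBuffersDirectory();
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

            for (int commit = 1; commit <= 3; commit++) {
                Document doc = new Document();
                doc.add(new StringField("id", "doc-" + commit, Field.Store.YES));
                writer.addDocument(doc);
                writer.commit();  // flushes buffered docs into a brand-new segment

                try (DirectoryReader reader = DirectoryReader.open(dir)) {
                    // Each leaf is one segment; absent merges, the count
                    // grows by one with every commit.
                    System.out.println("after commit " + commit + ": "
                            + reader.leaves().size() + " segment(s)");
                }
            }
            writer.close();
        }
    }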

Hence, the idea of using a second level of sharding at the segment level does 
not fit with how a Lucene index is structured.

Upayavira
