(I adjusted the subject to better reflect the content of this discussion).

On 2010-09-06 14:37, MitchK wrote:

Thanks for your detailed feedback, Andrzej!

From what I understood, SOLR-1301 becomes obsolete once Solr becomes
cloud-ready, right?

Who knows... I certainly didn't expect this code to become so popular ;) so even after SolrCloud becomes available, it's likely that some people will continue to use it. But SolrCloud should solve the original problem that I tried to solve with this patch.

Looking into the future: eventually, when SolrCloud arrives, we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing schema (e.g. 'md5(docId) % numShards').

Hm, let's say md5(docId) produces a value of 7 (it won't, but let's assume it).
If I have a constant number of shards, the doc will be published to the same
shard again and again.

i.e.: 7 % numShards(5) = 2 ->  the doc will be indexed at shard 2.

A few days later the rest of the cluster becomes available, and now it looks like

7 % numShards(10) = 7 ->  the doc will be indexed at shard 7... and what
about the older version at shard 2? I am no expert when it comes to
cloud computing and the other stuff.

There are several possible solutions to this, and they all boil down to
the way you assign documents to shards... Keep in mind that nodes (physical machines) can manage several shards, and the aggregate collection of all unique shards across all nodes forms your whole index - so there's also a related but different issue of how to assign shards to nodes.

Here are some scenarios for how you can solve the doc-to-shard mapping problem (note: I've left replication out of the picture to make this clearer):

a) keep the number of shards constant no matter how large the cluster is. The mapping schema is then as simple as the one above. In this scenario you create relatively small shards, so that a single physical node can manage dozens of shards (each shard using one core, or perhaps a more lightweight structure like MultiReader). This is also known as micro-sharding. As the number of documents grows, the size of each shard grows too, until you have to reduce the number of shards per node, ultimately ending up with a single shard per node. After that, if your collection continues to grow, you have to modify your hashing schema to split some shards (and reindex them, or use an index splitter tool).
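
A minimal sketch of that fixed-count mapping, in plain Java (the class and
method names here are made up for the example and are not part of any Solr
API):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of shard = md5(docId) % numShards with a constant shard count.
public class SimpleShardMapper {

    private final int numShards;

    public SimpleShardMapper(int numShards) {
        this.numShards = numShards;
    }

    // Hash the unique key and take it modulo the (constant) shard count.
    public int shardFor(String docId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docId.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer before the modulo.
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(numShards)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is part of every JDK
        }
    }

    public static void main(String[] args) {
        // 64 small shards, spread over however many nodes you have.
        SimpleShardMapper mapper = new SimpleShardMapper(64);
        System.out.println("doc-42 -> shard " + mapper.shardFor("doc-42"));
    }
}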

b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net; here's one that is very simple:

http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size) as you see fit, incurring only a small reindexing cost.
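
To make the idea concrete, here is a small self-contained Java sketch of a
consistent hash ring (the names and the choice of MD5 are just for
illustration; this is not necessarily how SolrCloud will implement it):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Shards are placed on a hash ring (with several virtual points each); a
// document goes to the first shard clockwise from its own hash. Adding or
// removing a shard only remaps the documents between the affected points,
// not the whole collection.
public class ConsistentHashRing {

    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();
    private final int pointsPerShard; // virtual points per shard

    public ConsistentHashRing(int pointsPerShard) {
        this.pointsPerShard = pointsPerShard;
    }

    public void addShard(String shardName) {
        for (int i = 0; i < pointsPerShard; i++) {
            ring.put(hash(shardName + "#" + i), shardName);
        }
    }

    public void removeShard(String shardName) {
        for (int i = 0; i < pointsPerShard; i++) {
            ring.remove(hash(shardName + "#" + i));
        }
    }

    // The shard responsible for the given document id.
    public String shardFor(String docId) {
        long h = hash(docId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(point);
    }

    // First 8 bytes of MD5 packed into a long - any stable hash would do.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is part of every JDK
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        for (int i = 0; i < 5; i++) {
            ring.addShard("shard" + i);
        }
        System.out.println("doc-42 before: " + ring.shardFor("doc-42"));
        // Grow the cluster: only roughly 1/6 of the documents change shards.
        ring.addShard("shard5");
        System.out.println("doc-42 after:  " + ring.shardFor("doc-42"));
    }
}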

If you can point me to one reference or another where I can read about it,
it would help me a lot, since at the moment I just want to understand how it
works.

http://wiki.apache.org/solr/SolrCloud ...


The problem with Solr is its lack of documentation in some classes and the
lack of encapsulation of some very complex things in separate methods or
classes. Of course, this is because it costs extra time to do so,
but it makes understanding and modifying things very complicated if you do
not understand what's going on from a theoretical point of view.

In this case the lack of good docs and a user-level API can be blamed on the fact that this functionality is still under heavy development.


Since the cloud feature will be complex, a lack of documentation and no
understanding of the theory behind the code will make contributing back
very, very complicated.

For now, yes, it's an issue - though as soon as SolrCloud gets committed I'm sure people will follow up with user-level convenience components that will make it easier.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
