(I adjusted the subject to better reflect the content of this discussion).

On 2010-09-06 14:37, MitchK wrote:

Thanks for your detailed feedback, Andrzej!

From what I understood, SOLR-1301 becomes obsolete once Solr becomes
cloud-ready, right?

Who knows... I certainly didn't expect this code to become so popular ;) so even after SolrCloud becomes available, it's likely that some people will continue to use it. But SolrCloud should solve the original problem that I tried to solve with this patch.

Looking into the future: eventually, when SolrCloud arrives, we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing schema (e.g. 'md5(docId) % numShards').

Hm, let's say md5(docId) produces a value of 7 (it won't, but let's assume it).
If I have a constant number of shards, the doc will be published to the same
shard again and again.

i.e.: 7 % numShards(5) = 2 ->  the doc will be indexed at shard 2.

A few days later the rest of the cluster becomes available, and now it looks like

7 % numShards(10) = 7 ->  the doc will be indexed at shard 7... and what
about the older version at shard 2? I am no expert when it comes to
cloud computing and the other stuff.

There are several possible solutions to this, and they all boil down to
the way you assign documents to shards... Keep in mind that nodes (physical machines) can manage several shards, and the aggregate collection of all unique shards across all nodes forms your whole index - so there's also a related but different issue of how to assign shards to nodes.

Here are some scenarios for how you can solve the doc-to-shard mapping problem (note: I've left replication out of the picture to make this clearer):

a) keep the number of shards constant no matter how large the cluster is. The mapping schema is then as simple as the one above. In this scenario you create relatively small shards, so that a single physical node can manage dozens of shards (each shard using one core, or perhaps a more lightweight structure like MultiReader). This is also known as micro-sharding. As the number of documents grows, the size of each shard grows too, until you have to reduce the number of shards per node, ultimately ending up with a single shard per node. After that, if your collection continues to grow, you have to modify your hashing schema to split some shards (and reindex them, or use an index splitter tool).
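
A minimal sketch of that fixed-count mapping, in plain Java (the class and
method names here are made up for the example and are not part of any Solr
API):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of shard = md5(docId) % numShards with a constant shard count.
public class SimpleShardMapper {

    private final int numShards;

    public SimpleShardMapper(int numShards) {
        this.numShards = numShards;
    }

    // Hash the unique key and take it modulo the (constant) shard count.
    public int shardFor(String docId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(docId.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer before the modulo.
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(numShards)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is part of every JDK
        }
    }

    public static void main(String[] args) {
        // 64 small shards, spread over however many nodes you have.
        SimpleShardMapper mapper = new SimpleShardMapper(64);
        System.out.println("doc-42 -> shard " + mapper.shardFor("doc-42"));
    }
}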

b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net; here's one that is very simple:

http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size) as you see fit, incurring only a small reindexing cost.
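
To make the idea concrete, here is a small self-contained Java sketch of a
consistent hash ring (the names and the choice of MD5 are just for
illustration; this is not necessarily how SolrCloud will implement it):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Shards are placed on a hash ring (with several virtual points each); a
// document goes to the first shard clockwise from its own hash. Adding or
// removing a shard only remaps the documents between the affected points,
// not the whole collection.
public class ConsistentHashRing {

    private final SortedMap<Long, String> ring = new TreeMap<Long, String>();
    private final int pointsPerShard; // virtual points per shard

    public ConsistentHashRing(int pointsPerShard) {
        this.pointsPerShard = pointsPerShard;
    }

    public void addShard(String shardName) {
        for (int i = 0; i < pointsPerShard; i++) {
            ring.put(hash(shardName + "#" + i), shardName);
        }
    }

    public void removeShard(String shardName) {
        for (int i = 0; i < pointsPerShard; i++) {
            ring.remove(hash(shardName + "#" + i));
        }
    }

    // The shard responsible for the given document id.
    public String shardFor(String docId) {
        long h = hash(docId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        Long point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(point);
    }

    // First 8 bytes of MD5 packed into a long - any stable hash would do.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is part of every JDK
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        for (int i = 0; i < 5; i++) {
            ring.addShard("shard" + i);
        }
        System.out.println("doc-42 before: " + ring.shardFor("doc-42"));
        // Grow the cluster: only roughly 1/6 of the documents change shards.
        ring.addShard("shard5");
        System.out.println("doc-42 after:  " + ring.shardFor("doc-42"));
    }
}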

If you can point me to one reference or another where I can read about it,
it would help me a lot, since at the moment I just want to understand how it
works.

http://wiki.apache.org/solr/SolrCloud ...


The problem with Solr is its lack of documentation in some classes and the
lack of encapsulation of some very complex things in separate methods or
classes. Of course, this is because it costs extra time to do so,
but it makes understanding and modifying things very complicated if you do
not understand what's going on from a theoretical point of view.

In this case the lack of good docs and a user-level API can be blamed on the fact that this functionality is still under heavy development.


Since the cloud feature will be complex, a lack of documentation and no
understanding of the theory behind the code will make contributing back
very, very complicated.

For now, yes, it's an issue - though as soon as SolrCloud gets committed I'm sure people will follow up with user-level convenience components that will make it easier.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
