Re: basic solr cloud questions

2011-09-30 Thread Pulkit Singhal
SOLR-2355 is definitely a step in the right direction but something I
would like to get clarified:

a) There were some fixes to it that went on the 3.4 and 3.5 branches based
on the comments section ... are they not available or not needed on
4.x trunk?

b) Does this basic implementation distribute across shards or across
cores? I think that distributing across all the cores in a shard is
the key towards using it successfully with SolrCloud and I really
don't know if this does this right now as I am not familiar with the
source code. If someone could answer this it would be great; otherwise
I'll post back eventually when I do become familiar.

Cheers,
- Pulkit


Re: basic solr cloud questions

2011-09-30 Thread Pulkit Singhal
BTW, I updated the wiki with the following; I hope it keeps things simple for
others starting out:

Example B: Simple two shard cluster with shard replicas
Note: This setup leverages copy/paste to set up 2 cores per shard, and
distributed searches validate the successful completion of this
example/exercise. But DO NOT assume that any new data that you index
will be distributed across and indexed at each core of a given shard.
That will not happen. Distributed indexing is not part of SolrCloud
yet. You may, however, adapt a basic implementation of distributed
indexing by referring to SOLR-2355.




Re: basic solr cloud questions

2011-09-30 Thread Mark Miller
Thanks Pulkit!

I'd actually been meaning to add the post.jar commands needed to index a doc to 
each shard to the wiki. Waiting till I streamline a few things though.

- Mark


- Mark Miller
lucidimagination.com


Re: basic solr cloud questions

2011-09-30 Thread Yury Kats
On 9/30/2011 12:26 PM, Pulkit Singhal wrote:
> SOLR-2355 is definitely a step in the right direction but something I
> would like to get clarified:

Questions about SOLR-2355 are best asked in SOLR-2355 :)

> b) Does this basic implementation distribute across shards or across
> cores?

From a brief look, it seems to assume shard=core. You list
all cores in the config file under shards.



Re: basic solr cloud questions

2011-09-29 Thread Darren Govoni

That was kinda my point. The new cloud implementation
is not about replication, nor should it be. But rather about
horizontal scalability where nodes manage different parts
of a unified index. One of the design goals of the new cloud
implementation is for this to happen more or less automatically.

To me that means one does not have to manually distribute
documents or enforce replication as Yury suggests.
Replication is different to me than what was being asked.
And perhaps I misunderstood the original question.

Yury's response introduced the term core where the original
person was referring to nodes. For all I know, those are two
different things in the new cloud design terminology (I believe they are).

I guess understanding cores vs. nodes vs shards is helpful. :)

cheers!
Darren







Re: basic solr cloud questions

2011-09-29 Thread Yury Kats
On 9/29/2011 7:22 AM, Darren Govoni wrote:
> That was kinda my point. The new cloud implementation
> is not about replication, nor should it be. But rather about
> horizontal scalability where nodes manage different parts
> of a unified index.

It's about many things. You stated one, but there are other goals,
one of them being tolerance to node outages. In a cloud, when
one of your many nodes fails, you don't want to stop querying and
indexing. For this to happen, you need to maintain redundant copies
of the same pieces of the index, hence you need to replicate.

> One of the design goals of the new cloud
> implementation is for this to happen more or less automatically.

True, but there is a big gap between goals and current state.
Right now, there is distributed search, but not distributed indexing,
auto-sharding, or auto-replication. So if you want to use SolrCloud
now (as many of us do), you need to do a number of things yourself,
even if they might be done by SolrCloud automatically in the future.

> To me that means one does not have to manually distribute
> documents or enforce replication as Yury suggests.
> Replication is different to me than what was being asked.
> And perhaps I misunderstood the original question.
>
> Yury's response introduced the term core where the original
> person was referring to nodes. For all I know, those are two
> different things in the new cloud design terminology (I believe they are).
>
> I guess understanding cores vs. nodes vs shards is helpful. :)

A shard is a slice of the index. An index is managed/stored in a core.
Nodes are Solr instances, usually physical machines.

Each node can host multiple shards, and each shard can consist of
multiple cores. However, all cores within the same shard must have the
same content.

This is where the OP ran into the problem. The OP had 1 shard, consisting
of two cores on two nodes. Since there is no distributed indexing yet, all
documents were indexed into a single core. However, there is distributed
search, so queries were sent randomly to different cores of the same
shard. Since one core in the shard had documents and the other didn't,
the query result was random.

To solve this problem, the OP must make sure all cores within the same
shard (be they on the same node or not) have the same content. This can
currently be achieved by:
a) setting up replication between cores: you index into one core and the
   other core replicates the content
b) indexing into both cores

Hope this clarifies.
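
The failure mode and the two fixes described above can be illustrated with a toy model. This is only a minimal sketch in plain Python, not Solr code: Core, Shard, possible_answers, replicate_from and index_into_all are hypothetical names invented for the illustration, and a real setup would go through Solr's own update and replication handlers instead.

```python
import random


class Core:
    """Toy stand-in for a Solr core: holds one copy of a shard's index."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc id -> document

    def index(self, doc_id, doc):
        self.docs[doc_id] = doc

    def search(self):
        # Match-all query: return every doc id this core holds.
        return sorted(self.docs)


class Shard:
    """Toy stand-in for a shard: cores that are supposed to hold identical content."""

    def __init__(self, cores):
        self.cores = cores

    def search(self):
        # Distributed search: any one core of the shard may answer.
        return random.choice(self.cores).search()

    def possible_answers(self):
        # Every distinct result a query could get, depending on the core chosen.
        return {tuple(core.search()) for core in self.cores}

    def replicate_from(self, source):
        # Option (a): the other cores copy the content of the indexed core.
        for core in self.cores:
            if core is not source:
                core.docs = dict(source.docs)

    def index_into_all(self, doc_id, doc):
        # Option (b): the client sends each document to every core.
        for core in self.cores:
            core.index(doc_id, doc)


core1 = Core("node1/collection1")
core2 = Core("node2/collection1")
shard1 = Shard([core1, core2])

# The OP's situation: the document reaches only one core of the shard.
core1.index("doc1", {"title": "hello"})
before = shard1.possible_answers()  # two different answers are possible

shard1.replicate_from(core1)  # fix (a)
after_replication = shard1.possible_answers()

shard1.index_into_all("doc2", {"title": "world"})  # fix (b)
after_dual_indexing = shard1.possible_answers()
```

In this model, a query can return two different results while only core1 holds the document. Once the cores hold the same content, by either route, every core gives the same answer.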


Re: basic solr cloud questions

2011-09-29 Thread Darren Govoni

Agree. Thanks also for clarifying. It helps.





Re: basic solr cloud questions

2011-09-29 Thread Sami Siren
2011/9/29 Yury Kats yuryk...@yahoo.com:
> True, but there is a big gap between goals and current state.
> Right now, there is distributed search, but not distributed indexing,
> auto-sharding, or auto-replication. So if you want to use SolrCloud
> now (as many of us do), you need to do a number of things yourself,
> even if they might be done by SolrCloud automatically in the future.

There is a patch in Jira: https://issues.apache.org/jira/browse/SOLR-2355
that adds an update processor suitable for doing simple distributed
indexing with the current version of Solr.

--
 Sami Siren


Re: basic solr cloud questions

2011-09-28 Thread Pulkit Singhal
@Darren: I feel that the question itself is misleading. Creating
shards is meant to separate out the data ... not keep the exact same
copy of it.

I think the two-node setup that was attempted by Sam misled him and
us into thinking that configuring two nodes which are to be named
shard1 ... somehow means that they are instantly replicated too ...
this is not the case! I can see how this misunderstanding can develop
as I too was confused until Yury cleared it up.

@Sam: If you are interested in performing a quick exercise to
understand the pieces involved for replication rather than sharding
... perhaps this link would be of help in taking you through it:
http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html

- Pulkit




basic solr cloud questions

2011-09-27 Thread Sam Jiang
Hi all

I'm a relatively new solr user, and recently I discovered the interesting
solr cloud feature. I have some basic questions:
(please excuse me if I get the terminology wrong)

- from my understanding, this is still a work in progress. How mature is it?
Is there any estimate on the official release?

- has the solr_cluster.properties configuration been implemented? it's
mentioned in http://wiki.apache.org/solr/NewSolrCloudDesign. I was trying to
play with it a bit but I couldn't find the file.

- I tried to set up a two-node, 1-shard cluster, e.g. active-active Solr
with fault tolerance. (This isn't possible with the old replication feature,
right?) I have both instances of Solr configured to use <core
name="collection1" instanceDir="." shard="shard1"/>, and I started each
instance with its own instance of ZooKeeper to form an ensemble. From the
ZooKeeper admin page, I can see both nodes under shard1. I can submit
documents fine. However, when I do a search, it appears that only one node
has the submitted documents (e.g. if I keep refreshing, I get different
results depending on which node gets assigned the work). My search URL is
http://localhost:8983/solr/collection1/select?distrib=true&q=*.*. Did I miss
something?

thanks


Re: basic solr cloud questions

2011-09-27 Thread Darren Govoni

On 09/27/2011 05:05 PM, Yury Kats wrote:

> You need to either submit the docs to both nodes, or have a replication
> setup between the two. Otherwise they are not in sync.

I hope that's not the case. :/ My understanding (or hope maybe) is that
the new Solr Cloud implementation will support auto-sharding and
distributed indexing. This means that shards will receive different
documents regardless of which node received the submitted document
(spread evenly based on a hash-node assignment). Distributed queries
will thus merge all the solr shard/node responses.


This is similar in theory to how memcache and other big-scale DHTs
work. If it's just manually replicated indexes then it's not really a step
forward from current Solr. :/
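
As a rough illustration of the hash-based assignment hoped for above, here is a minimal Python sketch. It is not how SolrCloud does or will do it: shard_for, index_document and distributed_search are hypothetical names, the cluster is just a dict, and the scheme is plain modulo hashing over a SHA-256 digest of the document id rather than the consistent hashing a real DHT would typically use.

```python
import hashlib


def shard_for(doc_id, shard_names):
    """Pick a shard from a stable hash of the document id (simple modulo hashing)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return shard_names[int.from_bytes(digest[:8], "big") % len(shard_names)]


def index_document(cluster, doc_id, doc):
    # The receiving node does not matter: the id alone decides the shard.
    cluster[shard_for(doc_id, sorted(cluster))][doc_id] = doc


def distributed_search(cluster):
    # Ask every shard for its matches and merge the responses.
    merged = []
    for docs in cluster.values():
        merged.extend(docs)
    return sorted(merged)


cluster = {"shard1": {}, "shard2": {}}  # shard name -> documents held by that shard
doc_ids = ["doc%d" % i for i in range(1, 9)]
for doc_id in doc_ids:
    index_document(cluster, doc_id, {"id": doc_id})

all_hits = distributed_search(cluster)
```

Each document lands in exactly one shard, and a match-all query sees the whole collection only after the per-shard responses are merged. Note that this covers sharding only; it does nothing to keep multiple cores of the same shard in sync.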





Re: basic solr cloud questions

2011-09-27 Thread Yury Kats
On 9/27/2011 5:16 PM, Darren Govoni wrote:
> On 09/27/2011 05:05 PM, Yury Kats wrote:
>> You need to either submit the docs to both nodes, or have a replication
>> setup between the two. Otherwise they are not in sync.
> I hope that's not the case. :/ My understanding (or hope maybe) is that
> the new Solr Cloud implementation will support auto-sharding and
> distributed indexing. This means that shards will receive different
> documents regardless of which node received the submitted document
> (spread evenly based on a hash-node assignment). Distributed queries
> will thus merge all the solr shard/node responses.

All cores in the same shard must somehow have the same index.
Only then can you continue servicing searches when individual cores
fail. Auto-sharding and distributed indexing don't have anything to
do with this.

In the future, SolrCloud may be managing replication between cores
in the same shard automatically. But right now it does not.