Re: solr cloud concepts

2012-02-09 Thread Mark Miller

On Feb 8, 2012, at 11:27 PM, Adeel Qureshi wrote:

> to
> create new collections its not that automated right ..

It can be fairly automated. If you have uploaded the configuration sets for
both collections, you can then create new collections that use one of those
configuration sets using CoreAdminHandler commands. You would just create as
many SolrCores as the number of instances (or shards) that you wanted,
specifying which collection they belong to when you do. Essentially, a new
collection is created the first time a SolrCore that has been set to a new
collection name starts.

So if you wanted a new collection with new configuration you would do something 
like:

* Upload the new configuration files and call them config2.

* Using the CoreAdminHandler, create a new core on Solr instance 1 with
collection name 'collection2', using the conf set 'config2', and shard it into
2 so that the index will span 2 Solr instances. This core will be
auto-assigned shard1.

* Using the CoreAdminHandler, create a new core on Solr instance 2 with
collection name 'collection2' and the conf set 'config2'. This core will be
auto-assigned shard2.

* Using the CoreAdminHandler, create a new core on Solr instance 3 with
collection name 'collection2' and the conf set 'config2'. This core will
replicate shard1 or shard2 for query load and data redundancy.

* Using the CoreAdminHandler, create a new core on Solr instance 4 with
collection name 'collection2' and the conf set 'config2'. This core will
replicate the other shard (shard2, if instance 3 took shard1), again for query
load and data redundancy.

That would give you 4 instances, with half of the collection's index on 2
instances and the other half on the other 2. Each half will have a duplicate
instance, so you have 2 copies of the index in the cluster.
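
The steps above could be scripted roughly as follows. This is only a sketch
that builds the CoreAdmin request URLs without sending them. The host names
and the core name are made up, and the parameter names (collection,
collection.configName, numShards) are assumptions taken from later CoreAdmin
documentation, so check them against the Solr version you are running.

```python
from urllib.parse import urlencode

# Hypothetical Solr instances; substitute your own hosts and ports.
INSTANCES = [
    "http://solr1.example.com:8983/solr",
    "http://solr2.example.com:8983/solr",
    "http://solr3.example.com:8983/solr",
    "http://solr4.example.com:8983/solr",
]


def create_core_url(base_url, core_name, collection, config_name, num_shards=None):
    """Build a CoreAdmin CREATE URL. Nothing is sent over the network."""
    params = {
        "action": "CREATE",
        "name": core_name,
        "collection": collection,
        # Assumed parameter name for selecting an uploaded config set.
        "collection.configName": config_name,
    }
    if num_shards is not None:
        # Assumed parameter name; only passed when the collection is first created.
        params["numShards"] = num_shards
    return f"{base_url}/admin/cores?{urlencode(params)}"


def plan_collection(instances, collection, config_name, num_shards):
    """Return one CREATE URL per instance; only the first carries numShards."""
    urls = []
    for i, base_url in enumerate(instances):
        urls.append(
            create_core_url(
                base_url,
                f"{collection}_core",  # hypothetical core name
                collection,
                config_name,
                num_shards if i == 0 else None,
            )
        )
    return urls


if __name__ == "__main__":
    for url in plan_collection(INSTANCES, "collection2", "config2", 2):
        print(url)
```

Each printed URL corresponds to one of the four bullets; issuing them in order
(for example with curl) would create the four cores one at a time.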

- Mark Miller
lucidimagination.com


Re: solr cloud concepts

2012-02-09 Thread Bruno Dumon
On Thu, Feb 9, 2012 at 5:27 AM, Adeel Qureshi wrote:

> Thanks for the explanation. It makes sense but I am hoping that you can
> clarify things a bit more ..
>
> so now it sounds like in solrcloud the concept of cores have changed a bit
> .. as you explained that for me to have 2 cores with different schemas I
> will need 2 different collections .. and one good thing about solrcores was
> that you could create new ones with coreadmin api or the http calls .. to
> create new collections its not that automated right ..
>
> secondly if collections represent what kind of used to be solrcore then
> once i have a collection .. why would i ever want to add multiple cores to
> it .. i mean i am trying to think of a reason why it would make sense to do
> that.
>

Hi Adeel,

A core is still what it was before: it provides indexing & search for one
physical index. The concepts of collections and slices layer on top of it.
A core corresponds one-to-one with a shard.

So you have:
collection -> slice -> shard = core

Each slice contains a subset of the data of the collection. All the shards
within one slice are replicas, and thus contain the same data. All the actual
data/indexes are in the cores. Collections, slices and shards are logical
concepts that only exist in ZooKeeper. Thus a collection in itself isn't a
physical index; only the cores below it contain actual data.
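
To make the hierarchy concrete, here is a small Python sketch of
collection -> slice -> shard = core as plain data classes. The class and
field names are purely illustrative and are not Solr's own; the only point is
that the data lives in the cores while the levels above are bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """A physical index on some node (a 'shard' in the terminology above)."""
    name: str
    node: str


@dataclass
class Slice:
    """A logical partition of the collection; all its shards are replicas."""
    name: str
    shards: list = field(default_factory=list)


@dataclass
class Collection:
    """A logical index: one schema/config, one or more slices."""
    name: str
    config_name: str
    slices: dict = field(default_factory=dict)

    def add_core(self, slice_name, core):
        # Create the slice on first use, then register the core as a replica.
        self.slices.setdefault(slice_name, Slice(slice_name)).shards.append(core)

    def cores(self):
        # The actual data lives only in the cores at the bottom of the tree.
        return [core for s in self.slices.values() for core in s.shards]


if __name__ == "__main__":
    c = Collection("collection1", "config1")
    c.add_core("slice1", Core("core_a", "node1"))
    c.add_core("slice1", Core("core_b", "node2"))
    c.add_core("slice2", Core("core_c", "node3"))
    c.add_core("slice2", Core("core_d", "node4"))
    print(len(c.slices), "slices,", len(c.cores()), "cores")
```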

All the cores within one collection will use the same schema (the schema
associated with the collection), since they are part of the same logical
index.

You can still use the coreadmin API to create cores (that's what I've done
in my blog), but in SolrCloud a core must always be associated with a
[slice in a] collection. Thus when you create a core it either becomes part
of an existing collection, or a new collection is created.

HTH,

Bruno.

-- 
Bruno Dumon
Outerthought
http://outerthought.org/


Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
Thanks for the explanation. It makes sense but I am hoping that you can
clarify things a bit more ..

so now it sounds like in solrcloud the concept of cores have changed a bit
.. as you explained that for me to have 2 cores with different schemas I
will need 2 different collections .. and one good thing about solrcores was
that you could create new ones with coreadmin api or the http calls .. to
create new collections its not that automated right ..

secondly if collections represent what kind of used to be solrcore then
once i have a collection .. why would i ever want to add multiple cores to
it .. i mean i am trying to think of a reason why it would make sense to do
that.

Thanks



Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 9:36 PM, Jamie Johnson wrote:

> Mark,
> is the recommendation now to have each solr instance be a separate core in
> solr cloud? I had thought that the core name was by default the collection
> name? Or are you saying that although they have the same name they are
> separate because they are in different JVMs?

By default, the collection name is set to the core name. This is really just
for convenience when you are getting started. It gives you a default collection
name of collection1 because the default SolrCore name is collection1, and each
SolrCore on each instance is addressable as /solr/collection1.

You can certainly have core names be whatever you want and explicitly pass the
core's collection. In that case, the URL for each would be different - though I
think there is an open JIRA issue about making that nicer - so that you can
look up the right core even if you pass the collection name or something.
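
As a rough illustration of that defaulting rule (not Solr code; the function
names and the base URL are invented for the example):

```python
def collection_for(core_name, collection=None):
    """If no collection is passed explicitly, fall back to the core name."""
    return collection if collection is not None else core_name


def core_url(base_url, core_name):
    """A core is addressed by its core name, not by its collection name."""
    return f"{base_url.rstrip('/')}/{core_name}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # hypothetical instance
    # Default case: core 'collection1' lands in collection 'collection1'.
    print(collection_for("collection1"), core_url(base, "collection1"))
    # Explicit case: a differently named core joins the same collection,
    # but its URL follows the core name.
    print(collection_for("mycore", "collection1"), core_url(base, "mycore"))
```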

- Mark Miller
lucidimagination.com


Re: solr cloud concepts

2012-02-08 Thread Jamie Johnson
Mark,
is the recommendation now to have each solr instance be a separate core in
solr cloud? I had thought that the core name was by default the collection
name? Or are you saying that although they have the same name they are
separate because they are in different JVMs?



Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 5:26 PM, Adeel Qureshi wrote:

> okay so after reading Bruno's blog post .. lets add slice to the mix as
> well .. so we have got collections, cores, shards, partitions and slices :)
> ..

Yeah - heh - this has bugged me, but we have not all really come to an
agreement on terminology here. I was a fan of using shard for each node and
slice for partition. Another couple of committers wanted partitions rather than
slice. Another says slice in code, shard for both in terminology and use
context...

I'd even go for shards as partitions and replicas for every node in a shard. 
But those fine points are still settling ;)

> 
> The whole point with cores is to be able to have different schemas on the
> same solr server instance. So how does that changes with collections .. may
> be an example might help .. if I want to setup a solrcloud cluster with 2
> cores (different schema) .. with each core having 2 shards (i m assuming
> shards are really partitions here, across multiple nodes in the cluster) ..
> with one shard being the replica..

So this would mean you want to create 2 collections. Think of a collection as a 
bunch of SolrCores that all share the same schema and config. 

So you would start up 2 nodes set to one collection and with numShards=1. That
will give you one shard hosted by two identical SolrCores, giving you a
replication factor of 2. The full index will be in each of the two SolrCores.

Then if you start another two nodes and specify a different collection name,
you will get the same thing, but distinct from your first collection (although,
if both collections have compatible schema/config you can still search across
them).
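
A toy model of that behaviour, for intuition only. The assignment rule used
here (fill numShards first, then add each new core to the shard with the
fewest copies) is an assumption made for the example and is not taken from
the SolrCloud source.

```python
def assign_shard(counts, num_shards):
    """Pick a shard for a newly started core.

    counts maps shard name -> number of cores already hosting it.
    """
    if len(counts) < num_shards:
        # Still creating the initial shards.
        return f"shard{len(counts) + 1}"
    # All shards exist: the new core becomes a replica of the least-replicated
    # shard (ties go to the lowest-numbered shard, via dict insertion order).
    return min(counts, key=counts.get)


def simulate(num_cores, num_shards):
    """Start cores one at a time and record which shard each one joins."""
    counts = {}
    layout = []
    for _ in range(num_cores):
        shard = assign_shard(counts, num_shards)
        counts[shard] = counts.get(shard, 0) + 1
        layout.append(shard)
    return layout


if __name__ == "__main__":
    print(simulate(2, 1))  # two nodes, numShards=1
    print(simulate(4, 2))  # four nodes, numShards=2
```

With numShards=1 both cores end up on the same shard, which is the "two
identical SolrCores" case described above.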


- Mark Miller
lucidimagination.com


Re: solr cloud concepts

2012-02-08 Thread Adeel Qureshi
okay so after reading Bruno's blog post .. lets add slice to the mix as
well .. so we have got collections, cores, shards, partitions and slices :)
..

The whole point with cores is to be able to have different schemas on the
same solr server instance. So how does that changes with collections .. may
be an example might help .. if I want to setup a solrcloud cluster with 2
cores (different schema) .. with each core having 2 shards (i m assuming
shards are really partitions here, across multiple nodes in the cluster) ..
with one shard being the replica..




Re: solr cloud concepts

2012-02-08 Thread Bruno Dumon
Hi Adeel,

I just started looking into SolrCloud and had some of the same questions.

I wrote a blog post with the understanding I have gained so far; maybe it will
help you:

http://outerthought.org/blog/491-ot.html

Regards,

Bruno.




-- 
Bruno Dumon
Outerthought
http://outerthought.org/


Re: solr cloud concepts

2012-02-08 Thread Mark Miller

On Feb 8, 2012, at 10:31 AM, Adeel Qureshi wrote:

> I have been using solr for a while and have recently started getting into
> solrcloud .. i am a bit confused with some of the concepts ..
> 
> 1. what exactly is the relationship between a collection and the core ..
> can a core has multiple collections in it .. in this case all collections
> within this core will have the same schema .. and i am assuming all
> instances of collections within the core can be deployed on different solr
> nodes to achieve distributed search ..
> or is it the other way around where a collection can have multiple cores

Currently, a core basically equals a replica of the index.

So you might have a collection called collection1. Let's say it has 2 shards
and each shard has a single replica (so 2 cores per shard):

Collection1
shard1 replica1
shard1 replica2
shard2 replica1
shard2 replica2

Each of those replicas is a core. So a collection has multiple cores basically. 
Also, each of those cores can be on a different machine. So yes, you have 
distributed indexing and distributed search.

> 
> 2. at some places it has been pointed out that solrcloud doesnt actually
> supports replication .. but in the solrcloud wiki the second example is
> supposed to be for replication .. so does solrcloud at this point supports
> automatic replication where as you add more servers it automatically uses
> the additional servers as replicas

SolrCloud doesn't support the old style Solr replication concept. It does,
however, handle replication - it's just all pretty much automatic and behind
the scenes. E.g., all the information about Solr replication in the wiki
documentation for previous versions of Solr is really not applicable. We now
achieve replica copies by sending documents to each shard one document at a
time so that we can support near realtime search. The old style replication is
only used in recovery, or when you start a new replica machine and it has to
'catch up' to the other replicas.
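
Conceptually, the per-document forwarding looks something like the toy model
below. It is in-memory only and is not how Solr is implemented: the class
names are invented, and routing by CRC32 of the document id is just a
stand-in assumption for whatever hashing SolrCloud actually uses.

```python
import zlib


class Replica:
    """One core: holds its own full copy of the documents for its shard."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def index(self, doc):
        self.docs[doc["id"]] = doc


class ToyCollection:
    def __init__(self, num_shards, replicas_per_shard):
        self.shards = [
            [Replica(f"shard{s + 1}_replica{r + 1}") for r in range(replicas_per_shard)]
            for s in range(num_shards)
        ]

    def shard_for(self, doc_id):
        # Assumed routing rule for the example: hash the id, modulo shard count.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def add(self, doc):
        # Each document is forwarded, one at a time, to every replica of the
        # shard that owns it, so all copies stay current without bulk copying.
        for replica in self.shards[self.shard_for(doc["id"])]:
            replica.index(doc)


if __name__ == "__main__":
    coll = ToyCollection(num_shards=2, replicas_per_shard=2)
    for i in range(10):
        coll.add({"id": f"doc{i}", "title": f"Document {i}"})
    for shard in coll.shards:
        for replica in shard:
            print(replica.name, sorted(replica.docs))
```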

> 
> I have a few more questions but I wanted to get these basic ones out of the
> way first .. I would appreciate any response.

Fire away.

> 
> Thanks
> Adeel

- Mark Miller
lucidimagination.com