RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni
Good write up. 


And what about node?

I think there needs to be an official glossary of terms that is sanctioned by the Solr
team, and some terms still in use may need to be labeled deprecated. After so
many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.

Instance is a general term, but is commonly used to refer to a running Solr
server, each of which can service any number of cores. A sharded collection
would typically require multiple instances of Solr, each with a shard of the
collection.

Multiple collections can be supported on a single instance of Solr. They
don't have to be sharded or replicated. But if they are, each Solr instance
will have a copy or replica of the data (index) of one shard of each sharded
collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and
giving the collection name. Some operations will refer only to the portion
of a multi-shard collection on that Solr instance, but typically Solr will
distribute the operation, whether it be an update or a query, to all of
the shards of the named collection. In the case of update, the update will
be distributed to all replicas as well, but in the case of query only one
replica of each shard of the collection is needed.
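
The update/query distinction above can be sketched in a few lines of Python. This is only an illustration: the `LAYOUT` dictionary, the core names, and both helper functions are hypothetical and are not Solr APIs, and the sketch assumes that a given document belongs to exactly one shard.

```python
import random

# Hypothetical layout of one collection: shard name -> cores holding a replica.
LAYOUT = {
    "shard1": ["host1:8983/core_a", "host2:8983/core_b"],
    "shard2": ["host3:8983/core_c", "host4:8983/core_d"],
}


def query_targets(layout, rng=random):
    """A query needs only one replica of each shard, so pick one per shard."""
    return {shard: rng.choice(replicas) for shard, replicas in layout.items()}


def update_targets(layout, shard):
    """An update must reach every replica of the shard that owns the document."""
    return list(layout[shard])


if __name__ == "__main__":
    print(query_targets(LAYOUT))             # one core per shard, chosen at random
    print(update_targets(LAYOUT, "shard1"))  # every core of shard1
```

Which replica serves a query is left to chance here; the only point is that a query touches one core per shard while an update touches all cores of the affected shard.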

Before SolrCloud, Solr had master and slave, and the slaves were replicas
of the master, but with SolrCloud there is no master and all the replicas of
the shard are peers, although at any moment in time one of them will be
considered the leader for coordination purposes, but not in the sense that
it is a master of the other replicas in that shard. A SolrCloud replica is a
replica of the data, in an abstract sense, for a single shard of a
collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr
core will have a Lucene index, but collectively the Lucene indexes for the
shards of a collection can be referred to as the index of the collection. Each
replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the
cores/replicas of a single shard, or sometimes to a single replica as it
contains only a slice of the full collection data.

-- Jack Krupansky

-Original Message-
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for
correct rather than loose meaning as I am trying to teach an example that
starts from an easy scenario and may scale to a multi-core, multi-machine
situation.

Here are the terms that seem to be all overlapping and/or crossing over in
my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology
drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with each
core being a replica of the other cores within that shard of that
collection.


A collection is sharded, meaning it is distributed across cores. A shard itself
is not distributed across cores in the same sense. Rather, a shard exists on a
single core and is replicated on other cores. Is that right? The way it's worded
above, it sounds like a shard can also be sharded...


--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical box. Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr cloud
as well, a virtual node. So, technically, you could have multiple replicas
on the same node, but that sort of defeats most of the purpose of having
replicas in the first place - to distribute the data for performance and
fault tolerance. But, you could have replicas of different shards on the
same node/box for a partial improvement of performance and fault tolerance.
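
The placement rule described above (replicas of the same shard should sit on different nodes, while replicas of different shards may share one) can be checked mechanically. The following is a minimal sketch in Python; the `(shard, node)` pair format and the function name are assumptions made for illustration and do not correspond to any Solr API.

```python
from collections import defaultdict


def shards_with_colocated_replicas(placement):
    """Return the shards that have two or more replicas on the same node.

    placement: list of (shard, node) pairs, one pair per replica
    (a hypothetical format chosen for this sketch).
    """
    counts = defaultdict(int)
    for shard, node in placement:
        counts[(shard, node)] += 1
    return sorted({shard for (shard, _node), n in counts.items() if n > 1})


# shard1 is spread over two nodes; both replicas of shard2 sit on nodeA.
placement = [
    ("shard1", "nodeA"), ("shard1", "nodeB"),
    ("shard2", "nodeA"), ("shard2", "nodeA"),
]
print(shards_with_colocated_replicas(placement))  # ['shard2']
```

Note that nodeA hosting replicas of both shard1 and shard2 is not flagged, which matches the "partial improvement" case in the text.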

A Solr "cloud" is really a cluster.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:16 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned
by the Solr team, and some terms still in use may need to be labeled
deprecated. After so many years, it's still confusing.


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks. I got that part.

A group of shards (and therefore cores) represent a collection, yes. But a single shard exists only on a single core?


--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:

No, a shard is a subset (or slice) of the collection. Sharding is a way of
slicing the original data, before we talk about how the shards get stored
and replicated on actual Solr cores. Replicas are instances of the data for
a shard.
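
A small Python sketch may help separate the two steps: first slice the data into shards, then replicate each shard. The hash-modulo rule below is a stand-in chosen for illustration and is not Solr's routing algorithm; the function names and parameters are likewise hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick the shard (logical subset) for a document id using a stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_collection(doc_ids, num_shards, replicas_per_shard):
    """Slice the ids into shards, then copy each shard's data into its replicas."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    # Replication copies a shard's data; it never subdivides the shard further.
    return [[list(shard) for _ in range(replicas_per_shard)] for shard in shards]


if __name__ == "__main__":
    collection = build_collection([f"doc{i}" for i in range(10)], 2, 3)
    for index, replicas in enumerate(collection):
        print(f"shard {index}: {len(replicas)} replicas of {len(replicas[0])} docs")
```

Every replica of a shard holds the same documents, and the shards together hold each document exactly once.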

Sometimes people may loosely speak of a replica as being a shard, but
that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 8:51 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks again. (And sorry to jump into this convo)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with
each core being a replica of the other cores within that shard of that
collection.

A collection is sharded, meaning it is distributed across cores. A shard
itself is not distributed across cores in the same sense. Rather, a shard
exists on a single core and is replicated on other cores. Is that right? The
way it's worded above, it sounds like a shard can also be sharded...


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

I think what's confusing about your explanation below is when you have a 
situation where there is no replication factor. That's possible too, yes?

So in that case, is each core of a shard of a collection still referred to as a replica?

To me a replica is a duplicate/backup of a shard's core, not the sharded core
itself. Or is there just no difference? Even a non-replicated core is called a
replica?


--- Original Message ---
On 1/3/2013 09:08 AM Jack Krupansky wrote:

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Yes. And it's worth noting that when you have multiple shards in a single
node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise node to note that in SolrCloud a node is simply an
instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr,
separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Ah, ok. Good. Makes sense.

I think I will draw all this up in a UML that includes the distinction between the 
logical terms and the physical terms (and their mapping) as they do get 
intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:

A single shard MAY exist on a single core, but only if it is not replicated.
Generally, a single shard will exist on multiple cores, each a replica of
the source data as it comes into the update handler.

-- Jack Krupansky

-Original Message-
From: Darren Govoni
Sent: Thursday, January 03, 2013 9:10 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks. I got that part.

A group of shards (and therefore cores) represent a collection, yes. But a
single shard exists only on a single core?


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:

On 1/3/13 4:33 PM, Mark Miller wrote:
This has pretty much become the standard across other distributed systems
and in the literat…err…books.

Hmmm, I'm not sure you are right about that. Maybe more than one
distributed system calls them Replica, but there are also a lot that
don't. But if you are right, that's at least a good valid argument to do
it this way, even though I generally prefer good logical naming over
following bad naming from the industry :-) Just because there is a lot
of crap out there doesn't mean that we also want to make crap. Maybe
good logical naming could even be a small entry in the "Why Solr is
better than its competitors" list :-)


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

And based on the previous explanation there is never a copy of a shard. A 
shard represents and contains only replicas for itself, replicas being copies of cores 
within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:

A factor is multiplied, so multiplying the leader by a replicationFactor of 1
means you have exactly one copy of that shard.

I think that recycling the term replication within Solr was confusing, but it
is a bit late to change that.

wunder

On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:

This has pretty much become the standard across other distributed systems
and in the literat…err…books.

I first implemented it as you mention you'd like, but Yonik correctly pointed
out that we were going against the grain.

- Mark

On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:

For the same reasons that Replica shouldn't be called Replica (it requires too
long an explanation to agree that it is an OK name), replicationFactor
shouldn't be called replicationFactor as long as it refers to the TOTAL number
of cores you get for your Shard. replicationFactor would be an OK name if
replicationFactor=0 meant one core, replicationFactor=1 meant two cores, etc.,
but as long as replicationFactor=1 means one core and replicationFactor=2
means two cores, it is bad naming (you will not get any replication with
replicationFactor=1 - WTF!?!?). If we want to insist that you specify the
total number of cores, at least use replicaPerShard instead of
replicationFactor, or even better rename Replica to Shard-instance and use
instancesPerShard instead of replicationFactor.
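
The two readings of replicationFactor contrasted above come down to an off-by-one difference, which a short Python sketch can make explicit. The function names are hypothetical and exist only for this illustration; only the parameter name replicationFactor comes from the thread.

```python
def cores_per_shard_total(replication_factor):
    """Reading described as current: the factor is the TOTAL number of cores per shard."""
    if replication_factor < 1:
        raise ValueError("at least one core is needed to hold the shard")
    return replication_factor


def cores_per_shard_extra(replication_factor):
    """Alternative reading: the factor counts EXTRA copies beyond the first core."""
    if replication_factor < 0:
        raise ValueError("the number of extra copies cannot be negative")
    return replication_factor + 1


def total_cores(num_shards, replication_factor):
    """Cores in a whole collection under the 'total' reading."""
    return num_shards * cores_per_shard_total(replication_factor)


if __name__ == "__main__":
    print(cores_per_shard_total(1))  # 1: a single core, so no redundancy
    print(cores_per_shard_extra(1))  # 2: the first core plus one extra copy
    print(total_cores(3, 2))         # 6: three shards, two cores each
```

Under the "total" reading, replicationFactor=1 yields a single core per shard, which is the case the message objects to.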

Regards, Per Steffensen

On 1/3/13 3:52 PM, Per Steffensen wrote:

Hi

Here is my version - I do not believe the explanations have been very clear.

We have the following concepts (here I will try to explain what each concept
covers without naming it - it's hard)

1) Machines (virtual or physical) running Solr server JVMs (one machine
can run several Solr server JVMs if you like)
2) Solr server JVMs
3) Logical stores where you can add/update/delete data-instances (closest to
logical tables in RDBMS)
4) Logical slices of a store (closest to non-overlapping logical sets of rows
for the logical table in a RDBMS)
5) Physical instances of slices (a physical (disk/memory) instance of a
logical slice). This is where data actually goes on disk - the logical stores
and slices above are just non-physical concepts

Terminology

1) Believe we have no name for this (except of course machine :-) ), even
though Jack claims that this is called a node. Maybe sometimes it is called a
node, but I believe node is more often used to refer to a Solr server JVM.
2) Node
3) Collection
4) Shard. Used to be called Slice but I believe now it is officially called
Shard. I agree with that change, because I believe most of the industry also
uses the term Shard for this logical/non-physical concept - it just needs to
be reflected across documentation and code
5) Replica. Used to be called Shard but I believe now it is officially called
Replica. I certainly do not agree with the name Replica, because it suggests
that it is a copy of an original, but it isn't. I would prefer Shard-instance
here, to avoid the confusion. I understand that you can argue (if you argue
long enough) that Replica is a fine name, but you really need the explanation
to understand why Replica can be defended as the name for this. It is not
immediately obvious what this is as long as it is called Replica. A Replica is
basically a Solr Cloud managed Core and behind every Replica/Core lives a
physical Lucene index. So a Replica (= Core) contains/maintains a Lucene index
behind the scenes. The term Replica also needs to be reflected across
documentation and code.
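
The mapping from concepts to terms in the two lists above can be drawn up as a small data model. This Python sketch is one possible reading of that list, not an official Solr structure; the class and field names are chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    """Physical instance of a shard: a SolrCloud-managed core backed by a Lucene index."""
    core_name: str
    node: str  # the Solr server JVM hosting this core


@dataclass
class Shard:
    """Logical, non-overlapping slice of a collection."""
    name: str
    replicas: List[Replica] = field(default_factory=list)


@dataclass
class Collection:
    """Logical store, closest to a table in an RDBMS."""
    name: str
    shards: List[Shard] = field(default_factory=list)

    def cores(self):
        """All physical cores that together hold this collection's data."""
        return [r.core_name for s in self.shards for r in s.replicas]


if __name__ == "__main__":
    books = Collection("books", [
        Shard("shard1", [Replica("books_s1_r1", "jvm1"), Replica("books_s1_r2", "jvm2")]),
        Shard("shard2", [Replica("books_s2_r1", "jvm2")]),
    ])
    print(books.cores())
```

Only `Replica` carries a physical location; `Collection` and `Shard` are purely logical groupings, which is the distinction the list draws.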

Regards, Per Steffensen

--
Walter Underwood
wun...@wunderwood.org