Re: Large number of collections in SolrCloud

2015-08-03 Thread Mukesh Jha
We have a similar date- and language-based collection setup.
We also ran into the same issue of a huge clusterstate.json file, which
took an eternity to load.

In our case the searches were language-specific, so we moved to multiple
Solr clusters, each with a different ZK namespace per language; something
you might look at.
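To make the per-language split concrete, here is a small illustrative sketch of how the ZooKeeper connect strings might look when each language's cluster lives under its own ZK chroot; the hostnames and language codes are hypothetical placeholders, not anything from this thread:

```python
# Illustrative sketch: one SolrCloud cluster per language, each under its own
# ZooKeeper chroot so each cluster's state stays small and isolated.
# Hostnames and language codes below are hypothetical.
ZK_ENSEMBLE = "zk1:2181,zk2:2181,zk3:2181"

def zk_connect_string(language: str) -> str:
    """Build the zkHost string (the -z argument) for one language's cluster."""
    return f"{ZK_ENSEMBLE}/solr/{language}"

for lang in ["en", "fr", "de"]:
    print(zk_connect_string(lang))
```

Each cluster then maintains its own clusterstate under its chroot, so no single ZK node has to hold state for every language at once.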
On 27 Jul 2015 20:47, Olivier olivau...@gmail.com wrote:

 Hi,

 I have a SolrCloud cluster with 3 nodes: 3 shards per node and a
 replication factor of 3.
 There are around 1000 collections, and all of them use the same
 Zookeeper configuration.
 So when I create each collection, the configuration is pulled from ZK
 and the configuration files are stored in the JVM.
 I thought that if the configuration was the same for each collection, the
 impact on the JVM would be insignificant, because the configuration should
 be loaded only once. But that is not the case: for each collection created,
 the JVM size increases because the configuration is loaded again. Am I
 correct?

 With a small configuration folder there is no problem: the folder is less
 than 500 KB, so with 1000 collections x 500 KB the JVM impact is about
 500 MB.
 But we manage a lot of languages with several dictionaries, so our
 configuration folder is about 6 MB. The JVM impact is now much larger: it
 can be more than 6 GB (1000 x 6 MB).
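As a back-of-envelope check of the estimate above, assuming (as Olivier observes) that every collection really does load its own copy of the configuration into the JVM:

```python
# Worst-case heap cost if each collection duplicates its configuration,
# mirroring the arithmetic in the message above.
def heap_impact_mb(num_collections: int, config_size_mb: float) -> float:
    """Estimated heap (MB) when every collection holds its own config copy."""
    return num_collections * config_size_mb

print(heap_impact_mb(1000, 0.5))  # small config: 500.0 MB
print(heap_impact_mb(1000, 6.0))  # dictionary-heavy config: 6000.0 MB (~6 GB)
```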

 So I would like feedback from people who also run a cluster with a large
 number of collections. Do I have to change some settings to handle this
 case better? What can I do to optimize this behaviour?
 For now we have just increased the RAM per node to 16 GB, but we plan to
 increase the number of collections.

 Thanks,

 Olivier



Re: Solr Cluster management having too many cores

2014-08-08 Thread Mukesh Jha
Hey Shawn, Erik,

I was wondering if there is a JIRA story that I can track for splitting the
current clusterstate.json into collection-specific cluster state.
I looked around a bit but couldn't find anything useful on that.


On Mon, Apr 28, 2014 at 7:43 AM, Shawn Heisey s...@elyograg.org wrote:

 On 4/28/2014 5:05 AM, Mukesh Jha wrote:
  Thanks Erik,
 
  Sounds about right.
 
  BTW how long can I keep adding collections i.e. can I keep 5/10 years
 data
  like this?
 
  Also what do you think of bullet 2) of having collection specific
  configurations in zookeeper?

 Regarding bullet 2, there is work underway right now to create a
 separate clusterstate within zookeeper for each collection.  I do not
 know how far along that work is.

 There are no hard limits in SolrCloud at all.  The things that will
 cause issues with scalability are resource-related problems.  You'll
 exceed ZooKeeper's default 1 MB limit on a single znode pretty quickly.  If
 you're not using the example jetty included with Solr, you'll exceed the
 default maxThreads on most servlet containers very quickly.  You may run
 into problems with the default limits on Solr's HttpShardHandler.

 Running hundreds or thousands of cores efficiently will require lots of
 RAM, both for the OS disk cache and the java heap.  A large java heap
 will require significant tuning of Java garbage collection parameters.

 Most operating systems limit a user to 1024 open files and 1024 running
 processes (which includes threads).  These limits will need to be
 increased.
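As a quick way to check those limits from a running process, here is a small sketch using Python's stdlib `resource` module (Unix-only; the exact values vary by system, and the limits mentioned here are typical defaults, not guarantees):

```python
import resource

# Inspect the open-file limit (RLIMIT_NOFILE) for the current process; many
# Linux distributions default the soft limit to 1024, which a node hosting
# hundreds of cores can exhaust quickly.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", soft, "hard =", hard)

# A process may raise its own soft limit up to the hard limit without root,
# e.g.: resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))
# Raising the hard limit itself is typically done via
# /etc/security/limits.conf (nofile / nproc entries).
```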

 There may be other limits imposed by the Solr config, Java, and/or the
 operating system that I have not thought of or stated here.

 Thanks,
 Shawn




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Re: Solr Cluster management having too many cores

2014-08-08 Thread Mukesh Jha
Looks like https://issues.apache.org/jira/browse/SOLR-5473 is the story :)


On Fri, Aug 8, 2014 at 9:30 PM, Mukesh Jha me.mukesh@gmail.com wrote:

 Hey Shawn, Erik,

 I was wondering if there is a JIRA story that I can track for splitting the
 current clusterstate.json into collection-specific cluster state.
 I looked around a bit but couldn't find anything useful on that.


 On Mon, Apr 28, 2014 at 7:43 AM, Shawn Heisey s...@elyograg.org wrote:

 On 4/28/2014 5:05 AM, Mukesh Jha wrote:
  Thanks Erik,
 
  Sounds about right.
 
  BTW how long can I keep adding collections i.e. can I keep 5/10 years
 data
  like this?
 
  Also what do you think of bullet 2) of having collection specific
  configurations in zookeeper?

 Regarding bullet 2, there is work underway right now to create a
 separate clusterstate within zookeeper for each collection.  I do not
 know how far along that work is.

 There are no hard limits in SolrCloud at all.  The things that will
 cause issues with scalability are resource-related problems.  You'll
 exceed the 1MB default limit on a zookeeper database pretty quickly.  If
 you're not using the example jetty included with Solr, you'll exceed the
 default maxThreads on most servlet containers very quickly.  You may run
 into problems with the default limits on Solr's HttpShardHandler.

 Running hundreds or thousands of cores efficiently will require lots of
 RAM, both for the OS disk cache and the java heap.  A large java heap
 will require significant tuning of Java garbage collection parameters.

 Most operating systems limit a user to 1024 open files and 1024 running
 processes (which includes threads).  These limits will need to be
 increased.

 There may be other limits imposed by the Solr config, Java, and/or the
 operating system that I have not thought of or stated here.

 Thanks,
 Shawn




 --


 Thanks & Regards,

 Mukesh Jha me.mukesh@gmail.com




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Re: Solr Cluster management having too many cores

2014-04-28 Thread Mukesh Jha
Thanks Erik,

Sounds about right.

BTW, how long can I keep adding collections, i.e. can I keep 5/10 years of
data like this?

Also, what do you think of bullet 2), having collection-specific
configurations in ZooKeeper?


On Fri, Apr 25, 2014 at 11:44 PM, Erick Erickson erickerick...@gmail.com wrote:

 So you're talking about 700 or so collections. That should be doable,
 especially as Solr is rapidly evolving to handle more and more
 collections, and there are two years for that to happen.

 The aging-out bit is manual (well, you'd script it, I suppose). So
 every day there'd be a script that ran and just knew the right
 collection to change the alias to; there's nothing automatic yet.
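A minimal sketch of what such a daily script's core logic might look like, using the Collections API CREATEALIAS action. The host, alias, retention window, and naming scheme are hypothetical placeholders (the real collections in this thread also include the hour in their names):

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def daily_collections(today: date, days: int) -> list:
    """Collection names for the last `days` days, newest first (hypothetical
    naming scheme for illustration)."""
    return ["sample_collection_" + (today - timedelta(d)).strftime("%Y_%m_%d")
            for d in range(days)]

def createalias_url(solr_base: str, alias: str, collections: list) -> str:
    """Collections API CREATEALIAS call pointing `alias` at the window."""
    params = urlencode({"action": "CREATEALIAS", "name": alias,
                        "collections": ",".join(collections)})
    return solr_base + "/admin/collections?" + params

window = daily_collections(date(2014, 4, 25), 3)
print(createalias_url("http://localhost:8983/solr", "sample_collection", window))
# Collections that fall out of the window can then be dropped with a
# separate action=DELETE call.
```

The script would run from cron once a day: compute the new window, repoint the alias, then delete whatever aged out.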

 Best,
 Erick

 On Fri, Apr 25, 2014 at 9:37 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:
  Thanks for quick reply Erik,
 
  I want to keep my collections till I run out of hardware, which is at
 least
  a couple of years worth data.
  I'd like to know more on ageing out aliases, did a quick search but
 didn't
  find much.
 
 
  On Fri, Apr 25, 2014 at 9:45 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Hmmm, tell us a little more about your use-case. In particular, how
  long do you need to keep the data around? Days? Months? Years?
 
  Because if you only need to keep the data for a specified period, you
  can use the collection aliasing process to age-out collections and
  keep the number of cores from growing too large.
 
  Best,
  Erick
 
  On Fri, Apr 25, 2014 at 6:49 AM, Mukesh Jha me.mukesh@gmail.com
  wrote:
   Hi Experts,
  
   I need to divide my indexes based on hour/day with each index having
  ~50-80
   GB data  ~50-80 mill docs, so I'm planning to create daily collection
  with
   names e.g. sample_collection__mm_dd_hh.
   I'll also create an alias *sample_collection* and update it whenever I
  will
   create a new collection so that the entire data set is searchable.
  
   I've a couple of question on the above design
   1) How far can it scale? As my collections will increase (so will the
   shards  replicas) do we have a breaking point when adding
 more/searching
   will become an issue?
   2) As my cluster will grow because of huge number of collections the
   clusterstate.json file present in zookeeper will grow too, won't this
 be
  a
   limiting factor? If so instead of storing all this info in one
   clusterstate.json file shouldn't Solr save cluster specific details in
  this
   file  have collection specific config files present on zookeeper?
   3) How can I easily manage all these collections? Do we have Java
  Coreadmin
   API's available. I cannot find much documented on it.
  
   --
   Txz,
  
   *Mukesh Jha me.mukesh@gmail.com*
 
 
 
 
  --
 
 
  Thanks  Regards,
 
  *Mukesh Jha me.mukesh@gmail.com*




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Solr Cluster management having too many cores

2014-04-25 Thread Mukesh Jha
Hi Experts,

I need to divide my indexes by hour/day, with each index holding ~50-80
GB of data & ~50-80 million docs, so I'm planning to create a daily
collection with names like sample_collection__mm_dd_hh.
I'll also create an alias, sample_collection, and update it whenever I
create a new collection, so that the entire data set stays searchable.

I have a couple of questions on the above design:
1) How far can it scale? As my collections increase (and with them the
shards & replicas), is there a breaking point where adding more, or
searching, becomes an issue?
2) As my cluster grows because of the huge number of collections, the
clusterstate.json file in ZooKeeper will grow too; won't this be a limiting
factor? If so, instead of storing all this info in one clusterstate.json
file, shouldn't Solr keep only cluster-wide details in this file & have
collection-specific config files in ZooKeeper?
3) How can I easily manage all these collections? Are Java CoreAdmin APIs
available? I cannot find much documentation on them.
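As a sketch of what the daily creation step of this design might look like against the Collections API CREATE action; the host, configset name, and shard/replica counts here are hypothetical placeholders:

```python
from urllib.parse import urlencode

def create_collection_url(solr_base: str, name: str, config_name: str,
                          num_shards: int, replication_factor: int) -> str:
    """Build a Collections API CREATE request for one day's collection."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "collection.configName": config_name,  # reuse one configset in ZK
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return solr_base + "/admin/collections?" + params

print(create_collection_url("http://localhost:8983/solr",
                            "sample_collection_2014_04_25",
                            "sample_config", 3, 2))
```

Pointing every daily collection at the same `collection.configName` keeps the configuration in ZooKeeper in one place, whatever each core then loads into its own heap.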

-- 
Txz,

Mukesh Jha me.mukesh@gmail.com


Re: Solr Cluster management having too many cores

2014-04-25 Thread Mukesh Jha
Thanks for the quick reply, Erik.

I want to keep my collections until I run out of hardware, which is at least
a couple of years' worth of data.
I'd like to know more about ageing out aliases; I did a quick search but
didn't find much.


On Fri, Apr 25, 2014 at 9:45 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, tell us a little more about your use-case. In particular, how
 long do you need to keep the data around? Days? Months? Years?

 Because if you only need to keep the data for a specified period, you
 can use the collection aliasing process to age-out collections and
 keep the number of cores from growing too large.

 Best,
 Erick

 On Fri, Apr 25, 2014 at 6:49 AM, Mukesh Jha me.mukesh@gmail.com
 wrote:
  Hi Experts,
 
  I need to divide my indexes based on hour/day with each index having
 ~50-80
  GB data  ~50-80 mill docs, so I'm planning to create daily collection
 with
  names e.g. sample_collection__mm_dd_hh.
  I'll also create an alias *sample_collection* and update it whenever I
 will
  create a new collection so that the entire data set is searchable.
 
  I've a couple of question on the above design
  1) How far can it scale? As my collections will increase (so will the
  shards  replicas) do we have a breaking point when adding more/searching
  will become an issue?
  2) As my cluster will grow because of huge number of collections the
  clusterstate.json file present in zookeeper will grow too, won't this be
 a
  limiting factor? If so instead of storing all this info in one
  clusterstate.json file shouldn't Solr save cluster specific details in
 this
  file  have collection specific config files present on zookeeper?
  3) How can I easily manage all these collections? Do we have Java
 Coreadmin
  API's available. I cannot find much documented on it.
 
  --
  Txz,
 
  *Mukesh Jha me.mukesh@gmail.com*




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
Hi Gurus,

In my Solr cluster I have multiple shards, each containing ~500,000,000
documents, with a total index size of ~1 TB.

I was just wondering how much more I can keep adding to a shard before we
reach a tipping point and performance starts to degrade?

Also, as a best practice, what is the recommended number of docs / size per
shard?

Txz in advance :)

-- 
Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Re: Tipping point of solr shards (Num of docs / size)

2014-04-15 Thread Mukesh Jha
My index size per shard varies b/w 250 GB and 1 TB.
The cluster is performing well even now, but I thought it's high time to
change this so that no shard gets too big.


On Wed, Apr 16, 2014 at 10:25 AM, Vinay Pothnis poth...@gmail.com wrote:

 You could look at this link to understand the factors that affect
 SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems

 Especially the sections about RAM and disk cache. If the index grows too
 big for one node, it can lead to performance issues. From the looks of it,
 500 mil docs per shard may already be pushing it. How much does that
 translate to in terms of index size on disk per shard?

 -vinay


 On 15 April 2014 21:44, Mukesh Jha me.mukesh@gmail.com wrote:

  Hi Gurus,
 
  In my solr cluster I've multiple shards and each shard containing
  ~500,000,000 documents total index size being ~1 TB.
 
  I was just wondering how much more can I keep on adding to the shard
 before
  we reach a tipping point and the performance starts to degrade?
 
  Also as best practice what is the recommended no of docs / size of shards.
 
  Txz in advance :)
 
  --
  Thanks  Regards,
 
  *Mukesh Jha me.mukesh@gmail.com*
 




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Custom routing based on date field

2014-04-14 Thread Mukesh Jha
Hello Experts,

I want to index my documents in such a way that all documents for a day are
stored in a single shard.

I am planning to have a shard for each day, e.g. shard1_01_01_2010,
shard1_02_01_2010, ... and while hashing, the documents of 01/01/2010 should
go to shard1_01_01_2010.

This way I can query a specific shard for my documents of a given date, and
I can simply delete the shards for dates older than some cutoff date.

For this I tried using date!docId as my hashing param, but it calculates the
hash on the date field and assigns a shard (based on which shard is assigned
for that hash), which is not what I desire.

Is this possible with the current SolrCloud?
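A sketch of the alternative this question points toward: with the default compositeId router, a `date!docId` id only biases which hash range the document lands in; it cannot pin a calendar day to a named shard. The implicit router can, since the collection is created with `router.name=implicit` and explicitly named shards, and the indexer targets a shard via the `_route_` request parameter. The host, collection, and shard naming below are hypothetical placeholders:

```python
from urllib.parse import urlencode

def update_url(solr_base: str, collection: str, day: str) -> str:
    """Update request routed (via _route_) to a hypothetical per-day shard,
    e.g. day="01_01_2010" targets shard1_01_01_2010."""
    params = urlencode({"_route_": "shard1_" + day})
    return solr_base + "/" + collection + "/update?" + params

print(update_url("http://localhost:8983/solr", "sample_collection",
                 "01_01_2010"))
```

Queries can likewise pass `shards=shard1_01_01_2010` (or `_route_`) to search only a given day's shard.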

-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com


Re: Custom routing based on date field

2014-04-14 Thread Mukesh Jha
Aliases are meant for read operations and can refer to one or more real
collections.

So should I go with the approach of creating a collection per day's data
and aliasing all these collections under a single name?

And instead of trying to route documents to a shard, should I send them to a
specific collection?

The problem I'm facing is that even when routing documents using date!id, one
shard contains docs from other date ranges too.

On Mon, Apr 14, 2014 at 4:53 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Would collection aliasing be a relevant feature here (a different
 approach):
 http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Mon, Apr 14, 2014 at 6:05 PM, Mukesh Jha me.mukesh@gmail.com
 wrote:
  Hello Experts,
 
  I want to index my documents in a way that all documents for a day are
  stored in a single shard.
 
  I am planning to have shards for each day e.g. shard1_01_01_2010,
  shard1_02_01_2010 ...
  And while hashing the documents of 01/01/2010 should go to
  shard1_01_01_2010.
 
   This way I can query a specific shard for my documents of a given date,
   also I can just delete the shards for dates older than some date.
  
   For this I tried using date!docId as my hashing param but it calculates
   the hash on the date field and assigns a shard (based on which shard is
   assigned for that hash), which is not what I desire to have.
 
  Is this possible using the current solr-cloud?
 
  --
 
 
  Thanks  Regards,
 
  *Mukesh Jha me.mukesh@gmail.com*




-- 


Thanks & Regards,

Mukesh Jha me.mukesh@gmail.com