Re: Large number of collections in SolrCloud
We have similar date- and language-based collections, and we also ran into the problem of a huge clusterstate.json file that took an eternity to load. In our case searches were language-specific, so we moved to multiple Solr clusters, each with a different ZK namespace per language. That is something you might look at.

On 27 Jul 2015 20:47, Olivier <olivau...@gmail.com> wrote:

> Hi,
>
> I have a SolrCloud cluster with 3 nodes: 3 shards per node and a replication factor of 3. There are around 1000 collections, and they all use the same ZooKeeper configuration. So when I create each collection, the configuration is pulled from ZK and the configuration files are stored in the JVM. I thought that if the configuration was the same for each collection, the impact on the JVM would be insignificant, because the configuration should be loaded only once. But that is not the case: for each collection created, the JVM size increases because the configuration is loaded again. Am I correct?
>
> With a small configuration folder there is no problem: the folder is less than 500 KB, so 1000 collections x 500 KB is a 500 MB impact on the JVM. But we manage a lot of languages with dictionaries, so our configuration folder is about 6 MB, and the JVM impact becomes very significant: it can be more than 6 GB (1000 x 6 MB).
>
> I would like feedback from people who also run a cluster with a large number of collections. Do I have to change some settings to handle this case better? What can I do to optimize this behaviour? For now we just increased the RAM to 16 GB per node, but we plan to increase the number of collections.
>
> Thanks,
> Olivier
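For reference, collections can share a single uploaded configset by name, so the config lives only once in ZooKeeper; Olivier's report is that Solr still loads a copy per collection into the JVM heap. A minimal dry-run sketch of shared-configset creation (the configset name, host, and collection names are illustrative, not from the thread):

```shell
#!/bin/sh
# Upload the 6 MB config ONCE (e.g. via "bin/solr zk upconfig"), then have every
# collection reference it by name with collection.configName.
SOLR="http://localhost:8983/solr"
CONF="shared_conf"                     # illustrative configset name

for lang in en fr de; do
  url="${SOLR}/admin/collections?action=CREATE&name=docs_${lang}&numShards=3&replicationFactor=3&collection.configName=${CONF}"
  echo "curl '${url}'"                 # dry run: print instead of executing
done
```

This keeps ZK tidy, but per the thread it does not by itself reduce the per-collection heap cost on the Solr side.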
Re: Solr Cluster management having too many cores
Hey Shawn, Erick,

I was wondering if there is a JIRA issue I can track for splitting the current clusterstate.json into collection-specific cluster state. I looked around a bit but couldn't find anything useful on that.

On Mon, Apr 28, 2014 at 7:43 AM, Shawn Heisey <s...@elyograg.org> wrote:

> On 4/28/2014 5:05 AM, Mukesh Jha wrote:
>> Thanks Erick. Sounds about right. BTW, how long can I keep adding collections, i.e. can I keep 5-10 years of data like this? Also, what do you think of bullet 2), having collection-specific configurations in ZooKeeper?
>
> Regarding bullet 2, there is work underway right now to create a separate cluster state within ZooKeeper for each collection. I do not know how far along that work is.
>
> There are no hard limits in SolrCloud at all; the things that will cause scalability issues are resource-related. You'll exceed the 1 MB default size limit on a ZooKeeper znode pretty quickly. If you're not using the example Jetty included with Solr, you'll exceed the default maxThreads on most servlet containers very quickly. You may run into problems with the default limits on Solr's HttpShardHandler.
>
> Running hundreds or thousands of cores efficiently will require lots of RAM, both for the OS disk cache and the Java heap. A large Java heap will require significant tuning of the Java garbage collection parameters.
>
> Most operating systems limit a user to 1024 open files and 1024 running processes (which includes threads); these limits will need to be increased. There may be other limits imposed by the Solr config, Java, and/or the operating system that I have not thought of here.
>
> Thanks,
> Shawn

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
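Shawn's point about the 1024 open-file and process limits is usually addressed in /etc/security/limits.conf (assuming Linux with PAM; the user name and values below are illustrative, not a recommendation from the thread):

```
# /etc/security/limits.conf -- raise limits for the user running Solr
solr  soft  nofile  65536
solr  hard  nofile  65536
solr  soft  nproc   65536
solr  hard  nproc   65536
```

The Solr user has to log in again (or the service be restarted) for the new limits to take effect; `ulimit -n` shows the current open-file limit for a session.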
Re: Solr Cluster management having too many cores
Looks like https://issues.apache.org/jira/browse/SOLR-5473 is the issue :)

On Fri, Aug 8, 2014 at 9:30 PM, Mukesh Jha <me.mukesh@gmail.com> wrote:

> Hey Shawn, Erick,
>
> I was wondering if there is a JIRA issue I can track for splitting the current clusterstate.json into collection-specific cluster state. I looked around a bit but couldn't find anything useful on that.

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
Re: Solr Cluster management having too many cores
Thanks Erick. Sounds about right.

BTW, how long can I keep adding collections, i.e. can I keep 5-10 years of data like this? Also, what do you think of bullet 2), having collection-specific configurations in ZooKeeper?

On Fri, Apr 25, 2014 at 11:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> So you're talking about 700 or so collections. That should be do-able, especially as Solr is rapidly evolving to handle more and more collections, and there are two years for that to happen.
>
> The aging-out bit is manual (well, you'd script it, I suppose). So every day a script would run that knows the right collection to change the alias on; there's nothing automatic yet.
>
> Best,
> Erick
>
> On Fri, Apr 25, 2014 at 9:37 AM, Mukesh Jha <me.mukesh@gmail.com> wrote:
>> Thanks for the quick reply, Erick. I want to keep my collections until I run out of hardware, which is at least a couple of years' worth of data. I'd like to know more about aging out aliases; I did a quick search but didn't find much.
>>
>> On Fri, Apr 25, 2014 at 9:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> Hmmm, tell us a little more about your use case. In particular, how long do you need to keep the data around? Days? Months? Years? If you only need to keep the data for a specified period, you can use the collection aliasing process to age out collections and keep the number of cores from growing too large.
>>>
>>> Best,
>>> Erick

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
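The daily script Erick alludes to can be sketched as a dry run: create today's collection, then repoint the read alias at the full member list. The host, shard counts, and the members file are assumptions, not from the thread:

```shell
#!/bin/sh
# Dry-run sketch of the daily job: create today's collection, then repoint the
# read alias at the full member list. CREATEALIAS replaces the alias atomically.
SOLR="http://localhost:8983/solr"
day=$(date +%Y_%m_%d)
new_coll="sample_collection_${day}"

create_url="${SOLR}/admin/collections?action=CREATE&name=${new_coll}&numShards=3&replicationFactor=2"

# Previous members are assumed to be tracked one-per-line in alias_members.txt.
members=$( { echo "${new_coll}"; cat alias_members.txt 2>/dev/null; } | paste -sd, - )
alias_url="${SOLR}/admin/collections?action=CREATEALIAS&name=sample_collection&collections=${members}"

echo "curl '${create_url}'"   # dry run: print the calls instead of executing
echo "curl '${alias_url}'"
```

Dropping old collections is the mirror image: remove the name from the members file, re-issue CREATEALIAS, then DELETE the collection.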
Solr Cluster management having too many cores
Hi Experts,

I need to divide my indexes by hour/day, with each index having ~50-80 GB of data (~50-80 million docs), so I'm planning to create daily collections with names like sample_collection_yyyy_mm_dd_hh. I'll also create an alias, sample_collection, and update it whenever I create a new collection so that the entire data set stays searchable.

I have a couple of questions on the above design:

1) How far can it scale? As my collections increase (and with them the shard replicas), is there a breaking point where adding more collections, or searching, becomes an issue?

2) As my cluster grows with this huge number of collections, the clusterstate.json file in ZooKeeper will grow too; won't that be a limiting factor? If so, instead of storing all this information in one clusterstate.json file, shouldn't Solr keep only cluster-wide details in that file and have collection-specific state files in ZooKeeper?

3) How can I easily manage all these collections? Are there Java CoreAdmin APIs available? I cannot find much documentation on them.

--
Txz,
Mukesh Jha <me.mukesh@gmail.com>
Re: Solr Cluster management having too many cores
Thanks for the quick reply, Erick.

I want to keep my collections until I run out of hardware, which is at least a couple of years' worth of data. I'd like to know more about aging out aliases; I did a quick search but didn't find much.

On Fri, Apr 25, 2014 at 9:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Hmmm, tell us a little more about your use case. In particular, how long do you need to keep the data around? Days? Months? Years? If you only need to keep the data for a specified period, you can use the collection aliasing process to age out collections and keep the number of cores from growing too large.
>
> Best,
> Erick

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
Tipping point of solr shards (Num of docs / size)
Hi Gurus,

In my Solr cluster I have multiple shards, each shard containing ~500,000,000 documents, with a total index size of ~1 TB. I was just wondering how much more I can keep adding to a shard before we reach a tipping point and performance starts to degrade. Also, as a best practice, what is the recommended number of docs / size per shard?

Txz in advance :)

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
Re: Tipping point of solr shards (Num of docs / size)
My index size per shard varies between 250 GB and 1 TB. The cluster is performing well even now, but I thought it's high time to change it so that no shard gets too big.

On Wed, Apr 16, 2014 at 10:25 AM, Vinay Pothnis <poth...@gmail.com> wrote:

> You could look at this link to understand the factors that affect SolrCloud performance: http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Especially the sections about RAM and disk cache. If the index grows too big for one node, it can lead to performance issues. From the looks of it, 500 million docs per shard may already be pushing it. How much does that translate to in terms of index size on disk per shard?
>
> -vinay

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
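Vinay's disk-cache point can be turned into a quick back-of-the-envelope check: compare the per-shard index size on disk with the RAM left over for the OS page cache after the JVM heap. The index figure below is the thread's; the shard count, node RAM, and heap are assumptions, not recommendations:

```shell
#!/bin/sh
# Back-of-the-envelope: a query-heavy shard wants most of its index in the
# OS page cache, i.e. in RAM that the JVM heap has not claimed.
index_gb=1000        # ~1 TB total index (from the thread)
shards=4             # assumed shard count
node_ram_gb=64       # assumed RAM per node
heap_gb=16           # assumed Solr heap per node

per_shard_gb=$((index_gb / shards))
cache_gb=$((node_ram_gb - heap_gb))
echo "per-shard index ${per_shard_gb} GB vs ~${cache_gb} GB page cache"
if [ "$per_shard_gb" -gt "$cache_gb" ]; then
  echo "index will not fit in cache; expect disk-bound query latency"
fi
```

With these numbers a 250 GB shard dwarfs the ~48 GB of cache, which matches the "already pushing it" assessment.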
Custom routing based on date field
Hello Experts,

I want to index my documents so that all documents for a day are stored in a single shard. I am planning to have a shard for each day, e.g. shard1_01_01_2010, shard1_02_01_2010, ... and, while hashing, the documents of 01/01/2010 should go to shard1_01_01_2010. This way I can query a specific shard for my documents of a given date, and I can also just delete the shards for dates older than some cutoff.

For this I tried using date!docId as my routing key, but it calculates the hash on the date field and assigns whichever shard owns that hash range, which is not what I want. Is this possible with current SolrCloud?

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
Re: Custom routing based on date field
Aliases are meant for read operations and can refer to one or more real collections. So should I go with the approach of creating a collection per day's data and aliasing one collection name to all of them? In other words, instead of trying to route documents to a shard, should I send them to a specific collection?

The problem I'm facing is that even when routing documents using date!id, one shard contains docs from other date ranges too.

On Mon, Apr 14, 2014 at 4:53 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Would collection aliasing be a relevant feature here (a different approach)? http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
>
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

--
Thanks & Regards,
Mukesh Jha <me.mukesh@gmail.com>
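What Mukesh observes is how compositeId routing works: the prefix before `!` is hashed to pick a hash range, so a date cannot be bound to a *named* shard, and a shard also owns whatever other prefixes hash into its range; docs sharing a prefix are merely co-located. A dry-run sketch of the mechanism (host and collection name are illustrative):

```shell
#!/bin/sh
# compositeId routing: "<prefix>!<id>" hashes the prefix to choose the shard,
# so every doc with the same date prefix lands together, but the shard can
# still hold docs whose different prefixes hash into the same range.
SOLR="http://localhost:8983/solr/sample_collection"

doc='[{"id": "2010-01-01!doc42"}]'
update_cmd="curl '${SOLR}/update?commit=true' -H 'Content-Type: application/json' -d '${doc}'"

# _route_ restricts the query to the shard(s) owning that prefix:
query_cmd="curl '${SOLR}/select?q=*:*&_route_=2010-01-01!'"

echo "$update_cmd"   # dry run: print the calls instead of executing
echo "$query_cmd"
```

A collection per day behind an alias, as discussed above, is the approach that gives exclusive per-date placement and cheap deletion of old data.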