RE: Solr hardware memory question
Thanks for this - I haven't any previous experience with utilising SSDs in the way you suggest, so I guess I need to start learning! And thanks for the Danish-webscale URL, looks like very informed reading. (Yes, I think we're working in similar industries with similar constraints and expectations.)

Compiling my answers into one email:

"Curious how many documents per shard you were planning? The number of documents per shard and field type will drive the amount of RAM needed to sort and facet."

- Number of documents per shard: I think about 200 million. That's a bit of a rough estimate based on other Solrs we run, though. Which I think means we hold a lot of data for each document, though I keep arguing to keep this to the truly required minimum. We also have many facets, some of which are pretty large (I'm stretching my understanding here, but I think most documents have many 'entries' in many facets, so these really hit us performance-wise.)

I try to keep a 1-to-1 ratio of Solr nodes to CPUs with a few spare for the operating system. I utilise MMapDirectory to manage memory via the OS. So at this moment I'm guessing that we'll have 56 Solr-dedicated CPUs across 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This would give 28 shards, each with 5GB Java memory (in Tomcat), leaving 126GB on each server for the OS and MMap. (I believe the Solr theory for this doesn't accurately work out, but we can accept the edge cases where this will fail.)

I can also see that our hardware requirements will depend on usage as well as the volume of data, and I've been pondering how best we can structure our index/es to facilitate a long-term service (which means that, given it's a lot of data, I need to structure the data so that new usage doesn't require re-indexing). But at this early stage, as people say, we need to prototype, test, profile etc.
and to do that I need the hardware to run the trials (policy dictates that I buy the production hardware now, before profiling - I get to control much of the design and construction, so I don't argue with this!)

Thanks for all the comments everyone, all very much appreciated :)

Gil

-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: 11 December 2013 12:02
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote: We're probably going to be building a Solr service to handle a dataset of ~60TB, which for our data and schema typically gives a Solr index size of 1/10th - i.e., 6TB. Given there's a general rule that the amount of hardware memory required should exceed the size of the Solr index (exceed, to also allow for the operating system etc.), how have people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to compensate for slow spinning drives with excessive amounts of RAM. Our plans for an estimated 20TB of indexes out of 372TB of raw web data are to use SSDs controlled by a single machine with 512GB of RAM (or was it 256GB? I'll have to ask the hardware guys): https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

As always YMMV, and the numbers you quote elsewhere indicate that your queries are quite complex. You might want to do a bit of profiling to see if they are heavy enough to make the CPU the bottleneck.

Regards, Toke Eskildsen, State and University Library, Denmark
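For reference, the MMapDirectory approach discussed in this thread is Lucene's default on 64-bit Linux, but it can also be pinned explicitly in each shard's solrconfig.xml. A minimal sketch (the factory class is the stock Solr 4.x one; where the file lives depends on your layout):

```xml
<!-- Pin the directory implementation to memory-mapped files so the OS
     page cache, not the Java heap, holds the hot parts of the index.
     This is why the JVM heap can stay small (~5GB per shard) while the
     remaining RAM is left to the operating system. -->
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
```

Keeping the heap small and letting the OS cache the index is exactly the split described above: 5GB per shard in Tomcat, and the rest of each server's RAM for the OS and MMap.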
Solr hardware memory question
We're probably going to be building a Solr service to handle a dataset of ~60TB, which for our data and schema typically gives a Solr index size of 1/10th - i.e., 6TB. Given there's a general rule that the amount of hardware memory required should exceed the size of the Solr index (exceed, to also allow for the operating system etc.), how have people handled this situation? Do I really need, for example, 12 servers with 512GB RAM, or are there other techniques for handling this?

Many thanks in advance for any general/conceptual/specific ideas/comments/answers!

Gil

Gil Hoggarth
Web Archiving Technical Services Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
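The rule of thumb in the question can be made concrete with some back-of-envelope arithmetic; the numbers here are the ones quoted in the message above, and the 12-server figure falls straight out of them:

```python
import math

# ~60TB of raw data, indexing to ~1/10th of that size.
raw_gb = 60 * 1024
index_gb = raw_gb // 10            # ~6TB of Solr index

# If hardware memory should cover the whole index, and each server
# carries 512GB of RAM:
ram_per_server_gb = 512
servers = math.ceil(index_gb / ram_per_server_gb)
print(servers)                     # -> 12, the figure quoted above
```

This is the "exceed the index size" ceiling; as the replies below point out, a low query volume or SSD-backed storage lets you get away with far less.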
RE: Solr hardware memory question
Thanks Shawn. You're absolutely right about the performance balance, though it's good to hear it from an experienced source (if you don't mind me calling you that!) Fortunately we don't have a top performance requirement, and we have a small audience, so a low query volume.

On similar systems we're managing to just provide a Solr service with a 3TB index size on 160GB RAM, though we have scripts to handle the occasionally necessary service restart when someone submits a more exotic query. This, btw, gives a response time of ~45-90 seconds for uncached queries. My question I suppose comes from my hope that we can do away with the restart scripts, as I doubt they help the Solr service (they can, if necessary, just kill processes and restart), and get to response times below 20 seconds.

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: 10 December 2013 17:37
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On 12/10/2013 9:51 AM, Hoggarth, Gil wrote: We're probably going to be building a Solr service to handle a dataset of ~60TB, which for our data and schema typically gives a Solr index size of 1/10th - i.e., 6TB. Given there's a general rule that the amount of hardware memory required should exceed the size of the Solr index (exceed, to also allow for the operating system etc.), how have people handled this situation? Do I really need, for example, 12 servers with 512GB RAM, or are there other techniques for handling this?

That really depends on what kind of query volume you'll have and what kind of performance you want. If your query volume is low and you can deal with slow individual queries, then you won't need that much memory. If either of those requirements increases, you'd probably need more memory, up to the 6TB total - or 12TB if you need to double the total index size for redundancy purposes. If your index is constantly growing like most are, you need to plan for that too.
Putting the entire index into RAM is required for *top* performance, but not for base functionality. It might be possible to put only a fraction of your index into RAM. Only testing can determine what you really need to obtain the performance you're after.

Perhaps you've already done this, but you should try as much as possible to reduce your index size. Store as few fields as possible - only just enough to build a search result list/grid and retrieve the full document from the canonical data store. Save termVectors and docValues on as few fields as possible. If you can, reduce the number of terms produced by your analysis chains.

Thanks, Shawn
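Shawn's size-reduction advice maps directly onto schema.xml field attributes. A hypothetical sketch (the field names and types here are illustrative, not taken from Gil's actual schema):

```xml
<!-- Store only what's needed to build a result list; fetch the full
     record from the canonical data store by id. -->
<field name="id"      type="string"       indexed="true" stored="true"/>
<field name="url"     type="string"       indexed="true" stored="true"/>

<!-- Searchable body text: indexed but not stored, no term vectors. -->
<field name="content" type="text_general" indexed="true" stored="false"
       termVectors="false"/>

<!-- Facet/sort field: enable docValues only where faceting or sorting
     actually needs it. -->
<field name="domain"  type="string"       indexed="true" stored="false"
       docValues="true"/>
```

Every `stored`, `termVectors`, or `docValues` flag you can turn off shrinks the on-disk index, and with it the amount of RAM the OS needs to cache it well.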
RE: How to work with remote solr safely?
We solved this issue outside of Solr. As you've done, restrict the server to localhost-only access to Solr, add firewall rules to allow your developers on port 80, and ProxyPass the allowed port 80 traffic to Solr. Remember to include the ProxyPassReverse too. (This runs on Linux and Apache httpd, btw.)

-Original Message-
From: Stavros Delisavas [mailto:stav...@delisavas.de]
Sent: 22 November 2013 14:24
To: solr-user@lucene.apache.org
Subject: How to work with remote solr safely?

Hello Solr-Friends, I have a question about working with Solr installed on a remote server. I have a PHP project with a very big MySQL database of about 10GB, and I am also using Solr, with about 10,000,000 entries indexed for fast search and access of the MySQL data. I have a local copy myself so I can continue to work on the PHP project itself, but I want to make it available for more developers too.

How can I make Solr accessible ONLY for those exclusive developers? For MySQL it's no problem to add an additional MySQL user with limited access, but for Solr it seems difficult to me. I have had my administrator restrict the Java port 8080 to localhost only. That way no one outside can access Solr or the Solr admin interface. How can I allow access to other developers without making the whole Solr interface (port 8080) available to the public?

Thanks, Stavros
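A sketch of the httpd side of the setup described above, assuming Apache 2.2-style access control (as shipped with RHEL at the time) with mod_proxy loaded; the address range is illustrative:

```apache
# Solr itself listens on localhost:8080 only; httpd fronts it on port 80
# and decides who may reach it.
<Location /solr>
    Order deny,allow
    Deny from all
    Allow from 192.168.0.0/24      # the developers' network (example)

    ProxyPass        http://localhost:8080/solr
    ProxyPassReverse http://localhost:8080/solr
</Location>
```

ProxyPassReverse rewrites the Location headers in Solr's redirects so clients never see the backend address - which is the "remember to include" point above.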
RE: How to work with remote solr safely?
You could also use one of the proxy scripts, such as http://code.google.com/p/solr-php-client/, which is coincidentally linked (eventually) from Michael's suggested SolrSecurity URL.

-Original Message-
From: michael.boom [mailto:my_sky...@yahoo.com]
Sent: 22 November 2013 14:53
To: solr-user@lucene.apache.org
Subject: Re: How to work with remote solr safely?

http://wiki.apache.org/solr/SolrSecurity#Path_Based_Authentication

Maybe you could achieve write/read access limitation by setting up path-based authentication: the update handler /solr/core/update should be protected by authentication, with credentials only known to you. But then, of course, your indexing client will need to authenticate in order to add docs to Solr. Your select handler /solr/core/select could then be open, or protected by HTTP auth with credentials open to developers.

That's the first idea that comes to mind - I haven't tested it. If you do, give feedback and let us know how it went.

-
Thanks, Michael

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-work-with-remote-solr-savely-tp4102612p4102618.html
Sent from the Solr - User mailing list archive at Nabble.com.
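Michael's path-based idea would look roughly like this in the Solr webapp's web.xml, if the servlet container (Tomcat here) handles the authentication. This is an untested sketch in the same spirit as his suggestion: realm and user setup are not shown, and the core name in the URL pattern is illustrative:

```xml
<!-- Protect the update handler; only the 'indexer' role may write. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr updates</web-resource-name>
    <url-pattern>/core/update/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>indexer</role-name>
  </auth-constraint>
</security-constraint>

<!-- /core/select is left unlisted, so it stays open; add a second
     security-constraint with a developer role to protect it too. -->
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>
<security-role>
  <role-name>indexer</role-name>
</security-role>
```

Note that BASIC auth sends credentials in the clear unless the connection is fronted by HTTPS, so combining this with the proxy approach above is sensible.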
RE: Why do people want to deploy to Tomcat?
For me, a side effect of 'example' is that it's just that - not appropriate for production. But also, there's the organisation factor beyond Solr that is about staff expertise - we don't have any systems that utilise Jetty, so we're unfamiliar with its configuration, issues, or oddities. Tomcat is our de facto container, so it makes sense for us to implement Solr within Tomcat. If we ruled out these reasons, I'd still be looking for a container that:

- was a standalone installation (i.e., outside of the Solr tarball) so that it would be managed via yum (we run on RHEL). This separates any issues of Solr from issues of Jetty, which given a current lack of Jetty knowledge would be a helpful thing.
- could be managed as a service via standard SysV startup processes. To be fair, I've implemented our own for Tomcat and could do this for Jetty, but I'd prefer Jetty included this (which would suggest it is more prepared for enterprise use).
- likewise, I assume all of Jetty's configuration can be reset to use normal RHEL /etc/ and /var/ directories, but I'd prefer that Jetty did this for me (to demonstrate again its enterprise-ready status).

Yes, I could do all the necessary bespoke configuration so that Jetty meets the above requirements, but because I'd have to, I question whether it's ready for our enterprise setup (which mainly means that our Operations team will fight against unusual configurations). Having said all of this, I have to admit that I like the idea of using Jetty because you guys tell me that Solr is effectively pre-configured for Jetty. But then I'd want to know what in particular these Jetty configurations were!

BTW, very pleased that this is being discussed - the views can help me argue our case to use Jetty if it is indeed more beneficial to do so.

Gil

-Original Message-
From: Sebastián Ramírez [mailto:sebastian.rami...@senseta.com]
Sent: 12 November 2013 13:38
To: solr-user@lucene.apache.org
Subject: Re: Why do people want to deploy to Tomcat?
I agree with Doug. When I started, I had to spend some time figuring out what was just an example and what I would have to change in a production environment... until I found that the whole example was ready for production. Of course, you commonly have to change the settings, parameters, fields, etc. of your Solr system, but the example doesn't have anything that is not for production.

Sebastián Ramírez
http://www.senseta.com/

On Tue, Nov 12, 2013 at 8:18 AM, Amit Aggarwal amit.aggarwa...@gmail.com wrote: Agreed with Doug

On 12-Nov-2013 6:46 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote: As an aside, I think one reason people feel compelled to deviate from the distributed Jetty distribution is because the folder is named example. I've had to explain to a few clients that this is a bit of a misnomer. The IT dept especially sees example and feels uncomfortable using that as a starting point for a Jetty install. I wish it was called default or bin or something where it's more obviously the default Jetty distribution of Solr.

On Tue, Nov 12, 2013 at 7:06 AM, Roland Everaert reveatw...@gmail.com wrote: In my case, the first time I had to deploy and configure Solr on Tomcat (and JBoss) it was a requirement to reuse as much as possible the application/web server already in place. The next deployment I also used Tomcat, because I was used to deploying on Tomcat and I don't know Jetty at all. I could ask the same question with regard to Jetty: why use/bundle (if not recommend) Jetty with Solr over other webserver solutions? Regards, Roland Everaert.

On Tue, Nov 12, 2013 at 12:33 PM, Alvaro Cabrerizo topor...@gmail.com wrote: In my case, the selection of the servlet container has never been a hard requirement. I mean, some customers provide us a virtual machine configured with java/tomcat, others have a Tomcat installed and want to share it with Solr, others prefer Jetty because their sysadmins are used to configuring it...
At least in the projects I've been working on, the selection of the servlet engine has not been a key factor in the project's success. Regards.

On Tue, Nov 12, 2013 at 12:11 PM, Andre Bois-Crettez andre.b...@kelkoo.com wrote: We are using Solr running on Tomcat. I think the top reasons for us are:
- we already have Nagios monitoring plugins for Tomcat that trace queries ok/error, HTTP codes/response time etc. in access logs, number of threads, JVM memory usage etc.
- start, stop, watchdogs, logs: we also use our standard tools for that
- what about security filters? Is that possible with Jetty?

André

On 11/12/2013 04:54 AM, Alexandre Rafalovitch wrote: Hello,
How to cancel a collection 'optimize'?
We have an internal Solr collection with ~1 billion documents. It's split across 24 shards and uses ~3.2TB of disk space. Unfortunately we've triggered an 'optimize' on the collection (via a restarted browser tab), which has raised the disk usage to 4.6TB, with 130GB left on the disk volume. As I fully expect Solr to use up all of the disk space (the collection being more than 50% of the disk volume), how can I cancel this optimize?

And separately, if I were to reissue with maxSegments=(high number, e.g. 40), should I still expect the same disk usage? (I'm presuming so, as doesn't it need to gather the whole index to determine which docs should go into which segments?)

Solr 4.4 on RHEL 6.4, 160GB RAM, 5GB per shard.

(Great conference last week btw - so much to learn!)

Gil Hoggarth
Web Archiving Technical Services Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
Tel: 01937 546163
RE: How to cancel a collection 'optimize'?
Hi Otis, thanks for the response. I could stop the whole Solr service, as as yet there's no audience access to it, but might it be left in an incomplete state and thus try to complete the optimisation when the service is restarted? [Yes, we did speak in Dublin - you can see we need that monitoring service! Must set up the demo version, asap!]

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
Sent: 11 November 2013 16:02
To: solr-user@lucene.apache.org
Subject: Re: How to cancel a collection 'optimize'?

Hi Gil, (we spoke in Dublin, didn't we?) Short of stopping Solr, I have a feeling there isn't much you can do... hmm. Or, I wonder if you could somehow get a thread dump, get the PID of the thread (since I believe threads in Linux are run as processes), and then kill that thread... Feels scary and I'm not sure what this might do to the index, but maybe somebody else can jump in and comment on this approach or suggest a better one.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Mon, Nov 11, 2013 at 10:44 AM, Hoggarth, Gil gil.hogga...@bl.uk wrote: We have an internal Solr collection with ~1 billion documents. It's split across 24 shards and uses ~3.2TB of disk space. Unfortunately we've triggered an 'optimize' on the collection (via a restarted browser tab), which has raised the disk usage to 4.6TB, with 130GB left on the disk volume. As I fully expect Solr to use up all of the disk space as the collection is more than 50% of the disk volume, how can I cancel this optimize? And separately, if I were to reissue with maxSegments=(high number, e.g. 40), should I still expect the same disk usage? (I'm presuming so, as doesn't it need to gather the whole index to determine which docs should go into which segments?) Solr 4.4 on RHEL 6.4, 160GB RAM, 5GB per shard. (Great conference last week btw - so much to learn!)
Gil Hoggarth
Web Archiving Technical Services Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
Tel: 01937 546163
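On the maxSegments question raised in this thread: a small, hypothetical helper that builds the update-handler call makes the two variants explicit. The `optimize` and `maxSegments` parameters are Solr's real update-handler parameters; the helper function and host name are just a sketch:

```python
from urllib.parse import urlencode

def optimize_url(base, collection, max_segments=None):
    """Build a Solr update-handler URL that requests an optimize.

    With max_segments set, Solr merges down to at most that many
    segments rather than a single huge one, which typically needs much
    less transient disk space than a full optimize.
    """
    params = {"optimize": "true"}
    if max_segments is not None:
        params["maxSegments"] = str(max_segments)
    return "%s/%s/update?%s" % (base.rstrip("/"), collection, urlencode(params))

# Example against a hypothetical local node:
print(optimize_url("http://localhost:8983/solr", "ukdomain", max_segments=40))
# -> http://localhost:8983/solr/ukdomain/update?optimize=true&maxSegments=40
```

Building the URL deliberately, rather than re-sending it from a browser tab, also avoids the accidental re-trigger described at the start of the thread.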
RE: New shard leaders or existing shard replicas depends on zookeeper?
Absolutely, the scenario I'm seeing does _sound_ like I've not specified the number of shards, but I think I have - the evidence is:

- -DnumShards=24 defined within the /etc/sysconfig/solrnode* files
- -DnumShards=24 seen on each 'ps' line (two nodes listed here):

tomcat 26135 1 5 09:51 ? 00:00:22 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/ldwa01/conf -Dcollection.configName=ldwa01cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode1/ldwa01/data -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp org.apache.catalina.startup.Bootstrap start

tomcat 26225 1 5 09:51 ? 00:00:19 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/ldwa01/conf -Dcollection.configName=ldwa01cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode2/ldwa01/data -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp org.apache.catalina.startup.Bootstrap start

- The Solr node dashboard shows -DnumShards=24 in its list of Args for each node

And yet, the ldwa01 nodes are leader
and replica of shard 17, and there are no other shard leaders created. Plus, if I only change the ZK ensemble declarations in /etc/sysconfig/solrnode* to the different dev ZK servers, all 24 leaders are created before any replicas are added.

I can also mention, when I browse the Cloud view, I can see both the ldwa01 collection and the ukdomain collection listed, suggesting that this information comes from the ZKs - I assume this is as expected. Plus, the correct node addresses (e.g., 192.168.45.17:8984) are listed for ldwa01, but these addresses are also listed as 'Down' in the ukdomain collection (except for :8983, which only shows in the ldwa01 collection).

Any help very gratefully received. Gil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 23 October 2013 18:50
To: solr-user@lucene.apache.org
Subject: Re: New shard leaders or existing shard replicas depends on zookeeper?

My first impulse would be to ask how you created the collection. It sure _sounds_ like you didn't specify 24 shards and thus have only a single shard, one leader and 23 replicas.

bq: ...to point to the zookeeper ensemble also used for the ukdomain collection...

so my guess is that this ZK ensemble has the ldwa01 collection defined as having only one shard. I admit I pretty much skimmed your post though...

Best, Erick

On Wed, Oct 23, 2013 at 12:54 PM, Hoggarth, Gil gil.hogga...@bl.uk wrote: Hi solr-users, I'm seeing some confusing behaviour in Solr/zookeeper and hope you can shed some light on what's happening/how I can correct it. We have two physical servers running automated builds of RedHat 6.4 and Solr 4.4.0 that host two separate Solr services. The first server (called ld01) has 24 shards and hosts a collection called 'ukdomain'; the second server (ld02) also has 24 shards and hosts a different collection called 'ldwa01'.
It's evidently important to note that previously both of these physical servers provided the 'ukdomain' collection, but the 'ldwa01' server has been rebuilt for the new collection.

When I start the ldwa01 solr nodes with their zookeeper configuration (defined in /etc/sysconfig/solrnode* and with collection.configName as 'ldwa01cfg') pointing to the development zookeeper ensemble, all nodes initially become shard leaders and then replicas as I'd expect. But if I change the ldwa01 solr nodes to point to the zookeeper ensemble also used for the ukdomain collection, all ldwa01 solr nodes start on the same shard (that is, the first ldwa01 solr node becomes the shard leader, then every other solr node becomes a replica for this shard). The significant point here is no other ldwa01 shards gain leaders (or replicas).

The ukdomain collection uses a zookeeper collection.configName of 'ukdomaincfg', and prior to the creation of this ldwa01 service the collection.configName of 'ldwa01cfg' has never previously been used. So I'm
RE: New shard leaders or existing shard replicas depends on zookeeper?
I think my question is easier, because I think the problem below was caused by the very first startup of the 'ldwa01' collection/'ldwa01cfg' ZK config name not specifying the number of shards (and thus defaulting to 1). So, how can I change the number of shards for an existing collection/ZK collection name, especially when the ZK ensemble in question is the production version and supports other Solr collections that I do not want to interrupt? (Which I think means that I can't just delete the clusterstate.json and restart the ZKs, as this would also lose the other Solr collection information.)

Thanks in advance, Gil

-Original Message-
From: Hoggarth, Gil [mailto:gil.hogga...@bl.uk]
Sent: 24 October 2013 10:13
To: solr-user@lucene.apache.org
Subject: RE: New shard leaders or existing shard replicas depends on zookeeper?

Absolutely, the scenario I'm seeing does _sound_ like I've not specified the number of shards, but I think I have - the evidence is:

- -DnumShards=24 defined within the /etc/sysconfig/solrnode* files
- -DnumShards=24 seen on each 'ps' line (two nodes listed here):

tomcat 26135 1 5 09:51 ? 00:00:22 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/ldwa01/conf -Dcollection.configName=ldwa01cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode1/ldwa01/data -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp org.apache.catalina.startup.Bootstrap start

tomcat 26225 1 5 09:51 ? 00:00:19 /opt/java/bin/java
-Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/ldwa01/conf -Dcollection.configName=ldwa01cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode2/ldwa01/data -DzkHost=zk01.solr.wa.bl.uk:9983,zk02.solr.wa.bl.uk:9983,zk03.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp org.apache.catalina.startup.Bootstrap start

- The Solr node dashboard shows -DnumShards=24 in its list of Args for each node

And yet, the ldwa01 nodes are leader and replica of shard 17, and there are no other shard leaders created. Plus, if I only change the ZK ensemble declarations in /etc/sysconfig/solrnode* to the different dev ZK servers, all 24 leaders are created before any replicas are added.

I can also mention, when I browse the Cloud view, I can see both the ldwa01 collection and the ukdomain collection listed, suggesting that this information comes from the ZKs - I assume this is as expected. Plus, the correct node addresses (e.g., 192.168.45.17:8984) are listed for ldwa01, but these addresses are also listed as 'Down' in the ukdomain collection (except for :8983, which only shows in the ldwa01 collection).

Any help very gratefully received. Gil

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 23 October 2013 18:50
To: solr-user@lucene.apache.org
Subject: Re: New shard leaders or existing shard replicas depends on zookeeper?

My first impulse would be to ask how you created the collection.
It sure _sounds_ like you didn't specify 24 shards and thus have only a single shard, one leader and 23 replicas.

bq: ...to point to the zookeeper ensemble also used for the ukdomain collection...

so my guess is that this ZK ensemble has the ldwa01 collection defined as having only one shard. I admit I pretty much skimmed your post though...

Best, Erick

On Wed, Oct 23, 2013 at 12:54 PM, Hoggarth, Gil gil.hogga...@bl.uk wrote: Hi solr-users, I'm seeing some confusing behaviour in Solr/zookeeper and hope you can shed some light on what's happening/how I can correct it. We have two physical servers running automated builds of RedHat 6.4 and Solr 4.4.0 that host two separate Solr services. The first server (called ld01) has 24 shards and hosts a collection called 'ukdomain'; the second server (ld02) also has 24 shards and hosts a different collection called 'ldwa01'. It's evidently important to note that previously both of these physical servers provided the 'ukdomain' collection, but the 'ldwa01' server has been rebuilt for the new collection. When I start the ldwa01 solr
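In Solr 4.x the shard count of an existing collection can't simply be edited in place; the usual route for a mis-created collection is to delete it and recreate it through the Collections API with numShards stated explicitly, which leaves the other collections in the same ZK ensemble untouched. A hypothetical sketch of building those calls (the `action`, `name`, `numShards`, and `collection.configName` parameters are Solr's real Collections API ones; the helper and host are illustrative):

```python
from urllib.parse import urlencode

def collections_api_url(base, action, **params):
    """Build a Collections API call such as DELETE or CREATE."""
    query = {"action": action}
    query.update(params)
    return "%s/admin/collections?%s" % (base.rstrip("/"), urlencode(query))

# Drop the single-shard collection, then recreate it with 24 shards:
base = "http://localhost:8983/solr"   # hypothetical node address
delete = collections_api_url(base, "DELETE", name="ldwa01")
create = collections_api_url(
    base, "CREATE", name="ldwa01", numShards=24,
    **{"collection.configName": "ldwa01cfg"})
```

Creating via the Collections API makes the shard count explicit and repeatable, rather than relying on -DnumShards being present at the very first bootstrap startup.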
New shard leaders or existing shard replicas depends on zookeeper?
Hi solr-users, I'm seeing some confusing behaviour in Solr/zookeeper and hope you can shed some light on what's happening/how I can correct it.

We have two physical servers running automated builds of RedHat 6.4 and Solr 4.4.0 that host two separate Solr services. The first server (called ld01) has 24 shards and hosts a collection called 'ukdomain'; the second server (ld02) also has 24 shards and hosts a different collection called 'ldwa'. It's evidently important to note that previously both of these physical servers provided the 'ukdomain' collection, but the 'ldwa' server has been rebuilt for the new collection.

When I start the ldwa solr nodes with their zookeeper configuration (defined in /etc/sysconfig/solrnode* and with collection.configName as 'ldwacfg') pointing to the development zookeeper ensemble, all nodes initially become shard leaders and then replicas as I'd expect. But if I change the ldwa solr nodes to point to the zookeeper ensemble also used for the ukdomain collection, all ldwa solr nodes start on the same shard (that is, the first ldwa solr node becomes the shard leader, then every other solr node becomes a replica for this shard). The significant point here is no other ldwa shards gain leaders (or replicas).

The ukdomain collection uses a zookeeper collection.configName of 'ukdomaincfg', and prior to the creation of this ldwa service the collection.configName of 'ldwacfg' has never previously been used. So I'm confused why the ldwa service would differ when the only difference is which zookeeper ensemble is used (both zookeeper ensembles are automatedly built using version 3.4.5).

If anyone can explain why this is happening and how I can get the ldwa services to start correctly using the non-development zookeeper ensemble, I'd be very grateful! If more information or explanation is needed, just ask.

Thanks, Gil

Gil Hoggarth
Web Archiving Technical Services Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
Solr 4.3.0: Shard instances using incorrect data directory on machine boot
Hi all, I hope you can advise a solution to our incorrect data directory issue.

We have 2 physical servers using Solr 4.3.0, each with 24 separate tomcat instances (RedHat 6.4, java 1.7.0_10-b18, tomcat 7.0.34) with a solr shard in each. This configuration means that each shard has its own data directory declared. (Server OS, tomcat and solr, including shards, created via automated builds.) That is, for example:
- tomcat instance, /var/local/tomcat/solrshard3/, port 8985
- corresponding solr instance, /usr/local/solrshard3/, with /usr/local/solrshard3/collection1/conf/solrconfig.xml
- corresponding solr data directory, /var/local/solrshard3/collection1/data/

We process ~1.5 billion documents, which is why we use 48 shards (24 leaders, 24 replicas). These physical servers are rebooted regularly to fsck their drives. When rebooted, we always see several (~10-20) shards failing to start (the UI cloud view shows them as 'Down' or 'Recovering', though they never recover without intervention), though there is no pattern to which shards fail to start - we haven't recorded any that always or never fail. On inspection, the UI dashboard for these failed shards displays, for example:
- Host: Server1
- Instance: /usr/local/solrshard3/collection1
- Data: /var/local/solrshard6/collection1/data
- Index: /var/local/solrshard6/collection1/data/index

To fix such failed shards, I manually restart the shard leader and replicas, which fixes the issue. However, of course, I would like to know a permanent cure for this, not a remedy.

We use a separate zookeeper service, spread across 3 virtual machines within our private network of ~200 servers (physical and virtual). Network traffic is constant but relatively little across 1GB bandwidth. Any advice or suggestions greatly appreciated.

Gil

Gil Hoggarth
Web Archiving Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
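One belt-and-braces guard, given that each shard already has its own solrconfig.xml: pin dataDir in that file (optionally resolved from a per-instance system property), so a core can never come up pointing at another shard's data. A sketch for the solrshard3 instance described above, using Solr's standard property-with-default syntax:

```xml
<!-- In /usr/local/solrshard3/collection1/conf/solrconfig.xml: fall back
     to this shard's own directory unless solr.data.dir is explicitly set
     for the instance at startup. -->
<dataDir>${solr.data.dir:/var/local/solrshard3/collection1/data}</dataDir>
```

If all 24 instances on a host pass the same -Dsolr.data.dir at startup, a race or stale environment can cross-wire them; a hard-coded per-shard default like this keeps each core tied to its own data even when the property is missing or wrong.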
RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot
Thanks for your reply Daniel. The dataDir is set in each solrconfig.xml; each one has been checked to ensure it points to its corresponding location. The error we see is that on machine reboot not all of the shards start successfully, and if the failed shard was a leader, the replicas can't take its place (presumably because the leader's incorrect data directory is inconsistent with their own).

More detail that I can add is that the catalina.out log for failed shards reports:

May 15, 2013 5:56:02 PM org.apache.catalina.loader.WebappClassLoader checkThreadLocalMapForLeaks
SEVERE: The web application [/solr] created a ThreadLocal with key of type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value [org.apache.solr.schema.DateField$ThreadLocalDateFormat@524e13f6]) and a value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat] (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

This doesn't (to me) relate to the problem, but that doesn't necessarily mean it's unrelated. Plus, it's the only SEVERE reported, and it's only reported in the failed shards' catalina.out logs.
Checking the zookeeper logs, we're seeing:

2013-05-16 13:25:46,839 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my id = 1, error =
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
2013-05-16 13:25:46,841 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
2013-05-16 13:25:46,842 [myid:1] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
2013-05-16 13:25:46,843 [myid:1] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@688] - Send worker leaving thread

This is, I think, a separate issue, in that it happens immediately after I restart a zookeeper. (I.e., I see this in a log, restart that zookeeper, and immediately see a similar issue in one of the other two zookeeper logs.)

-Original Message-
From: Daniel Collins [mailto:danwcoll...@gmail.com]
Sent: 16 May 2013 13:28
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

What actual error do you see in Solr? Is there an exception and if so, can you post that?
As I understand it, dataDir is set from the solrconfig.xml file, so either your instances are picking up the wrong file, or you have some override which is incorrect? Where do you set solr.data.dir - in the environment when you start Solr, or in solrconfig?

On 16 May 2013 12:23, Hoggarth, Gil gil.hogga...@bl.uk wrote: [original message quoted in full above]
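For reference, the two places Daniel mentions would look roughly like this. The fragment below is an illustrative sketch only, using the paths from Gil's layout; the property name solr.data.dir and the fallback path are assumptions about how these instances might be configured, not a copy of their actual config:

```xml
<!-- solrconfig.xml for the shard served from /usr/local/solrshard3/ -->
<!-- If a JVM/environment override such as -Dsolr.data.dir=... is set at
     Tomcat startup, it wins; otherwise the hard-coded fallback is used. -->
<dataDir>${solr.data.dir:/var/local/solrshard3/collection1/data}</dataDir>
```

If every instance shares one solrconfig.xml template like this but the per-instance property is set wrongly (or inherited from the wrong Tomcat environment), that would produce exactly the cross-shard dataDir mix-up described.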
RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot
Thanks for your response Shawn, very much appreciated. Gil

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: 16 May 2013 15:59
To: solr-user@lucene.apache.org
Subject: RE: Solr 4.3.0: Shard instances using incorrect data directory on machine boot

> The dataDir is set in each solrconfig.xml; each one has been checked to ensure it points to its corresponding location. The error we see is that on machine reboot not all of the shards start successfully, and if the failed shard was a leader its replicas can't take its place (presumably because the leader's incorrect data directory is inconsistent with their own).

Although you can set the dataDir in solrconfig.xml, I would strongly recommend that you don't. If you are using the old-style solr.xml (which has cores and core tags) then set the dataDir in each core tag in solr.xml. This gets read and set before the core is created, so there's less chance of it getting scrambled. The solrconfig is read as part of core creation.

If you are using the new-style solr.xml (new with 4.3.0) then you'll need absolute dataDir paths, and they need to go in each core.properties file. Due to a bug, relative paths won't work as expected. I need to see if I can make sure the fix makes it into 4.3.1.

If moving dataDir out of solrconfig.xml fixes it, then we probably have a bug.

Your ZooKeeper problems might be helped by increasing zkClientTimeout.

Thanks,
Shawn
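To illustrate Shawn's suggestion, an old-style solr.xml carrying the dataDir on the core tag might look like the sketch below. This is an assumed example based on the layout described earlier in the thread (shard name, port, and the zkClientTimeout value are illustrative, not taken from the actual deployment):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Old-style solr.xml: dataDir is declared per core, so it is fixed
     before core creation rather than read later from solrconfig.xml. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" hostPort="8985"
         zkClientTimeout="30000"> <!-- raised timeout, per the ZooKeeper suggestion -->
    <core name="solrshard3"
          instanceDir="/usr/local/solrshard3/collection1"
          dataDir="/var/local/solrshard3/collection1/data"/>
  </cores>
</solr>
```

With this arrangement the corresponding `<dataDir>` element would be removed from each solrconfig.xml, so a core can no longer come up bound to another shard's data directory via a bad property override.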