RE: Read or Capture Solr Logs
Hello - "process the logs" means you have to build your own program that reads and processes the logs and does whatever you need it to. In a custom SearchComponent you can implement e.g. process() [1], read the query, and do something with it. [1]: http://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/handler/component/SearchComponent.html#process%28org.apache.solr.handler.component.ResponseBuilder%29 -Original message- From:Nitin Solanki nitinml...@gmail.com Sent: Tuesday 24th March 2015 11:55 To: solr-user@lucene.apache.org Subject: Re: Read or Capture Solr Logs Hi Markus, Can you please help me? How do I do that: by processing the logs, or by making a simple SearchComponent implementation that reads SolrQueryRequest? On Tue, Mar 24, 2015 at 4:17 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hello, you can either process the logs, or make a simple SearchComponent implementation that reads SolrQueryRequest. Markus -Original message- From:Nitin Solanki nitinml...@gmail.com Sent: Tuesday 24th March 2015 11:38 To: solr-user@lucene.apache.org Subject: Read or Capture Solr Logs Hello, I want to read or capture all the queries which are searched by users. Any help on this?
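A minimal sketch of such a component, with a hypothetical class name and a stdout sink (the SearchComponent/ResponseBuilder API is Solr's; everything else here is illustrative, not a tested implementation):

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical component that records the raw user query from each request.
public class QueryCaptureComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // Nothing to prepare for simple query capture.
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // rb.req is the SolrQueryRequest; read the q parameter and record it.
    String q = rb.req.getParams().get("q");
    if (q != null) {
      // Replace with whatever sink you need (file, database, another collection...).
      System.out.println("captured query: " + q);
    }
  }

  @Override
  public String getDescription() {
    return "Captures user queries";
  }

  @Override
  public String getSource() {
    return null;
  }
}

The component then has to be registered in solrconfig.xml with a searchComponent element and added to the components list of the request handler whose queries you want to capture.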
Re: TooManyBasicQueries?
Somehow a surround query is being constructed along the way. Search your logs for “surround” and see if someone is maybe sneaking a q={!surround}… in there. If you’re passing input directly through from your application to Solr’s q parameter without any sanitizing or filtering, it’s possible a surround query parser could be asked for. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erik - Sorry, I totally missed your reply. To the best of my knowledge, we are not using any surround queries (have to admit I had never heard of them until now). We use solr.SearchHandler for all of our queries. Does that answer the question? Cheers, Ian On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com wrote: It results from a surround query with too many terms. Says the javadoc: * Exception thrown when {@link BasicQueryFactory} would exceed the limit * of query clauses. I’m curious, are you issuing a large {!surround} query or is it expanding to hit that limit? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote: I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
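For reference, a surround query arrives through the normal q parameter using local-params syntax, so no separate endpoint is involved. An illustrative SolrJ sketch (the terms are made up; 3w is the surround parser's ordered-proximity operator):

import org.apache.solr.client.solrj.SolrQuery;

public class SurroundExample {
  public static void main(String[] args) {
    // "apache" within 3 ordered positions of "solr", handled by the surround parser.
    SolrQuery query = new SolrQuery("{!surround}3w(apache, solr)");
    System.out.println(query.getQuery());
  }
}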
Re: TooManyBasicQueries?
Hi Erik - Sorry, I totally missed your reply. To the best of my knowledge, we are not using any surround queries (have to admit I had never heard of them until now). We use solr.SearchHandler for all of our queries. Does that answer the question? Cheers, Ian On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com wrote: It results from a surround query with too many terms. Says the javadoc: * Exception thrown when {@link BasicQueryFactory} would exceed the limit * of query clauses. I’m curious, are you issuing a large {!surround} query or is it expanding to hit that limit? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote: I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
rough maximum cores (shards) per machine?
Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Issues to create new core
Dear Solr Community: I just began to work with Solr. I chose Solr 5.0, but when I try to create a new core with the GUI, it shows the following error: Error CREATEing SolrCore 'datos': Unable to create core [datos] Caused by: Can't find resource 'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My question is simple: how can I fix this problem? Thanks in advance for your consideration. Alejandro.
Solr replicas going in recovering state during heavy indexing
Hi We have a large SolrCloud cluster. We have observed that during heavy indexing, a large number of replicas go into recovering or down state. What could be the possible reason and/or fix for this issue? Gopal
Re: maxReplicasPerNode
Thanks Anshum, "About #3, in line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down." Just to make sure the answer applies to both these cases: 1. There are two replicas on node1 and node2. Solr won't add a replica to node1 when node2 goes down. 2. The collection was created with rf=2, Solr creates replicas on node1 and node2. If node2 goes down and a node3 comes up instead, will it be assigned a replica, or does Solr not do that either? In short, is there any scenario where Solr would auto-add replicas (aside from running on HDFS) to meet the 'rf' setting, or after the collection has been created, is ensuring RF is met my responsibility? Shai On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net wrote: Hi Shai, As of now, all replicas for a collection are created to meet the specified replication factor at the time of collection creation. There's no way to defer that until more nodes are up. Your best bet is to have the nodes already up before you CREATE the collection, or create the collection with a lower replication factor and then use ADDREPLICA. About auto-addition of replicas, that's kind of supported when using a shared file system (HDFS) to host the index. It doesn't truly work as per your use-case, i.e. it doesn't consider the intended replication factor but only brings up a Replica in case all replicas for a node are down, so that SolrCloud continues to be usable. It also doesn't auto-remove the replica when the old node comes back up. You can read more about this in the Automatically Add Replicas in SolrCloud section here: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS About #3, in line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down. On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote: Hi I saw that we can define maxShardsPerNode when creating a collection, but I don't see that I can set something similar for replicas. My scenario is the following: - I set up one Solr node - Create a collection with numShards=1 and replicationFactor=2 - Hopefully, one replica is created on that node - When I bring up the second Solr node, the second replica will be created What I see is that both replicas are created on the first node, and when I bring up the second Solr node, none of the replicas are moved. I know that I can move one replica by calling ADDREPLICA on node2, then DELETEREPLICA on node1, but I was wondering if there's an automated way to do that. I've also considered creating the collection with replicationFactor=1 and when the second node comes up it will look for shards w/ one replica only and assign itself as the replica. But it means I have to own that piece of logic, whereas if Solr already does that, that's better. Also, from what I understand, if I create a collection w/ rf=2 and there are two nodes, then each node is assigned a replica. If one of the nodes comes down, and a 3rd node comes up, it will be assigned a replica -- is that correct? Another related question: if there are two replicas on node1 and node2, and node2 goes down -- will node1 be assigned the second replica as well? If this is explained somewhere, I'd appreciate if you can give me a pointer. Shai -- Anshum Gupta
Read or Capture Solr Logs
Hello, I want to read or capture all the queries which are searched by users. Any help on this?
How to verify a document is indexed by all replicas
Hi Is there a recommended, preferably fast, way to check that a document is indexed by all replicas? I currently do that by issuing a search request to each replica, but was wondering if there's a faster way. Even better, is there a way to verify all replicas of a shard are up-to-date, e.g. by comparing their version or something? By up-to-date I mean that they've all processed the same update requests that came through. If there's a replica lagging behind, I'd like to wait for it to catch up, something like a checkpoint(), before I continue sending more updates. Shai
Set search query logs into Solr
Hello, I want to insert the searched queries into the Solr log to track users' input. I googled a lot but didn't find anything. Please help. Your help will be appreciated...
Re: _text
Hi Philippe, Are you using the default schemaFactory, in which case your setting in solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory"/>, or have you used your own schema.xml, in which case your setting in solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>? Regards, Edwin On 24 March 2015 at 17:40, phi...@free.fr wrote: Hello, my SOLR 5 Admin Panel displays the following error: 23/03/2015 15:05:05 ERROR SolrCore org.apache.solr.common.SolrException: undefined field: _text How should _text be defined in schema.xml? Many thanks. Philippe
Re: _text
Hi Zheng, I copied the SOLR 5 schema.xml file from GitHub, which contains the following line: <schemaFactory class="ClassicIndexSchemaFactory"/> ----- Original Mail ----- From: Zheng Lin Edwin Yeo edwinye...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday 24 March 2015 10:59:49 Subject: Re: _text Hi Philippe, Are you using the default schemaFactory, in which case your setting in solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory"/>, or have you used your own schema.xml, in which case your setting in solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>? Regards, Edwin On 24 March 2015 at 17:40, phi...@free.fr wrote: Hello, my SOLR 5 Admin Panel displays the following error: 23/03/2015 15:05:05 ERROR SolrCore org.apache.solr.common.SolrException: undefined field: _text How should _text be defined in schema.xml? Many thanks. Philippe
RE: Read or Capture Solr Logs
Hello, you can either process the logs, or make a simple SearchComponent implementation that reads SolrQueryRequest. Markus -Original message- From:Nitin Solanki nitinml...@gmail.com Sent: Tuesday 24th March 2015 11:38 To: solr-user@lucene.apache.org Subject: Read or Capture Solr Logs Hello, I want to read or capture all the queries which are searched by users. Any help on this?
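For the log-processing route, a rough sketch that scans solr.log for the q parameter of request lines (the log path and the regex are assumptions; adjust them to your actual log format):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryLogReader {
  public static void main(String[] args) throws IOException {
    // Matches the q parameter inside the params={...} block of a request log line.
    Pattern qParam = Pattern.compile("params=\\{.*?q=([^&}]+)");
    Files.lines(Paths.get("/var/solr/logs/solr.log")) // illustrative path
         .forEach(line -> {
           Matcher m = qParam.matcher(line);
           if (m.find()) {
             System.out.println("query: " + m.group(1));
           }
         });
  }
}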
Re: _text
Hi Philippe, That means you're using the physical schema.xml. You can check the file in your collection, under the conf folder. For mine, I don't have the _text field in my schema.xml. If you don't require it in your setup, you can try removing it and see if it's ok. Else you can use the schema.xml or the entire conf folder from the techproducts example located at {SOLR_HOME}\server\solr\configsets\sample_techproducts_configs\conf, which comes together with the Solr 5.0 package. On 24 March 2015 at 18:12, phi...@free.fr wrote: Hi Zheng, I copied the SOLR 5 schema.xml file from GitHub, which contains the following line: <schemaFactory class="ClassicIndexSchemaFactory"/> ----- Original Mail ----- From: Zheng Lin Edwin Yeo edwinye...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday 24 March 2015 10:59:49 Subject: Re: _text Hi Philippe, Are you using the default schemaFactory, in which case your setting in solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory"/>, or have you used your own schema.xml, in which case your setting in solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>? Regards, Edwin On 24 March 2015 at 17:40, phi...@free.fr wrote: Hello, my SOLR 5 Admin Panel displays the following error: 23/03/2015 15:05:05 ERROR SolrCore org.apache.solr.common.SolrException: undefined field: _text How should _text be defined in schema.xml? Many thanks. Philippe
Re: Read or Capture Solr Logs
Hi Markus, Can you please help me? How do I do that: by processing the logs, or by making a simple SearchComponent implementation that reads SolrQueryRequest? On Tue, Mar 24, 2015 at 4:25 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Markus, Can you please help me? How do I do that: by processing the logs, or by making a simple SearchComponent implementation that reads SolrQueryRequest? On Tue, Mar 24, 2015 at 4:17 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hello, you can either process the logs, or make a simple SearchComponent implementation that reads SolrQueryRequest. Markus -Original message- From:Nitin Solanki nitinml...@gmail.com Sent: Tuesday 24th March 2015 11:38 To: solr-user@lucene.apache.org Subject: Read or Capture Solr Logs Hello, I want to read or capture all the queries which are searched by users. Any help on this?
_text
Hello, my SOLR 5 Admin Panel displays the following error: 23/03/2015 15:05:05 ERROR SolrCore org.apache.solr.common.SolrException: undefined field: _text How should _text be defined in schema.xml? Many thanks. Philippe
Re: Custom TokenFilter
bq: ... 13 more Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr. This usually means you have jar files from different versions of Solr in your classpath. Best, Erick On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting up schema.xml and adding the jar path in solrconfig.xml, I start Solr and get this error message:
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
Someone can help? Thanks. Regards.
Custom TokenFilter
Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting up schema.xml and adding the jar path in solrconfig.xml, I start Solr and get this error message:
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
Someone can help? Thanks. Regards.
Re: rough maximum cores (shards) per machine?
Test Test: From Hossman's apache page: "When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult." Also, please format your stack trace for readability. On a quick glance, you probably have mis-matched jars in your classpath. On Tue, Mar 24, 2015 at 1:35 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting up schema.xml and adding the jar path in solrconfig.xml, I start Solr and get this error message:
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
Someone can help? Thanks. Regards. On Tuesday 24 March 2015 at 21:24, Jack Krupansky jack.krupan...@gmail.com wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. For me, it's a question of who has control over the config and schema and collection creation. 
Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system. Ever hear the old saying "Too many cooks spoil the stew"? -- Jack Krupansky On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: Don't confuse customers and tenants. Perhaps you could explain what you mean by multi-tenant in the context of Ian's setup? It is not clear to me what the distinction is in this case. - Toke Eskildsen
Re: maxReplicasPerNode
Yes, it applies to both. Solr wouldn't auto-add replicas in either of those cases (or any other case) to meet the rf specified at create time. On Tue, Mar 24, 2015 at 2:22 AM, Shai Erera ser...@gmail.com wrote: Thanks Anshum, "About #3, in line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down." Just to make sure the answer applies to both these cases: 1. There are two replicas on node1 and node2. Solr won't add a replica to node1 when node2 goes down. 2. The collection was created with rf=2, Solr creates replicas on node1 and node2. If node2 goes down and a node3 comes up instead, will it be assigned a replica, or does Solr not do that either? In short, is there any scenario where Solr would auto-add replicas (aside from running on HDFS) to meet the 'rf' setting, or after the collection has been created, is ensuring RF is met my responsibility? Shai On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net wrote: Hi Shai, As of now, all replicas for a collection are created to meet the specified replication factor at the time of collection creation. There's no way to defer that until more nodes are up. Your best bet is to have the nodes already up before you CREATE the collection, or create the collection with a lower replication factor and then use ADDREPLICA. About auto-addition of replicas, that's kind of supported when using a shared file system (HDFS) to host the index. It doesn't truly work as per your use-case, i.e. it doesn't consider the intended replication factor but only brings up a Replica in case all replicas for a node are down, so that SolrCloud continues to be usable. It also doesn't auto-remove the replica when the old node comes back up. You can read more about this in the Automatically Add Replicas in SolrCloud section here: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS About #3, in line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down. On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote: Hi I saw that we can define maxShardsPerNode when creating a collection, but I don't see that I can set something similar for replicas. My scenario is the following: - I set up one Solr node - Create a collection with numShards=1 and replicationFactor=2 - Hopefully, one replica is created on that node - When I bring up the second Solr node, the second replica will be created What I see is that both replicas are created on the first node, and when I bring up the second Solr node, none of the replicas are moved. I know that I can move one replica by calling ADDREPLICA on node2, then DELETEREPLICA on node1, but I was wondering if there's an automated way to do that. I've also considered creating the collection with replicationFactor=1 and when the second node comes up it will look for shards w/ one replica only and assign itself as the replica. But it means I have to own that piece of logic, whereas if Solr already does that, that's better. Also, from what I understand, if I create a collection w/ rf=2 and there are two nodes, then each node is assigned a replica. If one of the nodes comes down, and a 3rd node comes up, it will be assigned a replica -- is that correct? Another related question: if there are two replicas on node1 and node2, and node2 goes down -- will node1 be assigned the second replica as well? If this is explained somewhere, I'd appreciate if you can give me a pointer. 
Shai -- Anshum Gupta -- Anshum Gupta
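Since ensuring RF after creation is the user's responsibility, a SolrJ sketch of adding the replica yourself once a new node joins (collection, shard, and node values are made up; check the node parameter against your Solr version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AddReplicaExample {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
      CollectionAdminRequest.AddReplica addReplica = new CollectionAdminRequest.AddReplica();
      addReplica.setCollectionName("mycollection");
      addReplica.setShardName("shard1");
      addReplica.setNode("node3:8983_solr"); // place the new replica on the replacement node
      client.request(addReplica); // equivalent to the ADDREPLICA collections API call
    }
  }
}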
Re: document contained more than 100000 characters
On 3/23/2015 3:08 AM, Srinivas wrote: At present in my project we are using Apache Tika for reading metadata of the file. So whenever we handle large files (files containing more than 100000 characters), Tika generates the error "file contained more than 100000 characters". So is it possible to handle large files using Tika? Please let me know. This sounds like a Tika problem. This is a Solr mailing list. You may find some Tika expertise here, but this is the incorrect place for a question about Tika. Solr does use the Tika parser, in the contrib module for the ExtractingRequestHandler. I have never heard of such a limitation in the context of the ExtractingRequestHandler, and I've heard some people complain about OutOfMemory exceptions when they index 4 gigabyte PDF files with our extracting handler ... so I am guessing that you are using Tika in your own software. If that is correct, you'll need to ask your question on a Tika mailing list. Thanks, Shawn
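If Tika is indeed being used directly, the message matches the default 100,000-character write limit of Tika's BodyContentHandler; a sketch of lifting it (the file path is illustrative; a write limit of -1 disables the cap):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaNoLimit {
  public static void main(String[] args) throws Exception {
    // A write limit of -1 disables the default 100,000-character cap.
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    try (InputStream in = Files.newInputStream(Paths.get("/tmp/large-file.pdf"))) {
      new AutoDetectParser().parse(in, handler, metadata);
    }
    System.out.println("extracted " + handler.toString().length() + " chars");
  }
}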
Re: How to verify a document is indexed by all replicas
You can always issue a *:* query, but it'd have to be at least your autoSoftCommit interval ago since the soft commit trigger will have slightly different wall clock times. But it shouldn't be necessary to wait, I don't think. Since the indexing request doesn't succeed until the docs have been written to the tlogs, and since the tlogs will be replayed in the event of a problem, your data should be fine. Of course if you're indexing at a very fast rate and your tlog is huge, it'll take a while. FWIW, Erick On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote: Hi Is there a recommended, preferably fast, way to check that a document is indexed by all replicas? I currently do that by issuing a search request to each replica, but was wondering if there's a faster way. Even better, is there a way to verify all replicas of a shard are up-to-date, e.g. by comparing their version or something? By up-to-date I mean that they've all processed the same update requests that came through. If there's a replica lagging behind, I'd like to wait for it to catch up, something like a checkpoint(), before I continue sending more updates. Shai
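For completeness, a sketch of the per-replica check Shai describes, using distrib=false so each query stays on the core it is sent to (the replica URLs are illustrative; in practice you would read them from the cluster state):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCountCheck {
  public static void main(String[] args) throws Exception {
    // Illustrative core URLs for two replicas of the same shard.
    String[] replicas = {
        "http://node1:8983/solr/mycollection_shard1_replica1",
        "http://node2:8983/solr/mycollection_shard1_replica2"
    };
    for (String url : replicas) {
      try (HttpSolrClient client = new HttpSolrClient(url)) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", "false"); // query only this core, no fan-out
        long numFound = client.query(q).getResults().getNumFound();
        System.out.println(url + " -> " + numFound);
      }
    }
  }
}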
Setting up SOLR 5 from an RPM
Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. We currently have this structure: /data/solr - root directory of our solr instance /data/solr/{logs,run} - log/run directories /data/solr/cores - configuration for our cores and solr.in.sh /opt/solr - the RPM installed solr 5 The user running solr can modify anything under /data/solr, but nothing under /opt/solr. Is this sort of configuration supported? Am I missing some variable in our solr.in.sh that sets where temporary files can be extracted? We currently set: SOLR_PID_DIR=/data/solr/run SOLR_HOME=/data/solr/cores SOLR_LOGS_DIR=/data/solr/logs Cheers Tom
Re: How to remove an Alert
On 3/23/2015 2:35 PM, jack.met...@hp.com wrote: I have a problem with [ ... briefly describe your problem here ... ] [ ... insert additional info here - keep it short and to the point ... ] Below are some SPM graphs showing the state of my system. Here's the 'Threads' graph: https://apps.sematext.com/spm-reports/s/aFUIR1fecb You've used some kind of boilerplate help request, but forgot to edit it for your specific problem. Solr doesn't send alerts, so the subject of your message makes no sense in a Solr context, and you haven't indicated how it connects with the SPM graph you linked. You'll need to ask an actual question and provide relevant details from your system to support your question. Thanks, Shawn
Re: TooManyBasicQueries?
Ah yes, right you are. I had thought that `surround` required a different endpoint, but I see now that someone is using a surround query. Many thanks! On Tue, Mar 24, 2015 at 10:02 AM, Erik Hatcher erik.hatc...@gmail.com wrote: Somehow a surround query is being constructed along the way. Search your logs for “surround” and see if someone is maybe sneaking a q={!surround}… in there. If you’re passing input directly through from your application to Solr’s q parameter without any sanitizing or filtering, it’s possible a surround query parser could be asked for. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 24, 2015, at 8:55 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erik - Sorry, I totally missed your reply. To the best of my knowledge, we are not using any surround queries (have to admit I had never heard of them until now). We use solr.SearchHandler for all of our queries. Does that answer the question? Cheers, Ian On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher erik.hatc...@gmail.com wrote: It results from a surround query with too many terms. Says the javadoc: * Exception thrown when {@link BasicQueryFactory} would exceed the limit * of query clauses. I’m curious, are you issuing a large {!surround} query or is it expanding to hit that limit? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Mar 13, 2015, at 9:44 AM, Ian Rose ianr...@fullstory.com wrote: I sometimes see the following in my logs: ERROR org.apache.solr.core.SolrCore – org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded maximum of 1000 basic queries. What does this mean? Does this mean that we have issued a query with too many terms? Or that the number of concurrent queries running on the server is too high? Also, is this a builtin limit or something set in a config file? Thanks! - Ian
Re: Solr replicas going in recovering state during heavy indexing
What do the Solr logs show happens on those servers when they go into recovery? What have you tried to do to diagnose the problem? You might review: http://wiki.apache.org/solr/UsingMailingLists The first thing I'd check, though, is whether you're seeing large GC pauses that exceed the Zookeeper timeout, thus ZK thinks the replica is down and puts it into recovery. You can get this info by tracking the GC cycles as here: https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/, the section "getting a view into garbage collection". Best, Erick On Tue, Mar 24, 2015 at 5:57 AM, Gopal Jee gopal@myntra.com wrote: Hi We have a large SolrCloud cluster. We have observed that during heavy indexing, a large number of replicas go into recovering or down state. What could be the possible reason and/or fix for this issue? Gopal
Re: rough maximum cores (shards) per machine?
Well, there's a ticket out there for thousands of collections on a single machine, although this is way out there. I often see 10-20 small cores on a 4-8 core machine if they're reasonably small (a few million docs). I've seen a single replica strain a 128G, 16-core machine if it has 300M docs. Which is a way of saying ya gotta test with your data/query mix. Wish there was a better answer. Erick On Tue, Mar 24, 2015 at 6:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: rough maximum cores (shards) per machine?
Shards per collection, or across all collections on the node? It will all depend on: 1. Your ingestion/indexing rate. High, medium or low? 2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism. 3. How many collections you will have per node. In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability? Also be careful to be clear about using the Solr term shard (a slice, across all replica nodes) as distinct from the Elasticsearch term shard (a single slice of an index for a single replica, analogous to a Solr core.) -- Jack Krupansky On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: maxReplicasPerNode
On 3/24/2015 3:22 AM, Shai Erera wrote: If this is explained somewhere, I'd appreciate if you can give me a pointer. I don't think it's explained anywhere, so that's a lack in the documentation. One problem with automatic replica addition in response to cluster problems is that there is no mechanism (currently, at least) to indicate that a node disappearance is intentional and temporary, and no way to configure a minimum time interval before taking automatic action. It would be necessary to have these mechanisms before any kind of automatic repair ability could be implemented. Thanks, Shawn
Re: Unable to setup solr cloud with multiple collections.
Why are you doing this in the first place? SolrCloud and master/slave are fundamentally different. When running in SolrCloud mode, there is no need whatsoever to configure replication as per the Wiki link you've outlined above; that's for the older style master/slave setups. Just change it back and watch the magic would be my advice. So if you'd tell us why you thought this was necessary, perhaps we can suggest alternatives, because from a quick glance it looks unnecessary, and in fact harmful. Best, Erick On Mon, Mar 23, 2015 at 10:08 PM, sthita sthit...@gmail.com wrote: I have newly created a new collection and activated replication for 4 nodes (including masters). After doing the config changes as suggested on http://wiki.apache.org/solr/SolrReplication the nodes of the newly created collection are down on Solr cloud. We are not able to add or remove any document on the newly created core, i.e. dict_cn in our case. All the configuration files look ok on Solr cloud: http://lucene.472066.n3.nabble.com/file/n4194833/solr_issue.png This is my replication change in solrconfig.xml:

<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">solrconfig_cn.xml,schema_cn.xml</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://mail:8983/solr/dict_cn</str>
  </lst>
</requestHandler>

Note: I am using solr 4.4.0, zookeeper-3.4.5 Can anyone help me on this? -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Issues to create new core
Tell us all the steps you went through to do this. Note that you should _not_ be using the core admin in the admin UI if you're working with SolrCloud. For stand-alone Solr, the message above is probably caused by your not having a conf directory set up already. The core admin UI expects that you have a pre-existing directory with a conf directory that contains solrconfig.xml, schema.xml, and all the rest of the configuration files. You can specify this via some of the parameters on the admin UI screen (see instanceDir and dataDir). Each core must be in a separate directory or Bad Things Happen. HTH, Erick On Tue, Mar 24, 2015 at 7:01 AM, Alejandro Jesus Mariño Molerio ajmar...@estudiantes.uci.cu wrote: Dear Solr Community: I just began to work with Solr. I choose Solr 5.0, but when I try to create a new core with GUI, show the following error: Error CREATEing SolrCore 'datos': Unable to create core [datos] Caused by: Can't find resource 'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My question is simple, How can fix this problem?. Thanks in advance for your consideration. Alejandro.
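Once the instance directory with its conf exists, creation can also be scripted; a SolrJ sketch using the core admin API (the path and core name are taken from the error message above; the rest is illustrative):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCore {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr")) {
      // Assumes C:\solr\server\solr\datos\conf already holds solrconfig.xml, schema.xml, etc.
      CoreAdminRequest.createCore("datos", "C:\\solr\\server\\solr\\datos", client);
    }
  }
}

Alternatively, the bin/solr create command shipped with Solr 5.0 copies a default configset into place for you, which sidesteps the missing solrconfig.xml problem entirely.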
Re: Auto naming replicas via ADDREPLICA
On 3/23/2015 10:48 AM, Shai Erera wrote: The 'name' param isn't set when I send the URL request (and it's also not specified in the reference guide), but only when I add the replica using SolrJ. I then tweaked my code to do the following:

final CollectionAdminRequest.AddReplica addReplicaRequest =
    new CollectionAdminRequest.AddReplica() {
  @Override
  public SolrParams getParams() {
    final ModifiableSolrParams params = (ModifiableSolrParams) super.getParams();
    params.remove(CoreAdminParams.NAME);
    return params;
  }
};

And voila, the core is now also named mycollection_shard1_replica2, and I'm even able to add as many replicas as I want on this node (where before it failed since 'mycollection' already existed). The 'name' parameter is added by CollectionSpecificAdminRequest.getParams(). So how would you suggest we fix it: 1. Remove it in AddReplica.getParams() -- replicas will always be auto-named. It makes sense, as users didn't have control over it before, and maybe they shouldn't. 2. Add a setCoreName to the AddReplica request -- this would be nice if someone wanted to control the name of the added replica, but otherwise it should not be included in the request. Or maybe we fix the bug by doing #1 and consider #2 as a new feature (allow naming replicas)? Doing both sounds like a good solution to me. I'm trying to think of some cautionary text for the javadoc on the new method, but I'm not really sure what it should say. Perhaps something like "when this method is not used, the new core will receive a name like collection_shardN_replicaN; be aware that if you override it, understanding the collection layout may be more difficult." I'm hoping Mark and/or Yonik (or someone else, if they know) can comment about why the AddReplica code had that behavior and whether this is a good idea in the larger SolrCloud environment. Thanks, Shawn
Re: How to verify a document is indexed by all replicas
Thanks Erick, When a replica is down, no updates are sent to it. When it comes back up, it discovers that it needs to catch-up with the leader. If there are many events it falls back to index replication (slower). During this period of time, is the replica considered ACTIVE or RECOVERING? And, can I assume that at any given moment (aside from ZK connection timeouts etc.) when I check the replicas' state, all the ones that report ACTIVE are in sync with each other? Shai On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com wrote: You can always issue a *:* query, but it'd have to be at least your autoSoftCommit interval ago since the soft commit trigger will have slightly different wall clock times. But it shouldn't be necessary to wait I don't think. Since the indexing request doesn't succeed until the docs have been written to the tlogs, and since the tlogs will be replayed in the event of a problem your data should be fine. Of course if you're indexing at a very fast rate and your tlog is huge, it'll take a while FWIW, Erick On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote: Hi Is there a recommended, preferably fast, way to check that a document is indexed by all replicas? I currently do that by issuing a search request to each replica, but was wondering if there's a faster way. Even better, is there a way to verify all replicas of a shard are up-to-date, e.g. by comparing their version or something? By up-to-date I mean that they've all processed the same update requests that came through. If there's a replica lagging behind, I'd like to wait for it to catch up, something like a checkpoint(), before I continue sending more updates. Shai
Re: Unable to setup solr cloud with multiple collections.
Thanks Erick for your reply. I am trying to create a new core, i.e. dict_cn, which is totally different in terms of index data, configs etc. from the existing core abc. The core is created successfully on my master (i.e. mail) and I can run Solr queries on this newly created core. All the config files (schema.xml and solrconfig.xml) are on the mail server, and ZooKeeper helps me share all the config files with the other collections. I did a similar setup on the other collection, so that the newly created core should be available to all the collections, but it is still showing down. -- View this message in context: http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833p4195078.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: rough maximum cores (shards) per machine?
First off, thanks everyone for the very useful replies thus far. Shawn - thanks for the list of items to check. #1 and #2 should be fine for us and I'll check our ulimit for #3. To add a bit of clarification, we are indeed using SolrCloud. Our current setup is to create a new collection for each customer. For now we allow SolrCloud to decide for itself where to locate the initial shard(s), but in time we expect to refine this such that our system will automatically choose the least loaded nodes according to some metric(s). "Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system." Jack, can you explain a bit what you mean here? It looks like Toke caught your meaning but I'm afraid it missed me. What do you mean by "business entity"? Is your concern that with automatic creation of collections they will be distributed willy-nilly across the cluster, leading to uneven load across nodes? If it is relevant, the schema and solrconfig are controlled entirely by me and are the same for all collections. Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries), but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections. "Better to give each tenant a separate Solr instance that you spin up and spin down based on demand." Regarding this, if by tenant you mean customer, this is not viable for us from a cost perspective. As I mentioned initially, many of our customers are very small, so dedicating an entire machine to each of them would not be economical (or efficient). Or perhaps I am not understanding what your definition of tenant is? Cheers, Ian On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. It was my understanding that Ian used them interchangeably, but of course Ian is the only one who knows. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Thank you. Now your post makes a lot more sense. I will not argue against that. - Toke Eskildsen
Re: rough maximum cores (shards) per machine?
From my experience on a high-end server (256GB memory, 40-core CPU) testing collection numbers with one shard and two replicas, the maximum that would work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps half of that), depending on your startup-time requirements. (Though I have settled on a 6,000-collection maximum with some patching. See SOLR-7191.) You could create multiple clouds after that, and choose the cloud least used to create your collection. Regarding memory usage, I'd pencil in 6MB of Java heap overhead (no docs) per collection. On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote: First off, thanks everyone for the very useful replies thus far. Shawn - thanks for the list of items to check. #1 and #2 should be fine for us and I'll check our ulimit for #3. To add a bit of clarification, we are indeed using SolrCloud. Our current setup is to create a new collection for each customer. For now we allow SolrCloud to decide for itself where to locate the initial shard(s), but in time we expect to refine this such that our system will automatically choose the least loaded nodes according to some metric(s). "Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system." Jack, can you explain a bit what you mean here? It looks like Toke caught your meaning but I'm afraid it missed me. What do you mean by "business entity"? Is your concern that with automatic creation of collections they will be distributed willy-nilly across the cluster, leading to uneven load across nodes? If it is relevant, the schema and solrconfig are controlled entirely by me and are the same for all collections. Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries), but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections. "Better to give each tenant a separate Solr instance that you spin up and spin down based on demand." Regarding this, if by tenant you mean customer, this is not viable for us from a cost perspective. As I mentioned initially, many of our customers are very small, so dedicating an entire machine to each of them would not be economical (or efficient). Or perhaps I am not understanding what your definition of tenant is? Cheers, Ian On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. It was my understanding that Ian used them interchangeably, but of course Ian is the only one who knows. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Thank you. Now your post makes a lot more sense. I will not argue against that. - Toke Eskildsen -- Damien Kamerman
Re: Using G1 with Apache Solr
On 3/24/2015 3:48 PM, Kamran Khawaja wrote: I'm running Solr 4.7.2 with Java 7u75 with the following JVM params: -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -Xmx3072m -Xms3072m -XX:+UseG1GC -XX:+UseLargePages -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=35 What I'm currently seeing is that many of the gc pauses are under an acceptable 0.25 seconds but seeing way too many full GCs with an average stop time of 3.2 seconds. You can find the gc logs here: https://www.dropbox.com/s/v04b336v2k5l05e/g1_gc_7u75.log.gz?dl=0 I initially tested without specifying the HeapRegionSize but that resulted in the humongous message in the gc logs and a ton of full gc pauses. This is similar to the settings I've been working on that I've documented on my wiki page, with better results than you are seeing, and a larger heap than you have configured: https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector You have one additional option that I don't -- InitiatingHeapOccupancyPercent. I would suggest running without that option to see how it affects your GC times. I'm curious what OS you're running under, whether the OS and Java are 64-bit, and whether you have actually enabled huge pages in your operating system. If it's Linux and you have enabled huge pages, have you turned off transparent huge pages as documented by Oracle: https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge On my servers, I do *not* have huge pages configured in the operating system, so the UseLargePages java option isn't doing anything. One final thing ... Oracle developers have claimed that Java 8u40 has some major improvements to the G1 collector, particularly for programs that allocate very large objects. Can you try 8u40? Thanks, Shawn
Re: Using G1 with Apache Solr
On 3/24/2015 9:52 PM, Shawn Heisey wrote: On 3/24/2015 3:48 PM, Kamran Khawaja wrote: I'm running Solr 4.7.2 with Java 7u75 with the following JVM params: I really got my wires crossed. Kamran sent his message to the hotspot-gc-use mailing list, not the solr-user list! Thanks, Shawn
Re: Solr 5.0 -- IllegalStateException: unexpected docvalues type NONE on result grouping
On 3/12/2015 3:36 PM, Alexandre Rafalovitch wrote: Manual optimize is no longer needed for modern Solr. It does great optimization automatically. The only reason I recommended it here is to make sure that all segments are brought up to the latest version and the deleted documents are purged. That's something that would also happen automatically eventually, but eventually was not an option for you. I am glad this helped. I am not 100% sure if you have to do it on each shard in SolrCloud mode, but I suspect so. In SolrCloud, whenever you send an optimize command to any shard replica in a collection, the entire collection will be optimized. SolrCloud will do the optimization sequentially, not in parallel. There is currently no way to optimize only one shard replica, and as far as I know, there is no way to ask for a parallel optimization. Alexandre's comments about the necessity of optimization (whether it's SolrCloud or not) are spot on. The only time that optimization should be done on a modern Solr index is when you have a lot of deleted documents and want to clean those up, either to reclaim disk space or remove them from the relevancy calculation. Most people do see a performance boost on an optimized index compared to a non-optimized index, but with a modern Solr install, you might actually see better performance on a multi-segment index when the indexing rate is high, because Lucene is moving to a model where there are per-segment caches that are not invalidated at commit time, only at merge time. Thanks, Shawn
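A sketch of triggering an optimize from SolrJ; per the above, the whole collection is optimized no matter which replica receives the command (the collection and ZooKeeper addresses are illustrative):

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class OptimizeCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
      client.setDefaultCollection("mycollection");
      // Merges segments and expunges deleted docs; use sparingly, as discussed above.
      client.optimize();
    }
  }
}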
Re: rough maximum cores (shards) per machine?
Don't confuse customers and tenants. -- Jack Krupansky On Tue, Mar 24, 2015 at 2:24 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Sorry Jack. That doesn't scale when you have millions of customers. And these are good problems to have! On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Multi-tenancy is a bad idea for a single Solr cluster. Better to give each tenant a separate Solr instance that you spin up and spin down based on demand. Think about it: if there are a small number of tenants, just giving each their own machine will be cheaper than the effort spent managing a multi-tenant cluster, and if there are a large number of tenants or even a moderate number of large tenants, you can't expect them all to run reasonably on a relatively small cluster. Think about scalability. -- Jack Krupansky On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote: Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases, these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives but I realize this is heavily app-specific). So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter. Does that make sense? - Ian On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Shards per collection, or across all collections on the node? It will all depend on: 1. Your ingestion/indexing rate. High, medium or low? 2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism. 3. How many collections you will have per node. In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability? Also be careful to be clear about using the Solr term shard (a slice, across all replica nodes) as distinct from the Elasticsearch term shard (a single slice of an index for a single replica, analogous to a Solr core.) -- Jack Krupansky On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. 
* I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian -- Regards, Shalin Shekhar Mangar.
Re: How to verify a document is indexed by all replicas
Hi Shai, To your original question on how to know if a document has been indexed at all replicas -- You can add a min_rf=true parameter to your indexing request and then Solr will add information to the response about how many replicas gave an ack' to the leader. So if the returned number is equal to the number of replicas, you can be sure that the doc has been indexed everywhere. More comments inline: On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera ser...@gmail.com wrote: Thanks Erick, When a replica is down, no updates are sent to it. When it comes back up, it discovers that it needs to catch-up with the leader. If there are many events it falls back to index replication (slower). During this period of time, is the replica considered ACTIVE or RECOVERING? It is marked as recovering. And, can I assume that at any given moment (aside from ZK connection timeouts etc.) when I check the replicas' state, all the ones that report ACTIVE are in sync with each other? Yes, 'active' replicas should be in sync but autoCommits can cause inconsistency between replicas as to what is visible to searchers (even if all replicas have indexed the same data). Also, checking the state of the replica is not enough, one should always check for the state=active and live-ness of the replica i.e. the node is marked live under /live_nodes in ZK. Shai On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com wrote: You can always issue a *:* query, but it'd have to be at least your autoSoftCommit interval ago since the soft commit trigger will have slightly different wall clock times. But it shouldn't be necessary to wait I don't think. Since the indexing request doesn't succeed until the docs have been written to the tlogs, and since the tlogs will be replayed in the event of a problem your data should be fine. Of course if you're indexing at a very fast rate and your tlog is huge, it'll take a while FWIW, Erick On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote: Hi Is there a recommended, preferably fast, way to check that a document is indexed by all replicas? I currently do that by issuing a search request to each replica, but was wondering if there's a faster way. Even better, is there a way to verify all replicas of a shard are up-to-date, e.g. by comparing their version or something? By up-to-date I mean that they've all processed the same update requests that came through. If there's a replica lagging behind, I'd like to wait for it to catch up, something like a checkpoint(), before I continue sending more updates. Shai -- Regards, Shalin Shekhar Mangar.
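For reference, a minimal SolrJ sketch of the min_rf check Shalin describes. The ZooKeeper address, collection name, and document are placeholders, and the getMinAchievedReplicationFactor helper is assumed to be available on CloudSolrClient as in SolrJ 4.9+:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.NamedList;

    public class MinRfCheck {
      public static void main(String[] args) throws Exception {
        // Placeholders: adjust the ZK host and collection to your cluster.
        CloudSolrClient client = new CloudSolrClient("localhost:2181");
        client.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        // Ask Solr to report how many replicas acknowledged the update.
        req.setParam("min_rf", "2");

        NamedList<Object> rsp = client.request(req);
        int rf = client.getMinAchievedReplicationFactor("mycollection", rsp);
        if (rf < 2) {
          // Fewer replicas acked than desired; retry later or alert.
          System.out.println("achieved rf=" + rf + ", wanted 2");
        }
        client.close();
      }
    }

If the returned value equals the number of replicas of the shard, the document was indexed everywhere, as described above.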
Re: rough maximum cores (shards) per machine?
Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases, these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives but I realize this is heavily app-specific). So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter. Does that make sense? - Ian On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Shards per collection, or across all collections on the node? It will all depend on: 1. Your ingestion/indexing rate. High, medium or low? 2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism. 3. How many collections you will have per node. In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability? Also be careful to be clear about using the Solr term shard (a slice, across all replica nodes) as distinct from the Elasticsearch term shard (a single slice of an index for a single replica, analogous to a Solr core.) -- Jack Krupansky On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: Auto naming replicas via ADDREPLICA
I use vanilla 5.0. I intended to fix it myself, but if you want to go ahead, I'd be happy to review the patch. Shai On Tue, Mar 24, 2015 at 6:11 PM, Anshum Gupta ans...@anshumgupta.net wrote: It's certainly looks like a bug and the name shouldn't be added to the request automatically. Can you confirm what version of Solr are you using? If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to both #1 and #2. On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote: Shawn, that was a great tip! When I tried the URL, the core was named as expected (mycollection_shard1_replica2). I then compared the URLs as reported in the logs, and I believe I found the bug: SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1 *name=mycollection*action=ADDREPLICA*collection=mycollection* wt=javabinversion=2} The 'name' param isn't set when I send the URL request (and it's also not specified in the reference guide), but only when I add the replica using SolrJ. I then tweaked my code to do the following: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica() { @Override public SolrParams getParams() { final ModifiableSolrParams params = (ModifiableSolrParams) super.getParams(); params.remove(CoreAdminParams.NAME); return params; } }; And voila, the core is now also named mycollection_shard1_replica2, and I'm even able to add as many replicas as I want on this node (where before it failed since 'mycollection' already existed). The 'name' parameter is added by CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix it: 1. Remove it in AddReplica.getParams() -- replicas will always be auto-named. It makes sense as users didn't have control over it before, and maybe they shouldn't. 2. Add a setCoreName to AddReplica request -- this would be nice if someone wanted to control the name of the added replica, but otherwise should not be included in the request Or maybe we fix the bug by doing #1 and consider #2 as a new feature allow naming replicas? Shai On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/23/2015 9:27 AM, Shai Erera wrote: I have a Solr cluster started (all programmatically) with one Solr node, one collection and one shard. I set the replicationFactor to 1. The name of the result core was set to mycollection_shard1_replica1. I then start a second Solr node and issue an ADDREPLICA command as described in the reference guide, using following code: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica(); addReplicaRequest.setCollectionName(mycollection); addReplicaRequest.setShardName(shard1); final CollectionAdminResponse response = addReplicaRequest.process(solrClient); The replica is added under a core named mycollection and not e.g. mycollection_shard1_replica2. I'd call that a bug. BTW, the example in the reference guide shows that issuing the request: http://localhost:8983/solr/admin/collections?action=ADDREPLICAcollection=test2shard=shard2node=192.167.1.2:8983_solr Results in response lst name=responseHeader int name=status0/int int name=QTime3764/int /lst lst name=success lst lst name=responseHeader int name=status0/int int name=QTime3450/int /lst * str name=coretest2_shard2_replica4/str* Did you try out a URL like that to see whether it also results in the misnamed core, or if it behaves correctly as the reference guide indicates? 
If the URL behaves correctly, I'd be curious what Solr logs for the URL request and the SolrJ request. Thanks, Shawn -- Anshum Gupta
Re: Auto naming replicas via ADDREPLICA
It's certainly looks like a bug and the name shouldn't be added to the request automatically. Can you confirm what version of Solr are you using? If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to both #1 and #2. On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote: Shawn, that was a great tip! When I tried the URL, the core was named as expected (mycollection_shard1_replica2). I then compared the URLs as reported in the logs, and I believe I found the bug: SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1 *name=mycollection*action=ADDREPLICA*collection=mycollection* wt=javabinversion=2} The 'name' param isn't set when I send the URL request (and it's also not specified in the reference guide), but only when I add the replica using SolrJ. I then tweaked my code to do the following: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica() { @Override public SolrParams getParams() { final ModifiableSolrParams params = (ModifiableSolrParams) super.getParams(); params.remove(CoreAdminParams.NAME); return params; } }; And voila, the core is now also named mycollection_shard1_replica2, and I'm even able to add as many replicas as I want on this node (where before it failed since 'mycollection' already existed). The 'name' parameter is added by CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix it: 1. Remove it in AddReplica.getParams() -- replicas will always be auto-named. It makes sense as users didn't have control over it before, and maybe they shouldn't. 2. Add a setCoreName to AddReplica request -- this would be nice if someone wanted to control the name of the added replica, but otherwise should not be included in the request Or maybe we fix the bug by doing #1 and consider #2 as a new feature allow naming replicas? Shai On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/23/2015 9:27 AM, Shai Erera wrote: I have a Solr cluster started (all programmatically) with one Solr node, one collection and one shard. I set the replicationFactor to 1. The name of the result core was set to mycollection_shard1_replica1. I then start a second Solr node and issue an ADDREPLICA command as described in the reference guide, using following code: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica(); addReplicaRequest.setCollectionName(mycollection); addReplicaRequest.setShardName(shard1); final CollectionAdminResponse response = addReplicaRequest.process(solrClient); The replica is added under a core named mycollection and not e.g. mycollection_shard1_replica2. I'd call that a bug. BTW, the example in the reference guide shows that issuing the request: http://localhost:8983/solr/admin/collections?action=ADDREPLICAcollection=test2shard=shard2node=192.167.1.2:8983_solr Results in response lst name=responseHeader int name=status0/int int name=QTime3764/int /lst lst name=success lst lst name=responseHeader int name=status0/int int name=QTime3450/int /lst * str name=coretest2_shard2_replica4/str* Did you try out a URL like that to see whether it also results in the misnamed core, or if it behaves correctly as the reference guide indicates? If the URL behaves correctly, I'd be curious what Solr logs for the URL request and the SolrJ request. Thanks, Shawn -- Anshum Gupta
Re: Auto naming replicas via ADDREPLICA
Either of them works for me. If you want to get your hands dirty, please go ahead. I can review/provide feedback if you need anything there. I'll just create a JIRA to begin with. On Tue, Mar 24, 2015 at 9:15 AM, Shai Erera ser...@gmail.com wrote: I use vanilla 5.0. I intended to fix it myself, but if you want to go ahead, I'd be happy to review the patch. Shai On Tue, Mar 24, 2015 at 6:11 PM, Anshum Gupta ans...@anshumgupta.net wrote: It's certainly looks like a bug and the name shouldn't be added to the request automatically. Can you confirm what version of Solr are you using? If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it to both #1 and #2. On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera ser...@gmail.com wrote: Shawn, that was a great tip! When I tried the URL, the core was named as expected (mycollection_shard1_replica2). I then compared the URLs as reported in the logs, and I believe I found the bug: SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1 *name=mycollection*action=ADDREPLICA*collection=mycollection* wt=javabinversion=2} The 'name' param isn't set when I send the URL request (and it's also not specified in the reference guide), but only when I add the replica using SolrJ. I then tweaked my code to do the following: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica() { @Override public SolrParams getParams() { final ModifiableSolrParams params = (ModifiableSolrParams) super.getParams(); params.remove(CoreAdminParams.NAME); return params; } }; And voila, the core is now also named mycollection_shard1_replica2, and I'm even able to add as many replicas as I want on this node (where before it failed since 'mycollection' already existed). The 'name' parameter is added by CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix it: 1. Remove it in AddReplica.getParams() -- replicas will always be auto-named. It makes sense as users didn't have control over it before, and maybe they shouldn't. 2. Add a setCoreName to AddReplica request -- this would be nice if someone wanted to control the name of the added replica, but otherwise should not be included in the request Or maybe we fix the bug by doing #1 and consider #2 as a new feature allow naming replicas? Shai On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/23/2015 9:27 AM, Shai Erera wrote: I have a Solr cluster started (all programmatically) with one Solr node, one collection and one shard. I set the replicationFactor to 1. The name of the result core was set to mycollection_shard1_replica1. I then start a second Solr node and issue an ADDREPLICA command as described in the reference guide, using following code: final CollectionAdminRequest.AddReplica addReplicaRequest = new CollectionAdminRequest.AddReplica(); addReplicaRequest.setCollectionName(mycollection); addReplicaRequest.setShardName(shard1); final CollectionAdminResponse response = addReplicaRequest.process(solrClient); The replica is added under a core named mycollection and not e.g. mycollection_shard1_replica2. I'd call that a bug. 
BTW, the example in the reference guide shows that issuing the request:

http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr

results in this response:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3764</int>
      </lst>
      <lst name="success">
        <lst>
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">3450</int>
          </lst>
          <str name="core">test2_shard2_replica4</str>
        </lst>
      </lst>
    </response>

Did you try out a URL like that to see whether it also results in the misnamed core, or if it behaves correctly as the reference guide indicates? If the URL behaves correctly, I'd be curious what Solr logs for the URL request and the SolrJ request. Thanks, Shawn -- Anshum Gupta
Re: How to verify a document is indexed by all replicas
You can add a min_rf=true parameter to your indexing

Yeah I read about it, but it doesn't help me in this case, as I'm implementing some monitoring component over a SolrCloud instance, so I have no handle to the indexing client. I would like the monitor to check the replicas and report something if all replicas are in sync, some are not in sync, or e.g. replicas 2 and 3 are further ahead than replica1.

Also, checking the state of the replica is not enough, one should always check for the state=active and live-ness of the replica i.e. the node is marked live under /live_nodes in ZK.

Thanks, I've looked at code samples in tests and saw this is done, so I copied the logic. E.g. an isReplicaAlive(Replica replica) helper checks both the replica's state, as well as that the node it's on is in the cluster state's live nodes.

Also, verifying replicas are in sync via searching is not the best solution at all. Apart from not being that fast, it also doesn't factor in documents that are in the tlog, or in IW's RAM buffer, or even that a document may have been updated. So I will change my test to ensure that all replicas of a slice are in state active (and on a live node) and rely on that being OK. Shai On Tue, Mar 24, 2015 at 6:39 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Hi Shai, To your original question on how to know if a document has been indexed at all replicas -- You can add a min_rf=true parameter to your indexing request and then Solr will add information to the response about how many replicas gave an 'ack' to the leader. So if the returned number is equal to the number of replicas, you can be sure that the doc has been indexed everywhere. More comments inline: On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera ser...@gmail.com wrote: Thanks Erick, When a replica is down, no updates are sent to it. When it comes back up, it discovers that it needs to catch-up with the leader. If there are many events it falls back to index replication (slower). During this period of time, is the replica considered ACTIVE or RECOVERING? It is marked as recovering. And, can I assume that at any given moment (aside from ZK connection timeouts etc.) when I check the replicas' state, all the ones that report ACTIVE are in sync with each other? Yes, 'active' replicas should be in sync but autoCommits can cause inconsistency between replicas as to what is visible to searchers (even if all replicas have indexed the same data). Also, checking the state of the replica is not enough, one should always check for the state=active and live-ness of the replica i.e. the node is marked live under /live_nodes in ZK. Shai On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson erickerick...@gmail.com wrote: You can always issue a *:* query, but it'd have to be at least your autoSoftCommit interval ago since the soft commit trigger will have slightly different wall clock times. But it shouldn't be necessary to wait, I don't think. Since the indexing request doesn't succeed until the docs have been written to the tlogs, and since the tlogs will be replayed in the event of a problem, your data should be fine. Of course if you're indexing at a very fast rate and your tlog is huge, it'll take a while. FWIW, Erick On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera ser...@gmail.com wrote: Hi Is there a recommended, preferably fast, way to check that a document is indexed by all replicas? I currently do that by issuing a search request to each replica, but was wondering if there's a faster way.
Even better, is there a way to verify all replicas of a shard are up-to-date, e.g. by comparing their version or something? By up-to-date I mean that they've all processed the same update requests that came through. If there's a replica lagging behind, I'd like to wait for it to catch up, something like a checkpoint(), before I continue sending more updates. Shai -- Regards, Shalin Shekhar Mangar.
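Below is a sketch of the isReplicaAlive-style check Shai describes, using SolrJ 5.x cluster-state APIs. The collection name is a placeholder and the client is assumed to be connected already (e.g. after client.connect() or a first request); a replica only counts if its state is active AND its node appears under /live_nodes:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ReplicaCheck {
      // Returns true only if every replica of every slice is active AND live.
      static boolean allReplicasActiveAndLive(CloudSolrClient client, String collection) {
        ClusterState clusterState = client.getZkStateReader().getClusterState();
        DocCollection coll = clusterState.getCollection(collection);
        boolean allOk = true;
        for (Slice slice : coll.getSlices()) {
          for (Replica replica : slice.getReplicas()) {
            boolean active = "active".equals(replica.getStr(ZkStateReader.STATE_PROP));
            boolean live = clusterState.liveNodesContain(replica.getNodeName());
            if (!(active && live)) {
              System.out.println("replica " + replica.getName() + " on "
                  + replica.getNodeName() + " is not active+live");
              allOk = false;
            }
          }
        }
        return allOk;
      }
    }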
Re: maxReplicasPerNode
Thanks guys, this makes sense I guess, from Solr's side. Perhaps we can have a new Collections API like REDIRECTREPLICA or something, that will redirect a replica to the new node. This API can simply do ADDREPLICA on the new node, and DELETEREPLICA of the node that doesn't exist anymore. I guess I need to implement that for my use case now (I know that if a node came down, it won't ever come back up again - there will be a new node replacing it), so I'll see how it plays out and if it works well, I'll open a JIRA issue. In my case, when the new node comes up, it can check the cluster's status, and if it detects an orphanage replica, it will add itself as a new replica and delete the orphanage one. Let me know if you see a problem with how I intend to address that. Shai On Tue, Mar 24, 2015 at 6:01 PM, Anshum Gupta ans...@anshumgupta.net wrote: Yes, it applies to both. Solr wouldn't auto-add replicas in either of those cases (or any other case) to meet the rf specified at create time. On Tue, Mar 24, 2015 at 2:22 AM, Shai Erera ser...@gmail.com wrote: Thanks Anshum, About #3, i line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down. Just to make sure the answer applies to both these cases: 1. There are two replicas on node1 and node2. Solr won't add a replica to node1 when node2 goes down. 2. The collection was created with rf=2, Solr creates replicas on node1 and node2. If node2 goes down and a node3 comes up instead, will it be assigned a replica, or Solr does not do that also? In short, is there any scenario where Solr would auto-add replicas (aside from running on HDFS) to meet the 'rf' setting, or after the collection has been created, ensuring RF is met is my responsibility? Shai On Tue, Mar 24, 2015 at 10:02 AM, Anshum Gupta ans...@anshumgupta.net wrote: Hi Shai, As of now, all replicas for a collections are created to meet the specified replication factor at the time of collection creation. There's no way to defer that until more nodes are up. Your best bet is to have the nodes already up before you CREATE the collection or create the collection with a lower replication factor and then use ADDREPLICA. About auto-addition of replicas, that's kind of supported when using shared file system (HDFS) to host the index. It's doesn't truly work as per your use-case i.e. it doesn't consider the intended replication factor but only brings up a Replica in case all replicas for a node are down, so that SolrCloud continues to be usable. It also doesn't auto-remove replica when the old node comes back up. You can read more about this in the Automatically Add Replicas in SolrCloud section here: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS About #3, i line with my answer to the previous question, Solr wouldn't auto-add a Replica to meet the replication factor when a node goes down. On Tue, Mar 24, 2015 at 12:36 AM, Shai Erera ser...@gmail.com wrote: Hi I saw that we can define maxShardsPerNode when creating a collection, but I don't see that I can set something similar for replicas. My scenario is the following: - I setup one Solr node - Create collection with numShards=1 and replicationFactor=2 - Hopefully, one replica is created on that node - When I bring up the second Solr node, the second replica will be created What I see is that both replicas are created on the first node, and when I bring up the second Solr node, none of the replicas are moved. 
I know that I can move one replica by calling ADDREPLICA on node2, then DELETEREPLICA on node1, but I was wondering if there's an automated way to do that. I've also considered creating the collection with replicationFactor=1 and when the second node comes up it will look for shards w/ one replica only, and assign themselves as the replica. But it means I have to own that piece of logic, where if Solr already does that, that's better. Also, from what I understand, if I create a collection w/ rf=2 and there are two nodes, then each node is assigned a replica. If one of the nodes comes down, and a 3rd node comes up, it will be assigned a replica -- is that correct? Another related question, if there are two replicas on node1 and node2, and node2 goes down -- will node1 be assigned the second replica as well? If this is explained somewhere, I'd appreciate if you can give me a pointer. Shai -- Anshum Gupta -- Anshum Gupta
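As a rough sketch of the ADDREPLICA-plus-DELETEREPLICA dance proposed above: the AddReplica usage follows the ADDREPLICA thread elsewhere in this digest, while the DeleteReplica request class and its setters are assumed here to mirror it; collection, shard, and replica names are placeholders.

    // Add a replica for shard1 on the new node first...
    CollectionAdminRequest.AddReplica add = new CollectionAdminRequest.AddReplica();
    add.setCollectionName("mycollection");
    add.setShardName("shard1");
    add.process(solrClient);

    // ...then drop the orphaned replica whose node is gone.
    // DeleteReplica is assumed to mirror AddReplica's API.
    CollectionAdminRequest.DeleteReplica del = new CollectionAdminRequest.DeleteReplica();
    del.setCollectionName("mycollection");
    del.setShardName("shard1");
    del.setReplica("core_node2"); // replica name as listed in clusterstate.json
    del.process(solrClient);

Doing the ADDREPLICA first means the shard never drops below its current replica count while the replica is being moved.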
One of three cores is missing userData and lastModified fields from /admin/cores
Hey All, On a Solr server running 4.10.2 with three cores, two return the expected info from /solr/admin/cores?wt=json but the third is missing userData and lastModified. The first (artists) and third (tracks) cores from the linked screenshot are the ones I care about. Unfortunately, the third (tracks) is the one missing lastModified. As far as I can see, that comes from: https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568 I can't trace back to see what would possibly cause getUserData() to return an empty Object, but that appears to be what is happening. For these servers, indexes that are pre-optimized are shipped over to the server and the server is re-started... nothing is actually ever committed on these live servers. This should behave exactly the same for artists and tracks, even though tracks is the one always missing lastModified. Here's the output in img format; I'll paste the full JSON[1] below: http://monosnap.com/image/XMyAfk5z3AvHgY39m0qAKAGlc3RACI.png I'd like to be able to provide a way for clients to grab the lastModified time for both indices so that they can see how old/stale the data they are getting results back from is... Alternatively, is there any other way to easily expose how old (last modified time?) the index for a core is? Thanks, Aaron

1: Full JSON
---snip---
{
  "responseHeader": { "status": 0, "QTime": 10 },
  "defaultCoreName": "collection1",
  "initFailures": {},
  "status": {
    "artists": {
      "name": "artists",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/artists/",
      "dataDir": "/opt/solr/search/solr/artists/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:23.667Z",
      "uptime": 7335696,
      "index": {
        "numDocs": 3360380,
        "maxDoc": 3360380,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 63366952,
        "version": 421,
        "segmentCount": 1,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/artists/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/artists/index",
        "userData": { "commitTimeMSec": "1427133705908" },
        "lastModified": "2015-03-23T18:01:45.908Z",
        "sizeInBytes": 25341305528,
        "size": "23.6 GB"
      }
    },
    "banana-int": {
      "name": "banana-int",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/banana-int/",
      "dataDir": "/opt/solr/search/solr/banana-int/data/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:22.895Z",
      "uptime": 7336472,
      "index": {
        "numDocs": 3,
        "maxDoc": 3,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 17448,
        "version": 135,
        "segmentCount": 3,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/opt/solr/search/solr/banana-int/data/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/banana-int/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)",
        "userData": { "commitTimeMSec": "1412796723183" },
        "lastModified": "2014-10-08T19:32:03.183Z",
        "sizeInBytes": 16196,
        "size": "15.82 KB"
      }
    },
    "tracks": {
      "name": "tracks",
      "isDefaultCore": false,
      "instanceDir": "/opt/solr/search/solr/tracks/",
      "dataDir": "/opt/solr/search/solr/tracks/",
      "config": "solrconfig.xml",
      "schema": "schema.xml",
      "startTime": "2015-03-24T14:12:23.656Z",
      "uptime": 7335713,
      "index": {
        "numDocs": 53268126,
        "maxDoc": 53268126,
        "deletedDocs": 0,
        "indexHeapUsageBytes": 517650552,
        "version": 100,
        "segmentCount": 1,
        "current": true,
        "hasDeletions": false,
        "directory": "org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/tracks/index lockFactory=NativeFSLockFactory@/opt/solr/search/solr/tracks/index",
        "userData": {},
        "sizeInBytes": 122892905007,
        "size": "114.45 GB"
      }
    }
  }
}
---snip---
Regarding detection of duplication
Hi, My requirement is to detect duplication in titles after removing punctuation marks, stop words, and accented characters. I am trying to do an exact match first; after that I am thinking of applying filters. I have tried solr.KeywordTokenizerFactory, and it does exact matching. But when I add <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>, the stop filter is not working. And if I apply solr.StandardTokenizerFactory instead, I am not getting the exact match. Titles: "What is a apple?", "What is an apple?", "What is the apple?". When I type "What is a apple" I need to get all of the above. Could you please let me know whether there is any tokenizer/filter matching my requirement?
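A note on why the stop filter appears to do nothing here: after solr.KeywordTokenizerFactory the whole title is a single token, so no token ever equals a stop word and StopFilterFactory has nothing to remove. One possible chain (the type name is invented) tokenizes first, so the tokenizer drops punctuation and stop words can be removed token by token:

    <fieldType name="text_dedup" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- StandardTokenizer drops punctuation while splitting into words. -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Removes "what", "is", "a", "an", "the", ... per stopwords.txt. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <!-- Folds accented characters to their ASCII equivalents. -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

With a typical stopword list, all three example titles reduce to the single token "apple", so a phrase query over this field (e.g. q={!field f=title_dedup}What is a apple) would match all of them.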
Re: Solr and HDFS configuration
The ultimate answer is that you need to test your configuration with your expected workflow. However, the thing that mitigates the remote IO factor (hopefully) is that the Solr HDFS support features a block cache that should (when tuned correctly) cache in RAM the blocks your Solr process needs the most. Solr on HDFS currently doesn't have any sort of rack locality like there is with, say, HBase colocated on the HDFS nodes. So you can expect that even with Solr installed on the same nodes as your HDFS datanodes, there will be remote IO. Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. "The Science of Influence Marketing" 18 East 41st Street, New York, NY 10017 t: @appinions (https://twitter.com/Appinions) | w: http://www.appinions.com/ On Tue, Mar 24, 2015 at 2:47 PM, Joseph Obernberger j...@lovehorsepower.com wrote: Hi All - does it make sense to run a Solr shard on a node within a Hadoop cluster that is not a data node? In that case all the data that node processes would need to come over the network, but you get the benefit of more CPU for things like faceting. Thank you! -Joe
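For reference, the block cache Michael mentions is configured on the HdfsDirectoryFactory in solrconfig.xml. A sketch with illustrative values only; the parameter names are the ones documented on the Running Solr on HDFS reference page, and the HDFS path is a placeholder:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <!-- The off-heap block cache that keeps hot index blocks in RAM. -->
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    </directoryFactory>

With 16384 blocks of 8 KB per slab, each slab is roughly 128 MB of direct memory, so the JVM needs a matching -XX:MaxDirectMemorySize setting.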
Re: rough maximum cores (shards) per machine?
On 3/24/2015 11:22 AM, Ian Rose wrote: Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases, these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives but I realize this is heavily app-specific). So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter.

One resource that may be exhausted faster than any other when you have a lot of cores on a Solr instance (especially when they are not idle) is Java heap memory, so you might need to increase the Java heap. Memory in the server is one of the most important resources you have for Solr performance, and here I am talking about memory that is *not* used in the Java heap (or any other program) -- the OS must be able to effectively cache your index data or Solr performance will be terrible.

You have said "Solr cluster" and "collection" ... so that makes me think you're running SolrCloud. In cloud mode, you can't really use the LotsOfCores functionality, where you mark cores transient and tell Solr how many cores you'd like to have resident at the same time. If you are NOT in cloud mode, then you can use this feature: http://wiki.apache.org/solr/LotsOfCores

In general, there are three resources other than memory which might become exhausted with a large number of cores: One resource is the maximum open files limit in the operating system, which typically defaults to 1024. Each core will typically have several dozen files in its index, so it's very easy to reach 1024 open files. The second resource is the maximum allowed threads in your servlet container config -- each core you add requires more threads. The default maxThreads value in most containers is 200. The Jetty container included in the Solr download is preconfigured with a maxThreads value of 10,000, effectively removing the limit for most setups. The third resource is related to the second -- some operating systems implement threads as hidden processes, and many operating systems will limit the number of processes that a user may start. On Linux, this limit is typically 1024, and may need to be increased. I really need to add this kind of info to the wiki. Thanks, Shawn
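For the non-cloud LotsOfCores feature Shawn links to, the knobs look roughly like this (values are illustrative, not recommendations): transientCacheSize in solr.xml caps how many transient cores stay loaded at once, and each rarely-used core opts in via its core.properties:

    <!-- solr.xml (5.x style): keep at most 64 transient cores resident -->
    <solr>
      <int name="transientCacheSize">64</int>
    </solr>

    # core.properties for a rarely-used core (name is a placeholder)
    name=tinycustomer1
    transient=true
    loadOnStartup=false

Cores marked this way are loaded lazily on first request and evicted when the cache is full, which is what keeps thousands of mostly-idle cores from exhausting the resources listed above.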
Solr and HDFS configuration
Hi All - does it make sense to run a Solr shard on a node within a Hadoop cluster that is not a data node? In that case all the data that node processes would need to come over the network, but you get the benefit of more CPU for things like faceting. Thank you! -Joe
Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor
Hi Alex, Thanks again for the reply. See my response below inline. Am 22.03.2015 um 20:14 schrieb Alexandre Rafalovitch arafa...@gmail.com: I am not entirely sure your problem is at the XSL level yet? *) I see problems with quotes in two places (in datasource, and in outer entity). Did you paste definitions from MSWord by any chance? The file was created in a text editor. I am not sure which quotes you are referring to. They look fine to me and the XML file valides alright. Could you perhaps be more specific? *) I see that you declare outer entity to be rootEntity=true, so you will not get anything from inner documents That’s correct, I have set the value to „false now *) I don't see any XPath definitions in the inner entity, so the processor does not know how to actually map to the fields (that's different for SQLEntityProcessor which auto-maps). As far as I know, the explicit mappings are not required when the result of the transformation is in the Solr default import format. The documentation says: useSolrAddSchema - Set this to true if the content is in the form of the standard Solr update XML schema. (https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler) But maybe my interpretation here is incorrect. I was assuming that setting this attribute to „true“ will allow the DIH to directly process the resulting XML file as if I was importing it with the command line Java tool. I would step back from inner DIH entity and make sure your outer entity actually captures something. Maybe by enabling dynamicField * with stored=true. See what you get into the schema. Then, add XPath against original XML, just to make sure you capture _something_. Then, XSLT and XPath. OK, I will try to debug the DIH like this. Thanks again. Cheers, Martin Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 22 March 2015 at 12:36, Martin Wunderlich martin...@gmx.net wrote: Hi Alex, Thanks a lot for the reply and apologies for being unclear. The XPathEntityProcessor provides an option to specify an XSLT file that should be applied to the XML input prior to the actual data import. I am including my current configuration below, with the respective attribute highlighted. I have checked various forums and documentation bits, but the config XML seems ok to me. And yet, nothing gets imported. Cheers, Martin dataConfig dataSource encoding=UTF-8 type=„FileDataSource / entity name=pickupdir processor=FileListEntityProcessor rootEntity=true fileName=.*xml baseDir=„/abs/path/to/source/dir/for/import/ recursive=true newerThan=${dataimporter.last_index_time} dataSource=null entity name=xml processor=XPathEntityProcessor stream=false useSolrAddSchema=true url=${pickupdir.fileAbsolutePath} xsl=/abs/path/to/xslt/file/in/myCore/conf/transform.xsl /entity /entity /document /dataConfig Am 22.03.2015 um 01:18 schrieb Alexandre Rafalovitch arafa...@gmail.com mailto:arafa...@gmail.com: What do you mean using DIH with XSLT together? DIH uses a basic XPath parser, but not full XSLT. So, it's not very clear what the question actually means. How did you configure it all? Regards, Alex. 
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ http://www.solr-start.com/ On 21 March 2015 at 14:14, Martin Wunderlich martin...@gmx.net wrote: Hi all, I am trying to create a data import handler (DIH) to import XML files. The source XML should be transformed using XSLT into the standard Solr import format. I have tested the XSLT and successfully imported data using the Java-based simple import tool. However, when I try to import the same XML files with the same XSLT pre-processing using a DIH configured in solrconfig.xml, it doesn’t work. I can execute the DIH from the admin interface, but no documents get imported. The logging console doesn’t give any errors. Could someone who has managed to successfully set up a similar configuration (XML import via DIH with XSL pre-processing), provide with the basic configuration, so that I can check what might be wrong in mine? Thanks a lot. Cheers, Martin
Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor
type=„FileDataSource / I am getting both missing closing quote and the opening quote is a funny one (aligns on the bottom). But your response email also does that, so maybe you are using some smart editor. Try checking this conversation in a web archive if you can't see the unusual quotes. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 March 2015 at 15:41, Martin Wunderlich martin...@gmx.net wrote: Hi Alex, Thanks again for the reply. See my response below inline. Am 22.03.2015 um 20:14 schrieb Alexandre Rafalovitch arafa...@gmail.com: I am not entirely sure your problem is at the XSL level yet? *) I see problems with quotes in two places (in datasource, and in outer entity). Did you paste definitions from MSWord by any chance? The file was created in a text editor. I am not sure which quotes you are referring to. They look fine to me and the XML file valides alright. Could you perhaps be more specific? *) I see that you declare outer entity to be rootEntity=true, so you will not get anything from inner documents That’s correct, I have set the value to „false now *) I don't see any XPath definitions in the inner entity, so the processor does not know how to actually map to the fields (that's different for SQLEntityProcessor which auto-maps). As far as I know, the explicit mappings are not required when the result of the transformation is in the Solr default import format. The documentation says: useSolrAddSchema - Set this to true if the content is in the form of the standard Solr update XML schema. (https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler) But maybe my interpretation here is incorrect. I was assuming that setting this attribute to „true“ will allow the DIH to directly process the resulting XML file as if I was importing it with the command line Java tool. I would step back from inner DIH entity and make sure your outer entity actually captures something. Maybe by enabling dynamicField * with stored=true. See what you get into the schema. Then, add XPath against original XML, just to make sure you capture _something_. Then, XSLT and XPath. OK, I will try to debug the DIH like this. Thanks again. Cheers, Martin Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 22 March 2015 at 12:36, Martin Wunderlich martin...@gmx.net wrote: Hi Alex, Thanks a lot for the reply and apologies for being unclear. The XPathEntityProcessor provides an option to specify an XSLT file that should be applied to the XML input prior to the actual data import. I am including my current configuration below, with the respective attribute highlighted. I have checked various forums and documentation bits, but the config XML seems ok to me. And yet, nothing gets imported. 
Cheers, Martin dataConfig dataSource encoding=UTF-8 type=„FileDataSource / entity name=pickupdir processor=FileListEntityProcessor rootEntity=true fileName=.*xml baseDir=„/abs/path/to/source/dir/for/import/ recursive=true newerThan=${dataimporter.last_index_time} dataSource=null entity name=xml processor=XPathEntityProcessor stream=false useSolrAddSchema=true url=${pickupdir.fileAbsolutePath} xsl=/abs/path/to/xslt/file/in/myCore/conf/transform.xsl /entity /entity /document /dataConfig On 22.03.2015 at 01:18, Alexandre Rafalovitch arafa...@gmail.com wrote: What do you mean using DIH with XSLT together? DIH uses a basic XPath parser, but not full XSLT. So, it's not very clear what the question actually means. How did you configure it all? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 21 March 2015 at 14:14, Martin Wunderlich martin...@gmx.net wrote: Hi all, I am trying to create a data import handler (DIH) to import XML files. The source XML should be transformed using XSLT into the standard Solr import format. I have tested the XSLT and successfully imported data using the Java-based simple import tool. However, when I try to import the same XML files with the same XSLT pre-processing using a DIH configured in solrconfig.xml, it doesn't work. I can execute the DIH from the admin interface, but no documents get imported. The logging console doesn't give any errors. Could someone who has managed to successfully set up a similar configuration (XML import via DIH with XSL pre-processing) provide the basic configuration, so that I can check what might be wrong in mine? Thanks a lot. Cheers, Martin
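For readers following along, here is the same dataConfig spelled out with plain ASCII quotes, the opening document element restored, and rootEntity=false as per Martin's follow-up; the paths are the placeholders from the original mails:

    <dataConfig>
      <dataSource encoding="UTF-8" type="FileDataSource"/>
      <document>
        <entity name="pickupdir"
                processor="FileListEntityProcessor"
                rootEntity="false"
                fileName=".*xml"
                baseDir="/abs/path/to/source/dir/for/import/"
                recursive="true"
                newerThan="${dataimporter.last_index_time}"
                dataSource="null">
          <entity name="xml"
                  processor="XPathEntityProcessor"
                  stream="false"
                  useSolrAddSchema="true"
                  url="${pickupdir.fileAbsolutePath}"
                  xsl="/abs/path/to/xslt/file/in/myCore/conf/transform.xsl"/>
        </entity>
      </document>
    </dataConfig>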
RE: rough maximum cores (shards) per machine?
Jack Krupansky [jack.krupan...@gmail.com] wrote: Don't confuse customers and tenants. Perhaps you could explain what you mean by multi-tenant in the context of Ian's setup? It is not clear to me what the distinction is in this case. - Toke Eskildsen
Re: rough maximum cores (shards) per machine?
Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting up schema.xml and adding the path in solrconfig.xml, I start Solr. I get this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help? Thanks. Regards.

On Tuesday, 24 March 2015 at 21:24, Jack Krupansky jack.krupan...@gmail.com wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system. Ever hear the old saying "Too many cooks spoil the stew"? -- Jack Krupansky On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: Don't confuse customers and tenants. Perhaps you could explain what you mean by multi-tenant in the context of Ian's setup? It is not clear to me what the distinction is in this case. - Toke Eskildsen
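A note on the exception itself: a ClassCastException thrown from Class.asSubclass during plugin loading usually means the factory class does not extend the org.apache.lucene.analysis.util.TokenizerFactory that Solr loaded, most often because it was compiled against a different Lucene version than the one Solr is running. A minimal factory shape for Lucene/Solr 5.x looks roughly like this (on 4.x, create also takes a Reader argument); SentenceTokenizer stands in for the book's custom Tokenizer:

    import java.util.Map;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.util.TokenizerFactory;
    import org.apache.lucene.util.AttributeFactory;

    public class SentenceTokenizerFactory extends TokenizerFactory {
      public SentenceTokenizerFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
          throw new IllegalArgumentException("Unknown parameters: " + args);
        }
      }

      @Override
      public Tokenizer create(AttributeFactory factory) {
        // SentenceTokenizer is the custom Tokenizer from the book.
        return new SentenceTokenizer(factory);
      }
    }

If the class already looks like this, rebuilding it against the exact Lucene jars shipped with the running Solr is the usual fix.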
RE: rough maximum cores (shards) per machine?
Jack Krupansky [jack.krupan...@gmail.com] wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. It was my understanding that Ian used them interchangeably, but of course Ian is the only one who knows. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Thank you. Now your post makes a lot more sense. I will not argue against that. - Toke Eskildsen
Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor
Very interesting. Thanks, Shawn. Here is what the config file looks like in the Solr admin console: https://www.dropbox.com/s/qtfclbvs8oze7lp/Bildschirmfoto%202015-03-24%20um%2021.11.12.png?dl=0 No problems with quotes here. It might have been Apple Mail that converted them. Cheers, Martin On 24.03.2015 at 20:59, Shawn Heisey apa...@elyograg.org wrote: On 3/24/2015 1:41 PM, Martin Wunderlich wrote: The file was created in a text editor. I am not sure which quotes you are referring to. They look fine to me and the XML file validates alright. Could you perhaps be more specific? This partial screenshot is your email to the list showing your dataconfig, as I see it in Thunderbird, with the unusual quote characters clearly indicated: https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0 Thanks, Shawn
Re: rough maximum cores (shards) per machine?
I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear. For me, it's a question of who has control over the config and schema and collection creation. Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system. Ever hear the old saying Too many cooks spoil the stew? -- Jack Krupansky On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: Don't confuse customers and tenants. Perhaps you could explain what you mean by multi-tenant in the context of Ian's setup? It is not clear to me what the distinction is in this case. - Toke Eskildsen
Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor
On 3/24/2015 1:41 PM, Martin Wunderlich wrote: The file was created in a text editor. I am not sure which quotes you are referring to. They look fine to me and the XML file valides alright. Could you perhaps be more specific? This partial screenshot is your email to the list showing your dataconfig, as I see it in Thunderbird, with the unusual quote characters clearly indicated: https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0 Thanks, Shawn
Re: rough maximum cores (shards) per machine?
Multi-tenancy is a bad idea for a single solr Cluster. Better to give each tenant a separate Solr instance that you spin up and spin down based on demand. Think about it: If there are a small number of tenants, just giving each their own machine will be cheaper than the effort spent managing a multi-tenant cluster, and if there are a large number of tenants of even a moderate number of large tenants, you can't expect them to all run reasonably on a relatively small cluster. Think about scalability. -- Jack Krupansky On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote: Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases, these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives but I realize this is heavily app-specific). So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter. Does that make sense? - Ian On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com wrote: Shards per collection, or across all collections on the node? It will all depend on: 1. Your ingestion/indexing rate. High, medium or low? 2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism. 3. How many collections you will have per node. In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability? Also be careful to be clear about using the Solr term shard (a slice, across all replica nodes) as distinct from the Elasticsearch term shard (a single slice of an index for a single replica, analogous to a Solr core.) -- Jack Krupansky On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful: * People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems. * I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it. Thanks! - Ian
Re: rough maximum cores (shards) per machine?
Sorry Jack. That doesn't scale when you have millions of customers. And these are good problems to have!

On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Multi-tenancy is a bad idea for a single Solr cluster. Better to give each tenant a separate Solr instance that you spin up and spin down based on demand. Think about it: if there are a small number of tenants, just giving each their own machine will be cheaper than the effort spent managing a multi-tenant cluster, and if there are a large number of tenants, or even a moderate number of large tenants, you can't expect them all to run reasonably on a relatively small cluster. Think about scalability.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose ianr...@fullstory.com wrote:

Let me give a bit of background. Our Solr cluster is multi-tenant, where we use one collection for each of our customers. In many cases these customers are very tiny, so their collection consists of just a single shard on a single Solr node. In fact, a non-trivial number of them are totally empty (e.g. trial customers that never did anything with their trial account). However, there are also some customers that are larger, requiring their collection to be sharded. Our strategy is to try to keep the total documents in any one shard under 20 million (honestly not sure where my coworker got that number from - I am open to alternatives, but I realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but just the sheer number of cores. For example, if I have 10 active cores on a machine and everything is working fine, should I expect that everything will still work fine if I add 10 nearly-idle cores to that machine? What about 100? 1000? I figure the overhead of each core is probably fairly low but at some point starts to matter. Does that make sense?

- Ian

On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Shards per collection, or across all collections on the node? It will all depend on:

1. Your ingestion/indexing rate. High, medium, or low?
2. Your query access pattern. Note that a typical query fans out to all shards, so having more shards than CPU cores means less parallelism.
3. How many collections you will have per node.

In short, it depends on what you want to achieve, not some limit of Solr per se. Why are you even sharding the node anyway? Why not just run with a single shard per node, and do sharding by having separate nodes, to maximize parallel processing and availability?

Also be careful to be clear about using the Solr term "shard" (a slice, across all replica nodes) as distinct from the Elasticsearch term "shard" (a single slice of an index for a single replica, analogous to a Solr core).

-- Jack Krupansky

On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose ianr...@fullstory.com wrote:

Hi all - I'm sure this topic has been covered before, but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on the size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems.
* I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian

-- Regards, Shalin Shekhar Mangar.
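For reference, a collection-per-tenant scheme like the one Ian describes is usually driven through the Collections API. A minimal sketch, assuming a hypothetical tenant id (tenant_123) and an already-uploaded configset (tenant_conf); the host, names, and shard counts are illustrative only:

    # Tiny tenant: a single shard is plenty (hypothetical names).
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=tenant_123&numShards=1&replicationFactor=1&collection.configName=tenant_conf"

    # Larger tenant: more shards, aiming to keep each shard under the
    # ~20M-document target mentioned above.
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=tenant_456&numShards=4&replicationFactor=1&collection.configName=tenant_conf"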
maxReplicasPerNode
Hi, I saw that we can define maxShardsPerNode when creating a collection, but I don't see that I can set something similar for replicas. My scenario is the following:

- I set up one Solr node
- Create a collection with numShards=1 and replicationFactor=2
- Hopefully, one replica is created on that node
- When I bring up the second Solr node, the second replica will be created

What I see is that both replicas are created on the first node, and when I bring up the second Solr node, none of the replicas are moved. I know that I can move one replica by calling ADDREPLICA on node2, then DELETEREPLICA on node1, but I was wondering if there's an automated way to do that. I've also considered creating the collection with replicationFactor=1, so that when the second node comes up it will look for shards with one replica only and assign itself as the replica. But that means I have to own that piece of logic; if Solr already does that, that's better.

Also, from what I understand, if I create a collection with rf=2 and there are two nodes, then each node is assigned a replica. If one of the nodes goes down, and a 3rd node comes up, it will be assigned a replica -- is that correct? Another related question: if there are two replicas on node1 and node2, and node2 goes down -- will node1 be assigned the second replica as well? If this is explained somewhere, I'd appreciate a pointer.

Shai
Re: maxReplicasPerNode
Hi Shai,

As of now, all replicas for a collection are created to meet the specified replication factor at the time of collection creation. There's no way to defer that until more nodes are up. Your best bet is to have the nodes already up before you CREATE the collection, or to create the collection with a lower replication factor and then use ADDREPLICA.

About auto-addition of replicas: that's kind of supported when using a shared file system (HDFS) to host the index. It doesn't truly work for your use case, i.e. it doesn't consider the intended replication factor, but only brings up a replica in case all replicas for a node are down, so that SolrCloud continues to be usable. It also doesn't auto-remove the replica when the old node comes back up. You can read more about this in the "Automatically Add Replicas in SolrCloud" section here: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

About #3, in line with my answer to the previous question, Solr wouldn't auto-add a replica to meet the replication factor when a node goes down.

-- Anshum Gupta
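Concretely, the "lower replication factor, then ADDREPLICA" route Anshum suggests might look like the sketch below; the collection name, shard name, and node addresses are all hypothetical:

    # Create the collection on node1 with a single replica for now.
    curl "http://node1:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=1&replicationFactor=1"

    # Once node2 has joined the cluster, add the second replica there
    # explicitly (node names take the host:port_context form).
    curl "http://node1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=node2:8983_solr"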
Problem with Terms Query Parser
Hi, I'm trying to use the Terms Query Parser for one of my use cases, where I apply an implicit filter on a bunch of sources. When I try to run the following query:

fq={!terms f=Source}help,documentation,sfdc

I get the following error:

<lst name="error"><str name="msg">Unknown query parser 'terms'</str><int name="code">400</int></lst>

What am I missing here? I'm using Solr 5.0. Any pointers will be appreciated.

Regards,
Shamik
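For reference, a full HTTP request carrying that filter would look like the sketch below, with the local-params syntax URL-encoded; the host and collection name are hypothetical:

    # Encoding of {!terms f=Source}:  { = %7B, ! = %21, space = %20, = = %3D, } = %7D
    curl "http://localhost:8983/solr/mycollection/select?q=*:*&fq=%7B%21terms%20f%3DSource%7Dhelp,documentation,sfdc"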