TermsComponent/SolrCloud
Does anyone know if the TermsComponent supports distributed search through a SolrCloud installation? I have a SolrCloud installation that works OK for regular searches, but the TermsComponent returns empty results when using [collectionName]/terms?terms.fl=collector_name&terms.prefix=jo. The request handler configuration is:

  <!-- A request handler for demonstrating the terms component -->
  <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <bool name="distrib">true</bool>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>
Re: SolrCloud and external Zookeeper ensemble
Hello,

I've been dealing with the same question these days. In architecture terms it's always better to separate services (Solr and Zookeeper, in this case) than to keep them in a single instance. However, when we have to deal with cost issues, all of us are quite limited and we must choose the best option in terms of architecture, scalability and single points of failure. As I see it, the options are:

1. Solr servers with Zookeeper embedded.
2. Solr servers with an external Zookeeper.
3. Solr servers with an external Zookeeper ensemble.

Note: as far as I know, the recommended number of Zookeeper services to avoid single points of failure is ZkNum = 2 * NumShards - 1.

The best option is the third one. Reasons:

1. If one of your Solr servers goes down, the Zookeeper services stay up.
2. If one of your Zookeeper services goes down, the Solr servers and the rest of the Zookeeper services stay up.

Considering that option, we have two ways to implement it in production:

1. Each service (Solr and Zookeeper) on separate machines. Let's imagine that we have 2 shards for a given collection, so we need at least 4 Solr servers to complete the leader-replica configuration. The best option is to deploy them on four Amazon instances, one per server. We need at least 3 Zookeeper services in a Zookeeper ensemble configuration. The optimal way is to install them on separate machines (a micro instance will be fine for Zookeeper), so we will have 7 Amazon instances. The reason is that if one machine goes down (Solr or Zookeeper) the other services stay up and your production environment will be safe. For me this is the best case, but it's also the most expensive one, so in my case it is impossible to make real.

2. As we need at least 4 Solr servers and 3 Zookeeper services up, I would install three Amazon instances with Solr and Zookeeper, and one with only Solr. So we'll have 3 complete Amazon instances (Solr + Zookeeper) and 1 single Amazon instance (only Solr). If one of them goes down, the production environment will be safe. This architecture is not the best one, as I told you, but I think it is optimal in terms of robustness, single points of failure and costs.

It would be a pleasure to hear new suggestions from other people who have dealt with this kind of issue.

Regards,

- Luis Cappa.

2012/11/21 Marcin Rzewucki mrzewu...@gmail.com:
 Yes, I meant the same (not -zkRun). However, I was asking if it is safe to have zookeeper and solr processes running on the same node, or better on different machines?

On 21 November 2012 21:18, Rafał Kuć r@solr.pl wrote:
 Hello! As I said, I wouldn't use the Zookeeper that is embedded into Solr, but rather set up a standalone one.
 -- Regards, Rafał Kuć, Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 First of all: thank you for your answers. Yes, I meant a side-by-side configuration. I think the worst case for ZKs here is to lose two of them. However, I'm going to use 4 availability zones in the same region, so at least this will reduce the risk of losing both of them at the same time. Regards.

On 21 November 2012 17:06, Rafał Kuć r@solr.pl wrote:
 Hello! Zookeeper by itself is not demanding, but if something happens to the nodes that have Solr on them, you'll lose ZooKeeper too if you have them installed side by side. However, if you have 4 Solr nodes and 3 ZK instances you can get them running side by side.
 -- Regards, Rafał Kuć, Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

 Separate is generally nice because then you can restart Solr nodes without consideration for ZooKeeper. Performance-wise, I doubt it's a big deal either way. - Mark

On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com wrote:
 Hi, I have 4 solr collections, 2-3 million documents per collection, up to 100K updates per collection daily (roughly). I'm going to create a SolrCloud4x on Amazon's m1.large instances (7GB mem, 2x2.4GHz cpu each). The question is: what about zookeeper? It's going to be an external ensemble, but is it better to use the same nodes as solr or dedicated micro instances? Zookeeper does not seem to be a resource-demanding process, but what would be better in this case? To keep it inside of solrcloud or separately (micro instances seem to be enough here)? Thanks in advance. Regards.

--
- Luis Cappa
Re: SolrCloud and external Zookeeper ensemble
Yes, this is exactly my case. I prefer the 3rd option too. As I have 2 more instances to be used for my purposes (SolrCloud4x + 2 more instances for loading) it will be easier to configure the zookeeper ensemble (I can use those 2 additional machines + 1 from SolrCloud) and avoid purchasing and maintaining more instances.

On 22 November 2012 10:18, Luis Cappa Banda luisca...@gmail.com wrote:
 [...]
Re: TermsComponent/SolrCloud
Hi Federico, it should work. Make sure you set the shards.qt parameter too (in your case, it should be shards.qt=/terms).

On Thu, Nov 22, 2012 at 6:51 AM, Federico Méndez federic...@gmail.com wrote:
 [...]
Re: How to use eDismax query parser on a non tokenized field
You can either escape the whitespace with \ or search as a phrase:

  fieldNonTokenized:foo\ bar
  ...or...
  fieldNonTokenized:"foo bar"

On Thu, Nov 22, 2012 at 9:08 AM, Varun Thacker varunthacker1...@gmail.com wrote:
 I have indexed documents using a fieldType which does not break the word up. I confirmed this by looking at the index in Luke; I can see that the words haven't been tokenized. I use a search handler which uses the edismax query parser for searching. According to the wiki (http://wiki.apache.org/solr/ExtendedDisMax#Query_Structure), Extended DisMax breaks the query string up into words before searching, and thus no results show up. Example for q=foo bar: in the index there is fieldNonTokenized:foo bar, and when searching, the final query that gets built is ((fieldNonTokenized:foo)~0.01 (fieldNonTokenized:bar)~0.01)~1. Thus no document matches and no result is returned. I can understand why this is happening. Is there any way to say that the query string should not be broken up into words?
 -- Regards, Varun Thacker http://www.vthacker.in/
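When the query is built programmatically, SolrJ can do the escaping. A minimal sketch, assuming the fieldNonTokenized field from this thread and raw user input (ClientUtils.escapeQueryChars escapes whitespace along with the other query-syntax characters):

  import org.apache.solr.client.solrj.util.ClientUtils;

  // Escape query-syntax characters (including whitespace) in the user input,
  // so edismax sees a single term against the untokenized field.
  String userInput = "foo bar";
  String q = "fieldNonTokenized:" + ClientUtils.escapeQueryChars(userInput);
  // q is now: fieldNonTokenized:foo\ bar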
Re: TermsComponent/SolrCloud
Thanks Tomás, your suggestion worked!!

  <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <bool name="distrib">true</bool>
      <str name="shards.qt">/terms</str>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>

On Thu, Nov 22, 2012 at 11:59 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:
 Hi Federico, it should work. Make sure you set the shards.qt parameter too (in your case, it should be shards.qt=/terms).
 [...]
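For reference, shards.qt is an ordinary request parameter, so the same fix can also be applied per request instead of in the handler defaults. Assuming the field and prefix from this thread, the request would look like:

  [collectionName]/terms?terms.fl=collector_name&terms.prefix=jo&shards.qt=/terms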
Re: Suggester for numbers
Hello Illu,

Here you go:

  <field name='autocomplete' type='text_auto' indexed='true' stored='true' multiValued='true'/>

  <fieldType class='solr.TextField' name='text_auto'>
    <analyzer>
      <tokenizer class='solr.KeywordTokenizerFactory'/>
      <filter class='solr.LowerCaseFilterFactory'/>
    </analyzer>
  </fieldType>

  <field name='conteiner' type='text_general_like' indexed='true' stored='true' multiValued='false'/>

  <fieldType class='solr.TextField' name='text_general_like' positionIncrementGap='100'>
    <analyzer type='index'>
      <tokenizer class='solr.StandardTokenizerFactory'/>
      <filter class='solr.ISOLatin1AccentFilterFactory'/>
      <filter class='solr.LowerCaseFilterFactory'/>
      <filter class='solr.NGramFilterFactory' maxGramSize='25' minGramSize='1'/>
    </analyzer>
    <analyzer type='query'>
      <tokenizer class='solr.StandardTokenizerFactory'/>
      <filter class='solr.ISOLatin1AccentFilterFactory'/>
      <filter class='solr.LowerCaseFilterFactory'/>
    </analyzer>
  </fieldType>
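With this setup, number suggestions work because the index-time NGramFilter indexes every 1-25 character substring of each value, while the query analyzer leaves the input whole. A hypothetical lookup against the conteiner field above (host, port and value are assumptions):

  http://localhost:8983/solr/select?q=conteiner:0123&fl=conteiner

This would match any document whose container number contains the substring 0123.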
Re: SolrCloud and exernal file fields
Mikhail,

To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug 3985). But this wasn't good enough, because SOLR would still take very long to restart when that was necessary. I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more.

IMO it would be ideal if the lucene/solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the lucene/solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only.

/Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
 Martin, I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get in your setup - the way to explicitly assign a core to a collection. Now I understand most of the details! The ball is in your court: let us know whether you have managed to get your cores to commit one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware. Thanks in advance!

On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:
 Mikhail, PSB.

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
 On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:
  I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.
 You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ...

Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

 I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?

We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server. Would you like to see the solrconfig.xml?

/Martin

 Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

 I prefer to start from 16 hosts with 4 cores each. Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect.

Thanks,
/Martin

 Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:
 Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
 Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at the recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under the solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank you, guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote:
 Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
 Martin, thank you for telling your own war story. It's really useful for the community. The first question may seem naive, but would you tell me what blocks searching during an EFF reload, when it's triggered by
Performance improvement for solr faceting on large index
Hi All,

We are using solr 3.4 with the following schema fields:

  <fieldType name="autosuggest_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement="" replace="all"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="id" type="string" stored="true" indexed="true"/>
  <field name="autoSuggestContent" type="autosuggest_text" stored="true" indexed="true" multiValued="true"/>
  <copyField source="content" dest="autoSuggestContent"/>
  <copyField source="original_title" dest="autoSuggestContent"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="original_title" type="text" stored="true" indexed="true"/>
  <field name="site" type="site" stored="false" indexed="true"/>

The index on the above schema is distributed over two solr shards, each with about 1.2 million documents and an on-disk size of about 195GB per shard. We want to retrieve (site, autoSuggestContent term, frequency of the term) information from our main solr index. site is a field in the document and contains the name of the site to which that document belongs. The terms are retrieved from the multivalued field autoSuggestContent, which is created using shingles from the content and title of the web page.

As of now, we are using a facet query to retrieve (term, frequency of term) for each site. Below is a sample query (you may ignore the initial part of the query):

  http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index

The problem is that with the increase in index size, this method has started taking a huge amount of time. It used to take 7 minutes per site with an index of 0.4 million docs, but takes around 60-90 minutes with an index of 2.5 million. At this speed, it will take around 5-6 days to process all 1500 sites. Also, we expect the index to grow with more documents and more sites, so the time to get the above information will increase further.

Please let us know if there is a better way to extract (site, term, frequency) information compared to the current method.

Thanks,
Pravin Agrawal
Re: Performance improvement for solr faceting on large index
You could always try the fc facet method (facet.method=fc) and maybe increase the filterCache size.

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal pravin_agra...@persistent.co.in wrote:
 [...]
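For reference, the filterCache is sized in solrconfig.xml. A minimal sketch with illustrative values (the sizes here are assumptions to tune against your heap and query mix, not recommendations):

  <filterCache class="solr.FastLRUCache"
               size="4096"
               initialSize="1024"
               autowarmCount="512"/>

facet.method=fc can then be passed per request or set in the handler defaults; on high-cardinality fields it is usually much faster than enum for repeated faceting, at the cost of extra memory for the field cache.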
Re: SolrCloud and external Zookeeper ensemble
That's a tradeoff for you to make based on your own requirements, but the point is that it is LESS SAFE to run zookeeper on the same machine as a Solr instance. Also keep in mind that the goal is to have at least THREE zookeeper instances running at any moment, so if you run zookeeper on the same machine as a Solr instance, you will need more than three zookeepers. Figure three plus the MAXIMUM number of Solr nodes that you expect could be down simultaneously.

Also keep in mind that SolrCloud is about scaling, but the intention is NOT to scale the zookeeper ensemble linearly with the number of Solr nodes. That means you would have to deal with the messiness of sometimes running zookeeper with Solr and sometimes not. So, unless you are running a very small SolrCloud cluster, you are much better off keeping zookeeper off your Solr machines. The intent is that there will be a relatively small ensemble of zookeepers that services a large army or armada of Solr nodes.

-- Jack Krupansky

-----Original Message----- From: Marcin Rzewucki
Sent: Wednesday, November 21, 2012 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud and external Zookeeper ensemble
[...]
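For the quorum arithmetic behind this guidance: a ZooKeeper ensemble of N nodes keeps working only while a quorum of floor(N/2)+1 nodes is up, so it tolerates floor((N-1)/2) failures. Three nodes tolerate one failure, five tolerate two; an even count buys nothing extra (four nodes still tolerate only one failure), which is why ensembles are sized as 2F+1 for the F simultaneous failures you want to survive.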
Re: From Solr3.1 to SolrCloud
I run a separate Zookeeper instance right now. Works great; nodes are visible in admin. Two more questions:

- I change my synonyms.txt on a solr node. How can I get zookeeper in sync, and the other solr nodes, without a restart?
- I read some more about the zookeeper ensemble. When I need to run with 4 solr nodes (replicas), I need 3 zookeepers in the ensemble (50% live). When zookeeper and solr are separated it will take 7 servers to get it live. In the past we only needed 4 servers. Are there some other options, because the costs will grow? 3 zookeeper servers seems like overkill.

Thanks
Re: Partial results with not enough hits
Hi,

Maybe your goal should be to make your queries faster instead of fighting with timeouts, which are known not to work well. What is your hardware like? How about your queries? What do you see in debugQuery=true output?

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm

On Nov 21, 2012 6:04 PM, Aleksey Vorona avor...@ea.com wrote:
 In all of my queries I have the timeAllowed parameter. My application is ready for partial results. However, whenever Solr returns a partial result it is a very bad result. For example, I have a test query, and here is its execution log with the strict time allowed:

  WARNING: Query: omitted; Elapsed time: 120. Exceeded allowed search time: 100 ms.
  INFO: [] webapp=/solr path=/select params={omitted&timeAllowed=100} hits=189 status=0 QTime=119

 Here it is without such a strict limitation:

  INFO: [] webapp=/solr path=/select params={omitted&timeAllowed=1} hits=582 status=0 QTime=124

 The total execution time differs by a mere 5 ms, but the partial result has only about 1/3 of the full result. Is this the expected behaviour? Does that mean I can never rely on partial results? I added timeAllowed to protect against overly expensive wide queries, but I still want to return something relevant to the user. This query returned 30% of the full result, but I have other queries in the log where the partial result is just empty. Am I doing something wrong?
 P.S. I am using Solr 3.6.1, the index size is 3Gb and easily fits in memory. Load average on the Solr box is very low.
 -- Aleksey
Re: SolrCloud and external Zookeeper ensemble
If your Solr instances don't max out your ec2 instances you should be fine. But maybe even micro instances will suffice. Or 1 on-demand and 2 spot ones, if cost is the concern.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm

On Nov 21, 2012 5:07 PM, Marcin Rzewucki mrzewu...@gmail.com wrote:
 [...]
Re: SolrCloud and exernal file fields
On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote:
 around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large. Is there a reason why you need so many shards (16 in your case) on a single box?

-Yonik
http://lucidworks.com
Re: SolrCloud and external Zookeeper ensemble
That is an interesting point - what size of instance is needed for a zookeeper? Can it run well in a micro?

Another issue I wanted to raise is that maybe questions, advice, and guidelines should be relative to the shirt size of your cluster - small, medium, or large. SolrCloud is clearly more optimized for medium to large clusters. Sure, you can use it for small clusters, but then some of the features and guidance do seem like overkill. Nonetheless, I would hate to see anybody take the compromised guidance for very small clusters (3 or 4 machines) and apply it to even medium-size clusters (10 to 20 machines), let alone large clusters (dozens to 100 or more machines).

-- Jack Krupansky

-----Original Message----- From: Otis Gospodnetic
Sent: Thursday, November 22, 2012 9:37 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud and external Zookeeper ensemble
[...]
Re: SolrCloud and external Zookeeper ensemble
On 11/22/2012 2:18 AM, Luis Cappa Banda wrote:
 I've been dealing with the same question these days. In architecture terms it's always better to separate services (Solr and Zookeeper, in this case) than to keep them in a single instance. However, when we have to deal with cost issues, all of us are quite limited and we must choose the best architecture/scalability/single-point-of-failure option. As I see it, the options are:
 1. Solr servers with Zookeeper embedded.
 2. Solr servers with external Zookeeper.
 3. Solr servers with external Zookeeper ensemble.

I've never used SolrCloud, so this is all speculation based on what I've been reading. That has been mostly on this list, but also on dev@l.o and the IRC channel.

I have a four-node Solr 3.5 deployment with about 80 million documents (130GB) in the distributed index. I think of my installation as small. Others might disagree with my opinion, but I know there are a lot of indexes out there that make mine look tiny.

If I needed to set up a similarly small setup with SolrCloud on four Solr servers, what I would pitch to management would be one extra machine (cheap, 1U, low-end processor, etc.) to act as a standalone zookeeper node. For the other two zookeeper instances, I would run standalone zookeeper (in a separate JVM from Solr) on two of the Solr servers. I might ask for a small boost in RAM and/or CPU on the two servers that serve double duty. I would not run zookeeper in the same JVM as Solr.

With a little bit of growth in the cluster, I would ask for a second standalone zookeeper node, pulling zookeeper off one of the Solr servers. If it continued to grow, then I would ask for the third. I would leave blank spots in the rack for those standalone servers.

Thanks,
Shawn
Re: How to get a list of servers per collection in solrcloud using java api?
Hello, Joe.

Try something like this using the SolrJ library:

  String[] endpoints = ...;          // your Solr server endpoints, e.g. http://localhost:8080/solr/core1
  String zookeeperEndpoints = ...;   // your Zookeeper endpoints, e.g. localhost:9000
  String collectionName = ...;       // your collection name, e.g. core1

  LBHttpSolrServer lbSolrServer = new LBHttpSolrServer(endpoints);
  this.cloudSolrServer = new CloudSolrServer(zookeeperEndpoints, lbSolrServer);
  this.cloudSolrServer.setDefaultCollection(collectionName);

You have now created a CloudSolrServer instance which can manage Solr server operations: add a new document, delete, update, etc.

Regards,

- Luis Cappa.

2012/11/22 joe.cohe...@gmail.com joe.cohe...@gmail.com:
 I want to write a function that will go through all the servers that store a specific collection and perform a task on each, suppose a RELOAD CORE task. How can I get a list of all solr servers/urls that run a specific collection?

--
- Luis Cappa
Re: From Solr3.1 to SolrCloud
- I change my synonyms.txt on a solr node. How can I get zookeeper in sync, and the other solr nodes, without a restart?

Well, you can upload the whole collection configuration again with zkClient (included in the cloud-scripts directory). See http://wiki.apache.org/solr/SolrCloud#Getting_your_Configuration_Files_into_ZooKeeper
Another option, if you only want to upload one file, is to write something that communicates with zk through any of its APIs. I did this before Solr's zkClient was committed and it is quite simple. Then you can reload the collection, which is like reloading all the cores for the collection on the different nodes. See http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API

- I read some more about the zookeeper ensemble. When I need to run with 4 solr nodes (replicas), I need 3 zookeepers in the ensemble (50% live). When zookeeper and solr are separated it will take 7 servers to get it live. In the past we only needed 4 servers. Are there some other options, because the costs will grow? 3 zookeeper servers seems like overkill.

The number of Solr instances has nothing to do with the number of ZK instances that you need to run. You can effectively run with only one zk instance; the problem is that if that instance dies, your whole cluster will go down. So you can increase the number of zk instances. When you create your Zookeeper ensemble, you declare its size (the number of zk instances it will contain). When you run that ensemble, Zookeeper requires that N/2+1 of the servers are connected. This means that if you want your zk ensemble to survive one instance dying, you'll need at least 3 ZK instances (if you have 2 and one dies, you still need 2 to work, so it won't). There have been some discussions on the list these days about this, but if the number of physical servers is too much for you, you could run an instance of Solr and an instance of ZK on the same physical machine.

Tomás
Re: Solr Cloud Zookeeper Namespace
You could use Zookeeper's chroot: http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices

You can use a chroot in Solr by specifying it in the zkHost parameter, for example:

  -DzkHost=localhost:2181/namespace1

In order for this to work, you need to first create the initial path (in the example above, you should create /namespace1 in zookeeper before starting Solr).

Tomás

On Thu, Nov 22, 2012 at 2:08 PM, Sandopolus sandopo...@gmail.com wrote:
 Is it possible with Solr Cloud 4.0 to specify a namespace for zookeeper so that you can run completely isolated Solr Cloud clusters? The collection.configName property puts collection-specific items into sub-nodes, but certain things are still shared and sit in the root directory in Zookeeper, like clusterstate.json. What I am looking for is a property which allows me to prepend a namespace to all nodes that Solr Cloud inserts in Zookeeper. Does anyone know if this exists?
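One way to create that initial path is with ZooKeeper's own command-line shell; a sketch assuming the localhost address from the example (the create syntax varies slightly across ZooKeeper versions, and some require an explicit data argument):

  # run from the ZooKeeper installation directory
  bin/zkCli.sh -server localhost:2181
  create /namespace1 ""

Solr's org.apache.solr.cloud.ZkCLI also ships a makepath command that should achieve the same thing.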
Re: How to get a list of servers per collection in solrcloud using java api?
Hello,

As far as I know, you cannot do that at the moment. :-/

Regards,

- Luis Cappa.

2012/11/22 joe.cohe...@gmail.com joe.cohe...@gmail.com:
 Thanks Rakudten. I had my question mis-phrased. What I need is to get the solr servers storing a collection by giving the zookeeper server as an input. Something like:

  // returns a list of solr servers in the zookeeper ensemble that store the given collection
  List<String> getServers(String zkhost, String collectionName)

--
- Luis Cappa
Re: is there a way to prevent abusing rows parameter
Thanks guys. This is a problem with the front end not validating requests. I was hoping there might be a simple config value I could enter/change, rather than going through the long process of migrating a proper fix all the way up to our production servers. Looks like not, but thx.
Re: Partial results with not enough hits
Thank you! That seems to be the case. I tried to execute queries without sorting and with only one document in the response, and I got execution times in the same range as before.

-- Aleksey

On 12-11-21 04:07 PM, Jack Krupansky wrote:
 It could be that the time to get set up to return even the first result is high, and then each additional document is a minimal increment in time. Do a query with rows=1 (or even 0) and see what the minimum query time is for your query, index, and environment.
 -- Jack Krupansky

 -----Original Message----- From: Aleksey Vorona
 Sent: Wednesday, November 21, 2012 6:04 PM
 To: solr-user@lucene.apache.org
 Subject: Partial results with not enough hits
 [...]
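A quick way to run that check, assuming a local Solr instance (URL and query are placeholders):

  curl 'http://localhost:8983/solr/select?q=<your query>&rows=0&debugQuery=true'

With rows=0 Solr still parses, scores and sorts the full result set but fetches no documents, so the reported QTime approximates the fixed setup cost Jack describes.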
Re: Partial results with not enough hits
Thanks for the response. I have increased the timeout and it did not increase execution time or system load. It is really that I misused the timeout.

Just to give you a bit of perspective, we added the timeout to guarantee some level of QoS from the search engine. Our UI allows the user to construct very complex queries and (what is worse) the user doesn't always really understand what she needs. That may become a problem if we have lots of users doing it. In such a case I do not want to run a complex query for seconds; I want to return some result with a warning to the user that she is doing something wrong. But clearly I set the timeout too low for that and started to harm even normal queries.

Anyway, thanks everyone for the replies. The issue is fixed and I now understand much better how the timeout works (which was the reason to post to this list). Thanks!

-- Aleksey

On 12-11-22 06:37 AM, Otis Gospodnetic wrote:
 [...]
SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hello everyone.

I've started to seriously worry about SolrCloud due to a strange behavior that I have detected. The situation is the following:

1. SolrCloud with one shard and two Solr instances.
2. Indexation via SolrJ with CloudServer and a custom BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute atomic updates correctly. Check SOLR-4080: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055
3. An asynchronous process partially updates some document fields. After that operation I automatically execute a commit, so the index must be reloaded.

What I have checked is that, both with atomic updates and with complete document reindexations, random documents are not updated, even though while debugging I saw the add() and commit() operations execute correctly and without errors.

Has anyone experienced a similar behavior? Is it possible that if an index update operation doesn't finish and CloudSolrServer receives a new one, the second update operation doesn't complete?

Thank you in advance. Regards,

--
- Luis Cappa
Re: How to get a list of servers per collection in sorlcloud using java api?
On Thu, Nov 22, 2012 at 7:20 PM, joe.cohe...@gmail.com joe.cohe...@gmail.com wrote:
 Thanks Rakudten. I had my question mis-phrased. What I need is to get the solr servers storing a collection by giving the zookeeper server as an input. Something like:

  // returns a list of solr servers in the zookeeper ensemble that store the given collection
  List<String> getServers(String zkhost, String collectionName)

You can use ZkStateReader (#getClusterState) to get this info.

--
Sami Siren
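A rough sketch of how that could look against the SolrJ 4.x API (class and method names shifted between 4.x releases, so treat this as an outline under those assumptions rather than exact code):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.cloud.ClusterState;
  import org.apache.solr.common.cloud.Replica;
  import org.apache.solr.common.cloud.Slice;
  import org.apache.solr.common.cloud.ZkStateReader;

  // Returns the base URLs of all Solr nodes hosting a replica of the given collection.
  public static List<String> getServers(String zkHost, String collectionName) throws Exception {
      CloudSolrServer server = new CloudSolrServer(zkHost);
      try {
          server.connect(); // reads the cluster state from Zookeeper
          ClusterState state = server.getZkStateReader().getClusterState();
          List<String> urls = new ArrayList<String>();
          for (Slice slice : state.getSlices(collectionName)) {  // one Slice per shard
              for (Replica replica : slice.getReplicas()) {      // one Replica per hosting core
                  urls.add(replica.getStr(ZkStateReader.BASE_URL_PROP));
              }
          }
          return urls;
      } finally {
          server.shutdown();
      }
  }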
Reloading config to zookeeper
When we make changes to our config files, how do we reload the files into zookeeper? Also, I understand that we would need to reload the collection; would we need to do this at a per-shard level or just at the cloud level?

Regards,
Ayush
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
I think the problem is that even though you were able to work around the bug in the client, Solr still uses the XML format internally, so the atomic update (with a multivalued field) fails further down the stack. The bug you filed needs to be fixed to get the problem solved.

On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.com wrote:
 [...]
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hi, Sami!

But isn't it strange that some documents were updated (atomic updates) correctly and others not? Couldn't it be a more serious problem, like some kind of index writer lock, or whatever?

Regards,

- Luis Cappa.

2012/11/22 Sami Siren ssi...@gmail.com:
 [...]
Re: Reloading config to zookeeper
Hi,

I'm using the cloud-scripts/zkcli.sh script for reloading configuration, for example:

  $ ./cloud-scripts/zkcli.sh -cmd upconfig -confdir <config.dir> -solrhome <solr.home> -confname <config.name> -z <zookeeper.host>

Then I reload the collection on each node in the cloud, but maybe someone knows a better solution.

Regards.

On 22 November 2012 19:23, Cool Techi cooltec...@outlook.com wrote:
 [...]
RE: Reloading config to zookeeper
Thanks, but why do we need to specify -solrhome? I am using the following command to load a new config:

  java -classpath .:/Users/solr-cli-lib/* org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185 -confdir /Users/config-files -confname myconf

So basically reloading is just uploading the configs back again?

Regards,
Ayush

Date: Thu, 22 Nov 2012 19:32:27 +0100
Subject: Re: Reloading config to zookeeper
From: mrzewu...@gmail.com
To: solr-user@lucene.apache.org
 [...]
Re: Reloading config to zookeeper
I think solrhome is not mandatory. Yes, reloading is uploading the config dir again. It's a pity we can't upload just the modified files.

Regards.

On 22 November 2012 19:38, Cool Techi cooltec...@outlook.com wrote:
 [...]
Re: Reload core via CoreAdminRequest doesnt work with solr cloud? (solrj)
If you need to reload all the cores from a given collection you can use the Collections API:

  http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection

On Thu, Nov 22, 2012 at 3:17 PM, joe.cohe...@gmail.com joe.cohe...@gmail.com wrote:
 Hi, I'm using solr-4.0.0 and trying to reload all the cores of a given collection in my solr cloud. I use it like this:

  CloudSolrServer server = new CloudSolrServer("zkserver:port");
  server.setDefaultCollection("collection1");
  CoreAdminRequest req = new CoreAdminRequest();
  req.reloadCore("collection1", server);

 This throws an exception telling me that no live solr servers are available, listing the servers like this: http://server/solr/collection1. Of course, doing other tasks like adding documents to the CloudSolrServer above works fine. Using reloadCore on an HttpSolrServer also works fine. Any known issue with CloudSolrServer and CoreAdminRequest?
 Note that I moved to solr-4.0.0 from solr-4.0.0-beta after trying the same thing also failed, but with a different exception: it failed saying "cannot cast string to map" in class ClusterState, in the load() method (line 300), because the key "range" gave some String value instead of a map object.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
It might even depend on the cluster layout! Say you have 2 shards (no replicas): if the doc belongs to the node you send it to, so that it does not get forwarded to another node, then the update should work; when the doc gets forwarded to another node, the problem occurs. With replicas it could appear even stranger: the leader might have the doc right and the replica not. I only briefly looked at the bits that deal with this, so perhaps there's something more involved. On Thu, Nov 22, 2012 at 8:29 PM, Luis Cappa Banda luisca...@gmail.com wrote: Hi, Sami! But isn't it strange that some documents were updated (atomic updates) correctly and other ones not? Couldn't it be a more serious problem, like some kind of index writer lock, or whatever? Regards, - Luis Cappa. 2012/11/22 Sami Siren ssi...@gmail.com I think the problem is that even though you were able to work around the bug in the client, Solr still uses the XML format internally, so the atomic update (with a multivalued field) fails later down the stack. The bug you filed needs to be fixed to get the problem solved. On Thu, Nov 22, 2012 at 8:19 PM, Luis Cappa Banda luisca...@gmail.com wrote: Hello everyone. I've started to seriously worry about SolrCloud due to a strange behavior that I have detected. The situation is the following: *1.* SolrCloud with one shard and two Solr instances. *2.* Indexing via SolrJ with CloudSolrServer and a custom BinaryLBHttpSolrServer that uses BinaryRequestWriter to execute atomic updates correctly. See SOLR-4080: https://issues.apache.org/jira/browse/SOLR-4080?focusedCommentId=13498055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13498055 *3.* An asynchronous process partially updates some document fields. After that operation I automatically execute a commit, so the index must be reloaded. What I have checked is that, with both atomic updates and complete document reindexation, *random documents are not updated*, *even though I saw while debugging that the add() and commit() operations executed correctly* *and without errors*. Has anyone experienced a similar behavior? Is it possible that if an index update operation didn't finish and CloudSolrServer receives a new one, the second update operation doesn't complete? Thank you in advance. Regards, -- - Luis Cappa -- - Luis Cappa
upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum mismatch in segments file
hi all, I have been working on moving us from 4.0 to a newer build of 4.1. I am seeing a CorruptIndexException: checksum mismatch in segments file error when I try to use the existing index files. I did see something in the build log for #119 re LUCENE-4446 that mentions flipping file formats to point to the 4.1 format. Do I just need to reindex, or is this some other issue (i.e. do I need to configure something differently)? Or should I move back a few builds? Note, we are currently using: solr-spec 4.0.0.2012.04.05.15.05.52, solr-impl 4.0-SNAPSHOT 1310094M - - 2012-04-05 15:05:52, lucene-spec 4.0-SNAPSHOT, lucene-impl 4.0-SNAPSHOT 1309921 - - 2012-04-05 10:25:27, and are considering moving to: solr-spec 4.1.0.2012.11.03.18.08.42, solr-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:08:42, lucene-spec 4.1-2012-11-03_18-05-49, lucene-impl 4.1-2012-11-03_18-05-49 1405392 - hudson - 2012-11-03 18:06:50 (aka apache-solr-4.1-2012-11-03_18-05-49)
Re: Error: _version_ field must exist in schema
On Wed, Oct 17, 2012 at 3:20 PM, Dotan Cohen dotanco...@gmail.com wrote: I do have a Solr 4 Beta index running on Websolr that does not have such a field. It works, but throws many Service Unavailable and Communication Error errors. Might the lack of the _version_ field be the reason? Belated reply, but this is probably something you should let us know about directly at supp...@onemorecloud.com if it happens again. Cheers. -- Nick Zadrozny, Cofounder, One More Cloud. websolr.com • bonsai.io. Hassle-free hosted full-text search, powered by Apache Solr and ElasticSearch.
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
Hello! I'm using a simple test configuration with nShards=1 and no replicas. CloudSolrServer is supposed to forward those index/update operations properly, isn't it? I tested with complete document reindexation, not atomic updates, using the official LBHttpSolrServer, not my custom BinaryLBHttpSolrServer, and it doesn't work. I think this is not just a bug related to atomic updates via CloudSolrServer, but a general bug when an index changes frequently with reindexations/updates. Regards, - Luis Cappa. 2012/11/22 Sami Siren ssi...@gmail.com It might even depend on the cluster layout! [...] -- - Luis Cappa
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
For more details, my indexing app is: 1. Multithreaded. 2. NRT indexing. 3. A web app with a REST API. It receives asynchronous requests that produce those atomic updates / document reindexations I mentioned before. I'm pretty sure that the wrong behavior is related to CloudSolrServer and to the fact that you may be trying to modify the index while another index update is in progress. Regards, - Luis Cappa. 2012/11/22 Luis Cappa Banda luisca...@gmail.com Hello! I'm using a simple test configuration with nShards=1 and no replicas. [...] -- - Luis Cappa
Re: SolrCloud: Very strange behavior when doing atomic updates or documents reindexation.
More info: - I'm trying to update the document by re-indexing the whole document again. I first retrieve the document querying by its id, then delete it by its id, and re-index it including the new changes. - At the same time there are other index writing operations. *RESULT*: in most cases the document wasn't updated. Bad news... it smells like a critical bug. Regards, - Luis Cappa. 2012/11/22 Luis Cappa Banda luisca...@gmail.com For more details, my indexing app is: [...] -- - Luis Cappa
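As background for this thread, here is what a minimal SolrJ atomic update looks like in Solr 4.0: the new value is wrapped in a map keyed by a modifier such as "set", "add" or "inc". The id and field names below are illustrative only, and this sketch does not reproduce Luis's custom BinaryLBHttpSolrServer workaround from SOLR-4080:

import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        // "set" replaces the current value; "add" appends; "inc" increments.
        doc.addField("title", Collections.singletonMap("set", "new title"));

        server.add(doc);
        server.commit();
        server.shutdown();
    }
}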
Re: upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum mismatch in segments file
Moving from the final release of 4.0 to 4.1 should be fine, but you appear to be using a snapshot of 4.0 that is even older than the ALPHA release of 4.0, and a number of format changes occurred last spring. So, yeah, you will have to re-index. -- Jack Krupansky -Original Message- From: solr-user Sent: Thursday, November 22, 2012 2:03 PM To: solr-user@lucene.apache.org Subject: upgrading from 4.0 to 4.1 causes CorruptIndexException: checksum mismatch in segments file hi all, I have been working on moving us from 4.0 to a newer build of 4.1. [...]
Re: Find the matched field in each matched document
No, not directly, but indirectly you can: add debugQuery=true to your request and the explain section will detail which terms matched in which fields. You could probably also implement a custom search component which annotates each document with the matched field names. In that sense, Solr CAN do it. -- Jack Krupansky -Original Message- From: Alireza Salimi Sent: Thursday, November 22, 2012 6:11 PM To: solr-user@lucene.apache.org Subject: Re: Find the matched field in each matched document Maybe I should say it in a different way: given documents like the above, I want to know what "Robert De Niro" is. Is it an actor or a movie title? If you can just tell me whether Solr can do it or not, that will be enough. Thanks On Thu, Nov 22, 2012 at 1:57 PM, Alireza Salimi alireza.sal...@gmail.com wrote: Hi, I apologize if I'm asking a duplicate question, but I haven't found any good answer for my problem. My question is: how can I find out which fields matched the search criteria when I search over multiple fields? Assume I have documents like this: {title: Robert De Niro, actors: []} {title: ronin, actors: [robert de niro, jean reno]} {title: casino, actors: [robert de niro, Joe Pesci]} Here is the schema:

<field name="actors" indexed="true" multiValued="true" stored="true" termPositions="true" termOffsets="true" termVectors="true" type="text_general"/>
<field name="title" indexed="true" multiValued="false" stored="true" type="text_general"/>

Now after searching for "robert de niro" in both title and actors, I will have some matches, but my question is: how can I find out what "robert de niro" is? Is he an actor or a movie title? Thanks in advance -- Alireza Salimi Java EE Developer -- Alireza Salimi Java EE Developer
Re: Find the matched field in each matched document
Hi Jack, thanks for the reply. I'm not sure about the debug component; I thought it slows down query time. Can you explain more about the custom search component? Thanks On Thu, Nov 22, 2012 at 7:02 PM, Jack Krupansky j...@basetechnology.com wrote: No, not directly, but indirectly you can: add debugQuery=true to your request and the explain section will detail which terms matched in which fields. You could probably also implement a custom search component which annotates each document with the matched field names. In that sense, Solr CAN do it. [...] -- Alireza Salimi Java EE Developer
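A rough sketch of Jack's first suggestion in SolrJ terms: run the query with debug enabled and inspect the per-document explain text to see which field produced the match. The explain parsing below is ad hoc, and the field names come from the schema quoted in this thread:

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MatchedFields {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("\"robert de niro\"");
        q.set("defType", "edismax");
        q.set("qf", "title actors");
        q.set("debugQuery", "true");

        QueryResponse rsp = server.query(q);
        // getExplainMap() returns doc id -> explain text from the debug section.
        for (Map.Entry<String, String> e : rsp.getExplainMap().entrySet()) {
            String explain = e.getValue();
            // Crude check: the explain text names the field of each matching term.
            boolean inTitle = explain.contains("title:");
            boolean inActors = explain.contains("actors:");
            System.out.println(e.getKey() + " title=" + inTitle + " actors=" + inActors);
        }
        server.shutdown();
    }
}

Note that debugQuery does add query-time cost, so this suits analysis rather than every production request; the custom search component Jack mentions would be the production-grade route.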
Re: Performance improvement for solr faceting on large index
Hi, I don't quite follow what you are trying to do, but it almost sounds like you may be better off using something other than Solr if all you are doing is filtering by site and counting something. I see unigrams in what looks like it could be a big field, and that's a red flag. Your index is quite big - how much memory have you got? Do those queries produce a lot of disk IO? I have a feeling they do. If so, your shards may be too large for your hardware. Otis -- SOLR Performance Monitoring - http://sematext.com/spm On Nov 22, 2012 7:53 AM, Pravin Agrawal pravin_agra...@persistent.co.in wrote: Hi All, We are using Solr 3.4 with the following schema fields:

<fieldType name="autosuggest_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^([0-9. ])*$" replacement="" replace="all"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="id" type="string" stored="true" indexed="true"/>
<field name="autoSuggestContent" type="autosuggest_text" stored="true" indexed="true" multiValued="true"/>
<copyField source="content" dest="autoSuggestContent"/>
<copyField source="original_title" dest="autoSuggestContent"/>
<field name="content" type="text" stored="true" indexed="true"/>
<field name="original_title" type="text" stored="true" indexed="true"/>
<field name="site" type="site" stored="false" indexed="true"/>

The index on the above schema is distributed over two Solr shards, each with about 1.2 million documents and about 195GB on disk. We want to retrieve (site, autoSuggestContent term, frequency of the term) information from our main Solr index. The site field contains the name of the site to which a document belongs. The terms are retrieved from the multivalued field autoSuggestContent, which is built from shingles of the content and title of each web page. As of now, we use a facet query to retrieve (term, frequency) for each site. Below is a sample query (you may ignore the initial part of the query):

http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index

The problem is that as the index grows, this method has started taking a huge amount of time. It used to take 7 minutes per site with an index of 0.4 million docs, but takes around 60-90 minutes with an index of 2.5 million. At this speed, it will take around 5-6 days to process all 1500 sites. We also expect the index to grow with more documents and more sites, so the time to extract this information will increase further. Please let us know if there is a better way to extract (site, term, frequency) information than the current method. Thanks, Pravin Agrawal
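For anyone who wants to experiment with the request above, here is the same facet query expressed in SolrJ; it adds nothing beyond the parameters already shown, apart from an assumed server URL:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SiteTermFrequencies {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("site:www.abc.com");  // one site at a time
        q.setRows(0);                          // counts only, no documents
        q.setFacet(true);
        q.addFacetField("autoSuggestContent");
        q.setFacetMinCount(25);
        q.setFacetLimit(-1);                   // all terms above the mincount
        q.set("facet.method", "enum");
        q.set("facet.sort", "index");

        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("autoSuggestContent").getValues()) {
            System.out.println(c.getName() + "\t" + c.getCount());
        }
        server.shutdown();
    }
}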
Re: SolrCloud and external Zookeeper ensemble
Note: the number of ZooKeeper nodes is independent of the number of shards. Otis -- SOLR Performance Monitoring - http://sematext.com/spm On Nov 22, 2012 4:19 AM, Luis Cappa Banda luisca...@gmail.com wrote: Hello, I've been dealing with the same question these days. [...] Separate is generally nice because then you can restart Solr nodes without consideration for ZooKeeper. Performance-wise, I doubt it's a big deal either way. - Mark On Nov 21, 2012, at 8:54 AM, Marcin Rzewucki mrzewu...@gmail.com wrote: Hi, I have 4 Solr collections, 2-3 million documents per collection, up to 100K updates per collection daily (roughly). I'm going to create a 4-node SolrCloud on Amazon's m1.large instances (7GB mem, 2x2.4GHz CPU each). The question is what about ZooKeeper? It's going to be an external ensemble, but is it better to use the same nodes as Solr or dedicated micro instances? ZooKeeper does not seem to be a resource-demanding process, but what would be better in this case: to keep it inside the SolrCloud nodes or separately (micro instances seem to be enough here)? Thanks in advance. Regards. -- - Luis Cappa
User context based search in apache solr
In our application we provide product master data search with Solr. Now we want to provide user-context-based search (i.e., promoting top results based on user history). For that I have created a score table with the following fields: 1) product_id 2) user_id 3) score_value. As soon as a user clicks on any product, an entry is created in this table, and the score_value is increased if the product is already present for that user. We are planning to use a boost field and eDisMax in Solr to improve search results, but for this I would have to use a one-to-many mapping between the score and product tables (because one product has a different score value for each user), and Solr does not provide one-to-many mappings. We can solve this (handling the one-to-many mapping) by denormalizing the structure, having multiple entries per product with a different score value per user, but that results in a huge amount of redundant data. Is this (denormalized structure) the correct way to handle this, or is there another way to handle such context-based search? Please help me.
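One hedged sketch of the denormalized approach described above: each (product, user) pair becomes its own document carrying that user's score_value, the query filters to the current user's copies, and eDisMax applies the score as a multiplicative boost. The field names come from the post; the boost function and example values are illustrative choices, not a tested design:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class UserContextSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("laptop");
        q.set("defType", "edismax");
        q.addFilterQuery("user_id:42");              // restrict to this user's copies
        q.set("boost", "log(sum(score_value,1))");   // higher score_value ranks higher
        System.out.println(server.query(q).getResults().getNumFound());
        server.shutdown();
    }
}

The redundancy concern in the post is real: this layout multiplies the document count by the number of users with history, so it only scales while that product times user set stays manageable.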
Re: Error: _version_ field must exist in schema
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny n...@onemorecloud.com wrote: Belated reply, but this is probably something you should let us know about directly at supp...@onemorecloud.com if it happens again. Cheers. Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on the Websolr account. But I commend you for taking notice and taking an interest. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
RE: Solr UIMA with KEA
See: http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html -Original message- From: nutchsolruser nutchsolru...@gmail.com Sent: Fri 23-Nov-2012 06:53 To: solr-user@lucene.apache.org Subject: Solr UIMA with KEA Is there any way we can extract tags or keyphrases from a Solr document at index time? I know we can use the Solr UIMA library to enrich a Solr document with metadata, but it requires an Alchemy API key (which we would have to purchase for commercial use). Can we wrap the KEA key-phrase extractor in UIMA for this purpose? If yes, please let me know some useful pointers for doing this. Thank you,
RE: Solr UIMA with KEA
Sorry, wrong list :) -Original message- From: Markus Jelsma markus.jel...@openindex.io Sent: Fri 23-Nov-2012 08:32 To: solr-user@lucene.apache.org Subject: RE: Solr UIMA with KEA See: http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html [...]