Re: Concat 2 fields in another field
Hi all, thanks for your replies. I managed to do this by writing a custom update processor and configuring it as below:

<processor class="com.test.solr.update.CustomConcatFieldUpdateprocessorFactory">
  <str name="field">firstName</str>
  <str name="field">lastName</str>
  <str name="dest">fullName</str>
  <str name="delimiter">_</str>
</processor>

Federico Chiacchiaretta, I tried the option you mentioned, but on frequent updates of the document it kept appending the value multiple times, which I don't want. In my custom component I check for an existing value, and only if it is empty do I set it to fN_lN. Thanks a lot for the quick replies.
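For illustration, here is a minimal sketch of what such a factory might look like against the Solr 4.x update-processor API - this is not the poster's actual code, and the parameter names simply mirror the config above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CustomConcatFieldUpdateprocessorFactory extends UpdateRequestProcessorFactory {

  private final List<String> sourceFields = new ArrayList<>();
  private String dest;
  private String delimiter;

  @Override
  public void init(@SuppressWarnings("rawtypes") NamedList args) {
    for (Object f : args.getAll("field")) {        // firstName, lastName
      sourceFields.add((String) f);
    }
    dest = (String) args.get("dest");              // fullName
    delimiter = (String) args.get("delimiter");    // "_"
  }

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        // Only fill the destination when it is empty, so repeated updates
        // of the same document don't append the value again and again.
        if (doc.getFieldValue(dest) == null) {
          StringBuilder sb = new StringBuilder();
          for (String field : sourceFields) {
            Object value = doc.getFieldValue(field);
            if (value == null) continue;
            if (sb.length() > 0) sb.append(delimiter);
            sb.append(value);
          }
          doc.setField(dest, sb.toString());
        }
        super.processAdd(cmd);
      }
    };
  }
}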
why does a node switch state?
hi, I have a SolrCloud with 8 JVMs hosting 4 shards (2 nodes per shard). 1,000,000 docs are indexed per day, with 10 query requests per second, sometimes spiking to around 100 per second. In each shard, one JVM has an 8G heap and the other 5G. The JVM args are:

-Xmx5000m -Xms5000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection -XX:+PrintGCDateStamps -XX:+PrintGC -Xloggc:log/jvmsolr.log

or

-Xmx8000m -Xms8000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection -XX:+PrintGC -XX:+PrintGCDateStamps -Xloggc:log/jvmsolr.log

The nodes work well, but they also switch state every day (and at the same time, GC becomes abnormal, like below):

2013-08-28T13:29:39.140+0800: 97180.866: [GC 3770296K->2232626K(4608000K), 0.0099250 secs]
2013-08-28T13:30:09.324+0800: 97211.050: [GC 3765732K->2241711K(4608000K), 0.0124890 secs]
2013-08-28T13:30:29.777+0800: 97231.504: [GC 3760694K->2736863K(4608000K), 0.0695530 secs]
2013-08-28T13:31:02.887+0800: 97264.613: [GC 4258337K->4354810K(4608000K), 0.1374600 secs]
97264.752: [Full GC 4354810K->2599431K(4608000K), 6.7833960 secs]
2013-08-28T13:31:09.884+0800: 97271.610: [GC 2750517K(4608000K), 0.0054320 secs]
2013-08-28T13:31:15.354+0800: 97277.080: [GC 3550474K(4608000K), 0.0871270 secs]
2013-08-28T13:31:31.258+0800: 97292.984: [GC 3877223K(4608000K), 0.1551870 secs]
2013-08-28T13:31:34.396+0800: 97296.123: [GC 3877223K(4608000K), 0.1220380 secs]
2013-08-28T13:31:38.102+0800: 97299.828: [GC 3877225K(4608000K), 0.1545500 secs]
2013-08-28T13:31:40.227+0800: 97303.019: [Full GC 4174941K->2127315K(4608000K), 6.3435150 secs]
2013-08-28T13:31:49.645+0800: 97311.371: [GC 2508466K(4608000K), 0.0355180 secs]
2013-08-28T13:31:57.645+0800: 97319.371: [GC 2967737K(4608000K), 0.0579650 secs]

Even worse, sometimes a whole shard is down (one node recovering, the other down), which is an absolute disaster... Please help me; any advice is welcome.
Re: Multiple replicas for specific shard
Thanks Keith! But can this be done dynamically? Let's take the following example: a SolrCloud cluster with sport event results split into three shards by category - a football shard, a golf shard and a baseball shard. Each of these shards has a replica on a machine. Then I realize that my football-related QPS has grown dramatically, so I decide to add 2 more replicas for the football shard, on two new machines. How can I proceed in this situation?
Re: Solr 4.2 Regular expression, returning only matched substring
hi Erick, I appreciate your reply. facet.query will give the count of matches, not the count per unique pattern match. If I give the regular expression [0-9]{3} to match a 3-digit number, it will return the total occurrences of three-digit numbers, but I want to know the occurrences of each unique 3-digit number. Let's say the number 100 occurs 10 times and 500 occurs 5 times; facet.query will return a count of 15, instead of giving the counts for 100 and 500 individually. I hope I made myself clear. Is there any way to do this? thanks and regards jai
Re: why does a node switch state?
Do you see anything in the Solr logs as to what the trigger for your nodes changing state was? You should see some kind of error/warning before the election is triggered. My gut feeling would be loss of communication between your leader and ZK (possibly from a GC event that locks the JVM for a while), but that's pure conjecture given you haven't given a lot of information. What is your ZK timeout? You are seeing ~6s GC events, so if that is locking the JVM for that long, and your ZK timeout is less than that, it is likely that ZK thinks the node has gone away, so it forces an election to find a new leader. But there should be evidence of that in the logs; you should see the ZK connection drop. On 28 August 2013 08:25, sling sling...@gmail.com wrote: [...]
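As a concrete (hedged) illustration of where that timeout lives in a 4.x setup - 30 seconds is an arbitrary example value chosen to sit above the ~7s full-GC pauses shown above:

java -DzkClientTimeout=30000 -jar start.jar

The stock old-style solr.xml reads this property via zkClientTimeout="${zkClientTimeout:15000}" on the <cores> element, so it can also be set there directly. Raising it only hides the symptom, though; the long full-GC pauses are worth attacking as well.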
Re: Solr 4.2 Regular expression, returning only matched substring
Ah, OK. Nothing springs to mind. Even faceting on the individual values of the field counts _documents_ that match, but doesn't give you which particular values matched. I suppose that in that case you could run your regex over the returned labels for the facets, but that's a really ugly solution. The problem is that for a field with 1M unique values you'd perhaps get a list 1M long, which wouldn't perform at all well. Depending, you could enumerate your terms (see TermsComponent) using terms.regex to get a list of all terms that match your regex up-front, then do some relatively painful facet querying on the long list of returned values - again not something I'd do in a high-query environment. It depends, I guess, on how busy your website is. Best Erick On Wed, Aug 28, 2013 at 4:18 AM, jai2 jai4l...@gmail.com wrote: [...]
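For illustration, such a TermsComponent request might look like this, assuming the /terms handler from the example solrconfig and an illustrative field name; each term in the response comes back with its document frequency, which is close to the per-value count being asked for:

http://localhost:8983/solr/collection1/terms?terms=true&terms.fl=myfield&terms.regex=[0-9]{3}&terms.limit=-1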
Help to figure out why query does not match
Hi, please help me figure out what's going on. I have the following field type:

<fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\d\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

And the following string indexed: http://plus.google.com/111950520904110959061/profile Here is what the analyzer shows: http://img607.imageshack.us/img607/5074/fn1.png Then I run this query:

fq=type:Site sort=score desc q=https\\:\\/\\/plus.google.com\\/111950520904110959061\\/profile fl=* score qf=url_words_ngram defType=edismax start=0 rows=20 mm=1

and get no results. These queries do match: 1. https://plus.google 2. https://plus.google.com 3. 11195052090 And these do not: 1. https://plus.google.com/111950520904110959061/profile 2. 111950520904110959061/profile 3. 111950520904110959061 The reason is that 111950520904110959061 is 21 characters long while I have the max gram size set to 20. I tried increasing the max gram size to 200 and it works, but is there any way to match the given query without doing that? The query analyzer shows there are exact matches at PT, SF and LCF - or does it work in such a way that the index contains only the output of the last filter factory (ENGTF in my example)? If so, is there an option to preserve the original tokens as well? So that for maxGramSize=5 and the indexed string awesomeness I'd have: a, aw, awe, awes, aweso, awesomeness Best, Alex
Re: How to patch Solr4.2 for SolrEnityProcessor Sub-Enity issue
This is fixed in trunk and branch_4x and will be available in the next release (4.5). See https://issues.apache.org/jira/browse/SOLR-5190 On Mon, Aug 26, 2013 at 12:37 PM, harshchawla ha...@livecareer.com wrote: Thanks a lot in advance. I am eagerly waiting for your response. -- Regards, Shalin Shekhar Mangar.
Re: How to patch Solr4.2 for SolrEnityProcessor Sub-Enity issue
Thanks a lot for this fix. I am now eagerly waiting for Solr 4.5.
Re: Multiple replicas for specific shard
http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin. Essentially you create a core on a new machine and assign it a collection and shard. It'll register itself, replicate the data from the leader and join the cluster automatically. You could script this too, but be aware that the replication may take quite a while depending on the network speed and the size of your index. Best Erick On Wed, Aug 28, 2013 at 4:09 AM, maephisto my_sky...@yahoo.com wrote: [...]
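For illustration, the CoreAdmin call for the football example might look like this, repeated once per new machine (host, core and collection names are assumptions):

http://newmachine1:8983/solr/admin/cores?action=CREATE&name=football_replica3&collection=sports&shard=football

The new core registers in the cluster state straight away and starts serving once its initial replication from the shard leader completes.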
Solr 4.0 - Fuzzy query and Proximity query
Hi, with Solr 4.0 the fuzzy query syntax is like keyword~1 (or 2), and proximity search is like value~20. How does Solr differentiate between the two searches? My thought was that proximity would be on phrases and fuzzy on individual words. Is that correct? I wanted to do a proximity search on a text field and gave the query below: ip:port/collection1/select?q=trinity%20service~50&debugQuery=yes It gives me these results:

<result name="response" numFound="111" start="0" maxScore="4.1237307">
  <doc><str name="business_name">*Trinidad* Services</str></doc>
  <doc><str name="business_name">Trinity Services</str></doc>
  <doc><str name="business_name">Trinity Services</str></doc>
  <doc><str name="business_name">*Trinitee* Service</str></doc>

How do I differentiate between fuzzy and proximity? Thanks, Prasi
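For reference, the query parser tells the two apart by what the ~ is attached to; a sketch with illustrative field and terms:

business_name:trinity~2               fuzzy: bare term, maximum edit distance 2
business_name:"trinity service"~50    proximity: quoted phrase, slop of 50

An unquoted trinity service~50 is therefore not a proximity query at all - the ~ attaches only to the last term.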
Re: Help to figure out why query does not match
Hmmm. Certainly only the outputs of the last filter make it into the index - consider stopwords being the last filter: you'd expect the stopwords to be removed. There's nothing that I know of that'll do what you're asking; the code for ENGTF doesn't have any preserve original that I see. This seems like a useful addition though, and you've done a nice job of characterizing the problem. Want to raise a JIRA and/or do a patch? I'd guess your only real short-term workaround would be to increase the max gram size. I suppose you could do a copyField into a field that doesn't do the n-gramming and search against that too, but that feels kind of kludgy... Best, Erick On Wed, Aug 28, 2013 at 7:16 AM, heaven aheave...@gmail.com wrote: [...]
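For illustration, that workaround could be sketched in schema.xml as follows, where words_plain is assumed to be the same analysis chain as words_ngram minus the EdgeNGramFilterFactory, and url stands for whatever source field feeds url_words_ngram:

<field name="url_words_plain" type="words_plain" indexed="true" stored="false" />
<copyField source="url" dest="url_words_plain" />

Queries would then use qf=url_words_ngram url_words_plain so either field can produce the match.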
Re: Solr 4.0 - Fuzzy query and Proximity query
The first thing I'd recommend is to look at the admin/analysis page. I suspect you aren't seeing fuzzy query results at all; what you're seeing is the result of stemming. Stemming is algorithmic, so it sometimes produces very surprising results, i.e. Trinidad and Trinitee may stem to something like triniti. But you didn't provide the field definition, so it's just a guess. Best Erick On Wed, Aug 28, 2013 at 7:43 AM, Prasi S prasi1...@gmail.com wrote: [...]
Data Centre recovery/replication, does this seem plausible?
We have 2 separate data centers in our organisation, and in order to maintain the ZK quorum during any DC outage, we have 2 separate Solr clouds, one in each DC with separate ZK ensembles, but both fed with the same indexing data. Now in the event of a DC outage, all our Solr instances go down, and when they come back up, we need some way to recover the lost data. Our thought was to replicate from the working DC, but is there a way to do that whilst still maintaining an online presence for indexing purposes? In essence, we want to do what happens within SolrCloud's own recovery, so (as I understand cloud recovery, and assuming the worst case where peer sync has failed) a node starts up, buffers all updates into the transaction log, replicates from the leader, and replays the transaction log to get everything in sync. Is it conceivable to do the same by extending Solr, so that on the activation of some handler (user triggered), we initiate a replication from the other DC, which puts all the leaders into buffering updates, replicates from some other set of servers and then replays? Our goal is to minimize the downtime (beyond the initial outage), so we would ideally like to be able to start up indexing before this replicate/clone has finished; that's why I thought to enable buffering on the transaction log. Searches shouldn't be sent here, but if they are we have a valid (albeit old) index to serve them until the new one swaps in. Just curious how any other DC-aware setups handle this kind of scenario? Or other concerns/issues with this type of approach.
Re: Solr 4.0 - Fuzzy query and Proximity query
hi Erick, Yes, that's correct. These results are because of stemming + phonetic matching. Below is the analysis:

Index time
ST trinity services
SF trinity services
LCF trinity services
SF trinity services
SF trinity services
WDF trinity services

Query time
SF triniti servic
PF TRNT triniti SRFK servic
HWF TRNT triniti SRFK servic
PSF TRNT triniti SRFK servic

Apart from this, fuzzy would be for individual words and proximity would be for phrases - is this correct? Also, can we have fuzzy on phrases? On Wed, Aug 28, 2013 at 5:36 PM, Erick Erickson erickerick...@gmail.com wrote: [...]
Re: Solr 4.0 - Fuzzy query and Proximity query
Sorry, I copied it wrong. Below is the correct analysis.

Index time
ST trinity services
SF trinity services
LCF trinity services
SF trinity services
SF trinity services
WDF trinity services
SF triniti servic
PF TRNT triniti SRFK servic
HWF TRNT triniti SRFK servic
PSF TRNT triniti SRFK servic

Query time
ST trinity services
SF trinity services
LCF trinity services
WDF trinity services
SF triniti servic
PSF triniti servic
PF TRNT triniti SRFK servic

Apart from this, fuzzy would be for individual words and proximity would be for phrases - is this correct? Also, can we have fuzzy on phrases? On Wed, Aug 28, 2013 at 5:58 PM, Prasi S prasi1...@gmail.com wrote: [...]
Re: Solr 4.0 - Fuzzy query and Proximity query
No. ComplexPhraseQuery has been around for quite a while but never incorporated into the code base; it's pretty much what you need to do both fuzzy and phrase at once. But doesn't phonetic really incorporate at least a flavor of fuzzy? Is it close enough for your needs to just do phonetic matches? Best Erick On Wed, Aug 28, 2013 at 8:31 AM, Prasi S prasi1...@gmail.com wrote: [...]
Re: Data Centre recovery/replication, does this seem plausible?
The separate DC problem has been lurking for a while. But your understanding is a little off. When a replica discovers that it's too far out of date, it does an old-style replication; IOW, the tlog doesn't contain the entire delta. Eventually the old-style replications catch up to close enough, and _then_ the remaining docs in the tlog are replayed. The target number of updates in the tlog is 100, so it's a pretty small window that's actually replayed in the normal case. None of which helps your problem. The simplest way (on the expectation that DC outages are pretty rare!) would be to have your indexing process fire the missed updates at the DC after it came back up. Copying from one DC to another is tricky: you'd have to be very, very sure that you copied indexes to the right shard. Ditto for any process that tried to have, say, a single node from the recovering DC temporarily join the good DC, at least long enough to sync. Not a pretty problem; we don't really have any best practices yet that I know of. FWIW, Erick On Wed, Aug 28, 2013 at 8:13 AM, Daniel Collins danwcoll...@gmail.com wrote: [...]
NPE during distributed search
Solr 4.3.1
container: jetty 9 (jetty-distribution-9.0.4.v20130625)
shard sizes: between 10G and 15G
two cores per shard, non-SolrCloud mode

We have a frontend Solr and several shards. When searching in a smaller number of shards, the query runs OK. When asking for a larger number of shards, the query fails with an NPE. Looking into the corresponding code, we see the score comparison:

class: ShardDoc.java
method: static Comparator comparatorScore(final String fieldName)
code: final float f1 = e1.score;

It looks like e1 is null. What could be the reason? Is it at all possible to remove scoring altogether (because we don't need it)? What else should we look into? NPE stack trace:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:java.lang.NullPointerException
at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardDoc.java:234)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:159)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:101)
at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:231)
at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:140)
at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:863)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:625)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:604)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:311)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1094)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1028)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:317)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:445)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:267)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:224)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
at java.lang.Thread.run(Thread.java:722)
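For what it's worth, comparatorScore is only used when the merge is ordered by score; sorting explicitly on a field routes the merge through the field comparators instead. A hedged example, with the sort field name as an assumption:

...&q=*:*&sort=timestamp desc&fl=id,title

That sidesteps rather than explains the NPE - some shard apparently returned a hit without a score - but it matches the "we don't need scoring" requirement.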
how to sum a field grouping by more fields
Hello, can somebody tell me if Solr 4.4.0 supports *stats.pivot*, in order to sum a field grouped by several other fields? Are there other methods to sum a field grouped by more than one field? Thanks
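For what it's worth, the 4.x StatsComponent can at least break a sum down by one field at a time via stats.facet; a hedged sketch, with the field names (amount, category, region) as assumptions:

q=*:*&rows=0&stats=true&stats.field=amount&stats.facet=category&stats.facet=region

Each stats.facet yields an independent per-value breakdown (sum, count, etc.), not a cross-pivot of the two fields.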
Group/distinct
Hi I have a set of collections containing documents with the fields a, b and timestamp. There are A LOT of documents; many of them have the same value of a, and for each value of a there is only a very limited set of distinct values of b. The timestamp values are different for (almost) all documents. Can I make a group/distinct query to Solr returning all distinct values of a where timestamp is within a certain period of time? If yes, how? I guess this is just using grouping or faceting, but what is the difference, and which one is best? Do any of them require that the fields have been prepared for grouping/faceting by setting something up in the schema? Can I make a query to Solr returning all distinct values of a where timestamp is within a certain period of time, and also, for each distinct a, have the limited set of distinct b-values returned? I guess this would mean grouping/faceting on multiple fields, but can you do that? Other suggestions on how to achieve this? Regards, Per Steffensen
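For illustration, the two approaches might look like this (the date range is an assumption):

q=*:*&rows=0&fq=timestamp:[2013-08-01T00:00:00Z TO 2013-08-28T00:00:00Z]&facet=true&facet.field=a&facet.limit=-1&facet.mincount=1

q=*:*&fq=timestamp:[2013-08-01T00:00:00Z TO 2013-08-28T00:00:00Z]&group=true&group.field=a&group.limit=5&fl=a,b

The first returns only value/count pairs for a; the second returns one group per distinct a with up to 5 documents each, so the b values come along too. Neither needs special schema preparation beyond the field being indexed (and single-valued, in the case of group.field).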
Re: Data Centre recovery/replication, does this seem plausible?
I've been thinking about this one too and was curious about using the Solr entity support in the DIH to do the import from one DC to another (for the lost docs). In my mind, one configures the DIH to use the SolrEntityProcessor with a query to capture the docs in the DC that stayed online, most likely using a timestamp in the query (see: http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor). Would that work? If so, any downsides? I've only used DIH / SolrEntityProcessor to populate a staging/dev environment from prod, but have had good success with it. Thanks. Tim On Wed, Aug 28, 2013 at 6:59 AM, Erick Erickson erickerick...@gmail.com wrote: [...]
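For illustration, such a DIH config might look like the following (host name, collection and the timestamp field are assumptions):

<dataConfig>
  <document>
    <entity name="recovered" processor="SolrEntityProcessor"
            url="http://good-dc-host:8983/solr/collection1"
            query="timestamp:[2013-08-28T00:00:00Z TO *]"
            rows="500" fl="*" />
  </document>
</dataConfig>

As noted in the next reply, this only round-trips stored fields, so every indexed field would need to be stored (or derivable via copyField) in the source collection.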
Re: Data Centre recovery/replication, does this seem plausible?
If you can satisfy this statement then it seems possible (this is the same restriction as for atomic updates): The SolrEntityProcessor can only copy fields that are stored in the source index. On Wed, Aug 28, 2013 at 9:41 AM, Timothy Potter thelabd...@gmail.com wrote: [...]
RE: Data Centre recovery/replication, does this seem plausible?
Hi - you're going to miss unstored but indexed fields. We stop any indexing process, kill the servlets on the down DC, copy over the files using scp, then remove the lock file and start it up again. It always works, but it's a manual process at this point; it should be easy to automate with some simple bash scripting. -Original message- From: Timothy Potter thelabd...@gmail.com Sent: Wednesday 28th August 2013 15:41 To: solr-user@lucene.apache.org Subject: Re: Data Centre recovery/replication, does this seem plausible? [...]
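A minimal sketch of what that automation could look like - hosts, paths and the service name are assumptions, a real script would loop over all cores, and rsync stands in here for plain scp:

#!/bin/bash
GOOD_HOST=solr-good-dc
INDEX_DIR=/opt/solr/collection1/data/index

service jetty stop                                         # kill the servlet container on the recovered DC
rsync -a --delete "$GOOD_HOST:$INDEX_DIR/" "$INDEX_DIR/"   # copy the index files over from the good DC
rm -f "$INDEX_DIR/write.lock"                              # remove the stale lock file
service jetty start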
RE: SOLR 4.2.1 - High Resident Memory Usage
Hi - it's certainly not a rule of thumb, but RES usually grows higher than Xmx, so keep an eye on it. -Original message- From: vsilgalis vsilga...@gmail.com Sent: Wednesday 28th August 2013 2:53 To: solr-user@lucene.apache.org Subject: Re: SOLR 4.2.1 - High Resident Memory Usage http://lucene.472066.n3.nabble.com/file/n4086923/huge.png That doesn't seem to be a problem. Markus, are you saying that I should plan on resident memory being at least double my heap size? I haven't run into issues around this before, but then again I don't know everything. Is this a rule of thumb, or is there documentation I can look at? Thanks again.
Re: Multiple replicas for specific shard
Thanks Erick, I think this answers my question.
Re: Data Centre recovery/replication, does this seem plausible?
On 8/28/2013 6:13 AM, Daniel Collins wrote: [...] One way which would work (if your core name structures were identical between the two clouds) would be to shut down your indexing process, shut down the cloud that went down and has now come back up, and rsync from the good cloud. Depending on the index size that could take a long time, and index updates would be turned off while it's happening, which makes this idea less than ideal. I have a similar setup on a sharded index that's NOT using SolrCloud, with both copies in one location instead of two separate data centers. My general indexing method would work for your setup, though. The way that I handle this is that my indexing program tracks its update position for each copy of the index independently. If one copy is down, the tracked position for that index won't get updated, so the next time it comes up, all missed updates will get done for that copy. In the meantime, the program (Java, using SolrJ) happily uses a separate thread to continue updating the index copy that's still up. Thanks, Shawn
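As a minimal sketch of that pattern (not Shawn's actual code - the update source is stubbed out, and the in-memory map stands in for a persisted position file), each copy keeps its own cursor, which only advances when its updates succeed:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DualCopyIndexer {

  // One indexing cursor per index copy, tracked independently.
  private final Map<String, Long> positions = new ConcurrentHashMap<String, Long>();

  /** One indexing pass for a single copy; each copy runs this in its own thread. */
  void runOnce(String solrUrl, long latestUpdateId) {
    Long tracked = positions.get(solrUrl);
    long from = (tracked == null) ? 0L : tracked;
    try {
      sendUpdates(solrUrl, from, latestUpdateId);  // SolrJ add/delete calls
      positions.put(solrUrl, latestUpdateId);      // advance the cursor only on success
    } catch (Exception e) {
      // This copy is down: leave its cursor alone so the missed updates
      // are replayed automatically once it comes back up.
    }
  }

  private void sendUpdates(String url, long from, long to) throws Exception {
    // fetch rows (from, to] from the source database and push them via SolrJ
  }
}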
Re: Solr 4.0 - Fuzzy query and Proximity query
Mixing fuzzy with phonetic can give bizarre matches; I worked on a search engine that did that. You really don't want to mix stemming, phonetic, and fuzzy. They are distinct transformations of the surface word that do different things. Stemming: conflate different inflections of the same word, like car and cars. Phonetic: conflate words that sound similar, like moody and mudie. Fuzzy: conflate words with different spellings or misspellings, like smith, smyth, and smit. If you want all of these, make three fields with separate transformations. wunder On Aug 28, 2013, at 5:46 AM, Erick Erickson wrote: [...] -- Walter Underwood wun...@wunderwood.org
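For illustration, the three-field layout could be sketched like this in schema.xml - the field and type names are assumptions, each type carrying only its own transformation (a stemmer, a phonetic filter, or plain lowercased tokens for fuzzy matching):

<field name="name_stemmed" type="text_stemmed" indexed="true" stored="false" />
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false" />
<field name="name_exactish" type="text_general" indexed="true" stored="false" />
<copyField source="business_name" dest="name_stemmed" />
<copyField source="business_name" dest="name_phonetic" />
<copyField source="business_name" dest="name_exactish" />

Fuzzy queries would then be aimed at name_exactish, where no rewriting interferes with the edit-distance match.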
Re: ICUTokenizer class not found with Solr 4.4
Thanks Shawn and Naomi, I think I am running into the same bug, but the symptoms are a bit different, and I'm wondering if it makes sense to file a separate linked bug report. "The workaround is to remove sharedLib from solr.xml" - but the solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4 out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box version; there is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml either. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class-not-found error. Should I open another bug report or comment on the existing report? Tom On Tue, Aug 27, 2013 at 6:48 PM, Shawn Heisey s...@elyograg.org wrote: On 8/27/2013 4:29 PM, Tom Burton-West wrote: According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there: "lib. If it exists, Solr will load any Jars found in this directory and use them to resolve any plugins specified in your solrconfig.xml or schema.xml". I did so (see below). However, I keep getting a class-not-found error (see below). Has the default changed from what is documented in the README.txt file? Is there something I have to change in solrconfig.xml or solr.xml to make this work? I looked at SOLR-4852, but don't understand. It sounds like maybe there is a problem if the collection1/lib directory is also specified in solrconfig.xml. But I didn't do that (i.e. out-of-the-box solrconfig.xml). Does this mean that by following what it says in the README.txt, I am making some kind of configuration error? I also don't understand the workaround in SOLR-4852. That's my bug! :) If you have sharedLib set to lib (or explicitly the lib directory under solr.solr.home) in solr.xml, then ICUTokenizer cannot be found despite the fact that all the correct jars are there. The workaround is to remove sharedLib from solr.xml, or set it to some other directory that either doesn't exist or has no jars in it. The ${solr.solr.home}/lib directory is automatically added to the classpath regardless of config; there seems to be some kind of classloading bug when the sharedLib adds the same directory again. This all worked fine in 3.x and early 4.x releases, but due to classloader changes it seems to have broken. I think (based on the issue description) that it started being a problem with 4.3-SNAPSHOT. The same thing happens if you set sharedLib to foo and put some of your jars in lib and some in foo. It's quite mystifying. Thanks, Shawn
Re: ICUTokenizer class not found with Solr 4.4
My point in the previous e-mail was that following the instructions in the documentation does not seem to work. The workaround I found was to simply change the name of the collection1/lib directory to collection1/foobar and then include it in solrconfig.xml: <lib dir="./foobar" /> This works, but it does not explain why, out-of-the-box, simply creating a collection1/lib directory and putting the jars there does not work as documented in both the README.txt and solrconfig.xml. Shawn, should I add these comments to your JIRA issue, or should I open a separate related JIRA issue? Tom On Tue, Aug 27, 2013 at 7:18 PM, Shawn Heisey s...@elyograg.org wrote: On 8/27/2013 5:11 PM, Naomi Dushay wrote: Perhaps you are missing the following from your solrconfig: <lib dir="/home/blacklight/solr-home/lib" /> I ran into this issue (I'm the one that filed SOLR-4852) and I am not using Blacklight. I am only using what can be found in a Solr download, plus the MySQL JDBC driver for dataimport. I prefer not to load jars via solrconfig.xml. I have a lot of cores and every core needs to use the same jars. Rather than have the same jars loaded 18 times (once by each of the 18 solrconfig.xml files), I would rather have Solr load them once and make the libraries available to all cores. Using ${solr.solr.home}/lib accomplishes this goal. Thanks, Shawn
Re: ICUTokenizer class not found with Solr 4.4
On 8/28/2013 9:34 AM, Tom Burton-West wrote: I think I am running into the same bug, but the symptoms are a bit different. I'm wondering if it makes sense to file a separate linked bug report. The workaround is to remove sharedLib from solr.xml, The solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4. out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box. There is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class not found error. Should I open another bug report or comment on the existing report? I have never heard of using ${instanceDir}/lib for jars. That doesn't mean it won't work, but I have never seen it mentioned anywhere. I have only ever put the lib directory in solr.home, where solr.xml is. Did you try that? If you have seen documentation for collection1/lib, then there may be a doc bug, another dimension to the bug already filed, or a new bug. Do you see log entries saying your jars in collection/lib are loaded? If you do, then I think it's probably another dimension to the existing bug. Thanks, Shawn
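To make the distinction concrete, a sketch of the layout Shawn means (paths from the stock 4.4 example; only the lib/ directory is added):

example/solr/                <-- solr home, contains solr.xml
example/solr/lib/            <-- jars here are loaded automatically for all cores
example/solr/collection1/    <-- instanceDir; conf/ and data/ live here

Whether collection1/lib (inside the instanceDir) should also work is exactly the open question in this thread.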
Re: Data Centre recovery/replication, does this seem plausible?
Thanks Shawn/Erick for the suggestions. Unfortunately stopping indexing whilst we recover isn't a viable option; we are using Solr as an NRT search platform, so indexing must continue at least on the DC that is fine. If we could stop indexing on the broken DC, then recovery is relatively straightforward: it's an rsync/copy of a snapshot from the other data center followed by restarting indexing. The million dollar question is how to start up our existing Solr instances (once the data center has recovered from whatever broke it), realize that we have a gap in indexing (using a checkpointing mechanism similar to what Shawn describes), and recover from that (that's the tricky bit!), without having to interrupt indexing... I know that replication takes up to an hour (it's a rather large collection, but it's split into 8 shards currently, and we can replicate each shard in parallel). What ideally I would like to do is, at the point that I kick off recovery, divert the indexing feed for the broken DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code. Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!) Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. at the last point in the indexing pipeline). The plan would be to recover the leaders (1 of each shard) this way, and then use conventional replication/recovery to deal with the local replicas (blank their data area and then they will automatically sync from the local leader). On 28 August 2013 15:26, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 6:13 AM, Daniel Collins wrote: We have 2 separate data centers in our organisation, and in order to maintain the ZK quorum during any DC outage, we have 2 separate Solr clouds, one in each DC with separate ZK ensembles, but both are fed with the same indexing data. Now in the event of a DC outage, all our Solr instances go down, and when they come back up, we need some way to recover the lost data. Our thought was to replicate from the working DC, but is there a way to do that whilst still maintaining an online presence for indexing purposes? One way which would work (if your core name structures were identical between the two clouds) would be to shut down your indexing process, shut down the cloud that went down and has now come back up, and rsync from the good cloud. Depending on the index size, that could take a long time, and the index updates would be turned off while it's happening. That makes this idea less than ideal. I have a similar setup on a sharded index that's NOT using SolrCloud, and both copies are in one location instead of two separate data centers. My general indexing method would work for your setup, though. The way that I handle this is that my indexing program tracks its update position for each copy of the index independently.
If one copy is down, the tracked position for that index won't get updated, so the next time it comes up, all missed updates will get done for that copy. In the meantime, the program (Java, using SolrJ) is happily using a separate thread to continue updating the index copy that's still up. Thanks, Shawn
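As a rough illustration of the per-copy position tracking Shawn describes (a sketch only: Checkpoint and DocSource are hypothetical application-side abstractions, not Solr classes; real code would persist the checkpoint somewhere durable):

import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical application interfaces, invented for this sketch.
interface Checkpoint { long position(); void advanceTo(long p); }
interface DocSource { List<SolrInputDocument> fetchSince(long p); long latestPosition(); }

class CheckpointedUpdater {
    // Push everything a given index copy has missed; a copy that was down
    // keeps its old checkpoint and simply catches up on the next call.
    void push(SolrServer solr, Checkpoint cp, DocSource source) throws Exception {
        List<SolrInputDocument> docs = source.fetchSince(cp.position());
        if (docs.isEmpty()) return;
        solr.add(docs);
        solr.commit();
        cp.advanceTo(source.latestPosition()); // advance only after a successful commit
    }
}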
Question about SOLR-5017 - Allow sharding based on the value of a field
Hi, I'm looking into allowing query joins in SolrCloud. This has the limitation of having to index all the documents that are joinable together to the same shard. I'm wondering if SOLR-5017 https://issues.apache.org/jira/browse/SOLR-5017 would give me the ability to do so without implementing my own routing mechanism? If I add a field named parent_id and give that field the same value in all the documents that I want to join, it seems, theoretically, that it will be enough. Am I correct? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data Centre recovery/replication, does this seem plausible?
On 8/28/2013 10:48 AM, Daniel Collins wrote: What ideally I would like to do is, at the point that I kick off recovery, divert the indexing feed for the broken DC into a transaction log on those machines, run the replication and swap the index in, then replay the transaction log to bring it all up to date. That process (conceptually) is the same as the org.apache.solr.cloud.RecoveryStrategy code. I don't think any such mechanism exists currently. It would be extremely awesome if it did. If there's not an existing Jira issue, I recommend that you file one. Being able to set up a multi-datacenter cloud with automatic recovery would be awesome. Even if it took a long time, having it be fully automated would be exceptionally useful. Yes, if I could divert that feed at the application level, then I can do what you suggest, but it feels like more work to do that (and build an external transaction log), whereas the code seems to already be in Solr itself; I just need to hook it all up (famous last words!) Our indexing pipeline does a lot of pre-processing work (it's not just pulling data from a database), and since we are only talking about the time taken to do the replication (should be an hour or less), it feels like we ought to be able to store that in a Solr transaction log (i.e. at the last point in the indexing pipeline). I think it would have to be a separate transaction log. One problem with really big regular tlogs is that when Solr gets restarted, the entire transaction log that's currently on the disk gets replayed. If it were big enough to recover the last several hours to a duplicate cloud, it would take forever to replay on Solr restart. If the regular tlog were kept small but a second log with the last 24 hours were available, it could replay updates when the second cloud came back up. I do import from a database, so the application-level tracking works really well for me. Thanks, Shawn
Re: Question about SOLR-5017 - Allow sharding based on the value of a field
I don't know about SOLR-5017, but why don't you want to use parent_id as a shard key? So if you've got a doc with a key of abc123 and a parent_id of 456, just use a key of 456!abc123 and all docs with the same parent_id will go to the same shard. We're doing something similar and limiting queries to the single shard that hosts the relevant docs by setting shard.keys=456! on queries. -Greg On Wed, Aug 28, 2013 at 10:04 AM, adfel70 adfe...@gmail.com wrote: Hi I'm looking into allowing query joins in solr cloud. This has the limitation of having to index all the documents that are joineable together to the same shard. I'm wondering if SOLR-5017 https://issues.apache.org/jira/browse/SOLR-5017 would give me the ability to do so without implementing my own routing mechanism? If I add a field named parent_id and give that field the same value in all the documents that I want to join, it seems, theoretically, that it will be enough. Am I correct? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Question-about-SOLR-5017-Allow-sharding-based-on-the-value-of-a-field-tp4087050.html Sent from the Solr - User mailing list archive at Nabble.com.
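To spell out Greg's scheme (the ids and field name here are made up): index the document as id=456!abc123 instead of id=abc123. The composite-id router hashes only the 456 prefix, so every document sharing parent_id 456 lands on the same shard. At query time, the shard.keys parameter restricts the request to that shard, e.g.:

/select?q=some_field:foo&shard.keys=456!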
SolrCloud Set up
What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Any issues that any of you have run into when setting this up? Suggestions, tips, tricks? -- Jared Griffith Linux Administrator, PICS Auditing, LLC P: (949) 936-4574 C: (909) 653-7814 http://www.picsauditing.com 17701 Cowan #140 | Irvine, CA | 92614 Join PICS on LinkedIn and Twitter! https://twitter.com/PICSAuditingLLC
Re: Storing query results
You could copy the existing core to a new core every once in a while, and then do your delta indexing into the new core once the copy is complete. If a Persistent URL for the search results included the name of the original core, the results you would get from a bookmark would be stable. However, if you went to the site and did a new search, you would be searching the newest core. This I think applies whether the site is Intranet or not. Older cores could be aged out gracefully, and the search handler for an old core could be replaced by a search on the new core via sharding. On Fri, Aug 23, 2013 at 11:57 AM, jfeist jfe...@llminc.com wrote: I completely agree. I would prefer to just rerun the search each time. However, we are going to be replacing our rdb-based search with something like Solr, and the application currently behaves this way. Our users understand that the search is essentially a snapshot (and I would guess many prefer this over changing results) and we don't want to change existing behavior and confuse anyone. Also, my boss told me it unequivocally has to be this way :p Thanks for your input though, looks like I'm going to have to do something like you've suggested within our application. -- View this message in context: http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182p4086349.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to Manage RAM Usage at Heavy Indexing
This could be an operating systems problem rather than a Solr problem. CentOS 6.4 (linux kernel 2.6.32) may have some issues with page flushing and I would read-up up on that. The VM parameters can be tuned in /etc/sysctl.conf On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI furkankam...@gmail.comwrote: Hi Erick; I wanted to get a quick answer that's why I asked my question as that way. Error is as follows: INFO - 2013-08-21 22:01:30.978; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={wt=javabinversion=2} {add=[com.deviantart.reachmeh ere:http/gallery/, com.deviantart.reachstereo:http/, com.deviantart.reachstereo:http/art/SE-mods-313298903, com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/, co m.deviantart.reachthegoddess:http/art/retouched-160219962, com.deviantart.reachthegoddess:http/badges/, com.deviantart.reachthegoddess:http/favourites/, com.deviantart.reachthetop:http/ art/Blue-Jean-Baby-82204657 (1444006227844530177), com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790 ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException; java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early EOF at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:722) Caused by: org.eclipse.jetty.io.EofException: early EOF at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65) at java.io.InputStream.read(InputStream.java:101) at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110) at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht p...@hoplahup.net wrote: Dan, if you're bound to federated search then I would say that you need to work on the service guarantees of each of the nodes and, maybe, create strategies to cope with bad nodes. paul +1 I'll think on that.
Re: More on topic of Meta-search/Federated Search with Solr
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: Years ago, when Federated Search was a buzzword, we did some development and testing with Lucene, FAST Search, Google and several other search engines regarding Federated Search in a library context. The results can be found here http://pub.uni-bielefeld.de/download/2516631/2516644 Some minor parts are in German; most is written in English. It also gives you an idea of what to keep an eye on, where the pitfalls are, and so on. We also had a tool called unity (written in Python) which did Federated Search on any search engine and database, like Google, Gigablast, FAST, Lucene, ... The trick with Federated Search is to combine the results. We offered three options in the user's search interface: RoundRobin, Relevancy, and PseudoRandom. Thanks much - Andrzej B. suggested I read Comparing top-k lists in addition to his Berlin Buzzwords presentation. I will know soon whether we are set on this direction; right now I'm still trying to gauge how hard it will be.
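For what it's worth, the RoundRobin option Bernd mentions reduces to something like this sketch (types are placeholders; ties, de-duplication, and score normalization are deliberately ignored):

import java.util.ArrayList;
import java.util.List;

public class RoundRobinMerge {
    // Take result #0 from every engine, then result #1, and so on,
    // until every per-engine list is exhausted.
    public static <T> List<T> merge(List<List<T>> perEngine) {
        List<T> merged = new ArrayList<T>();
        for (int i = 0; ; i++) {
            boolean any = false;
            for (List<T> results : perEngine) {
                if (i < results.size()) { merged.add(results.get(i)); any = true; }
            }
            if (!any) return merged;
        }
    }
}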
Re: More on topic of Meta-search/Federated Search with Solr
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha shanuu@gmail.com wrote: Would you like to create something like http://knimbus.com I work at the National Library of Medicine. We are moving our library catalog to a newer platform, and we will probably include articles. Article content and metadata are available from a number of web-scale discovery services such as PRIMO, Summon, EBSCO's EDS, and EBSCO's traditional API. Most libraries use open source solutions to avoid the cost of purchasing an expensive enterprise search platform. We are big; we already have a closed-source enterprise search engine (and our own home-grown Entrez search used for PubMed). Since we can already do Federated Search with the above, I am evaluating the effort of adding such to Apache Solr. Because NLM data is used in the Open Relevance Project, we actually have the relevancy judgments to decide whether we have done a good job of it. I obviously think it would be fun to add Federated Search to Apache Solr. Standard disclosure: my opinions do not represent the opinions of NIH or NLM. Fun is no reason to spend tax-payer money. Enhancing Apache Solr would reduce the risk of putting all our eggs in one basket, and there may be some other relevant benefits. We do use Apache Solr here for more than one other project... so keep up the good work even if my working group decides to go with the closed-source solution.
Re: SolrCloud Set up
On 8/28/2013 11:56 AM, Jared Griffith wrote: What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Example C has everything on localhost. That's not really redundant. If you put example C on separate hosts, then it would very likely be redundant. You do not need (or want) a load balancer for zookeeper. If your Solr client code is not written in Java, you might want a load balancer for Solr, though. The java client (SolrJ, specifically the CloudSolrServer class) doesn't require a load balancer for HA. For a SolrCloud setup with HA, you need at least three separate physical hosts. A bare minimum setup has two capable servers that will each run one copy of Solr and one copy of Zookeeper. The third can be less capable and run zookeeper only. If you want to run Solr on all three, you certainly can. You can also add additional nodes for Solr. Additional zookeeper nodes are not required, but if you want them, be sure you have an odd number. You would download zookeeper and follow the instructions to create a three-node replicated setup: http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper For Solr, it's best if you run the latest version, currently 4.4.0. You can put your zkHost parameter (and other solrcloud parameters) in solr.xml. Your zkHost parameter should look like the following, where you use the correct port(s) and a value for the chroot (/mysolr1) that names your cloud: server1:2181,server2:2181,server3:2181/mysolr1 A note on the chroot functionality: By using a different chroot value for each one, you can use one zookeeper ensemble for more than one SolrCloud. SolrCloud doesn't put much load on zookeeper. If you have hundreds of Solr nodes that go up and down a lot, the load would be higher. It's my opinion that you should not use the numShards parameter on the commandline or in solr.xml, or use the startup options for bootstrapping a config. I think it's better to use the zkCli upconfig option to upload config sets to zookeeper, and specify the collection.configName, numShards, and replicationFactor via the Collections API CREATE action. If you want to go to the freenode IRC system (www.freenode.net) and join the #solr channel, you can get more interactive help. I have no problem sticking with the mailing list either. Thanks, Shawn
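For reference, the upload-then-create sequence Shawn describes looks roughly like this (hostnames, config and collection names are placeholders; zkcli.sh ships under example/cloud-scripts in the 4.x download):

cloud-scripts/zkcli.sh -zkhost server1:2181,server2:2181,server3:2181/mysolr1 -cmd upconfig -confdir /path/to/conf -confname myconf
curl 'http://server1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf'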
Re: Different Responses for 4.4 and 3.5 solr index
We've been seeing changes in our rankings as well. I don't have a definite answer yet, since we're waiting on an index rebuild, but our current working theory is that the change to default omitNorms=true for primitive types may have had an effect, possibly due to follow-on confusion: our developers may have omitted norms from some other fields they shouldn't have. -Mike On 08/26/2013 09:46 AM, Stefan Matheis wrote: Did you check the scoring? (use fl=*,score to retrieve it) .. additionally debugQuery=true might provide more information about how the score was calculated. - Stefan On Monday, August 26, 2013 at 12:46 AM, Kuchekar wrote: Hi, The response from 4.4 and 3.5 in the current scenario differs in the sequence in which results are given back to us. For example: Response from 3.5 solr is: id:A, id:B, id:C, id:D ... Response from 4.4 solr is: id:C, id:A, id:D, id:B... Looking forward to your reply. Thanks. Kuchekar, Nilesh On Sun, Aug 25, 2013 at 11:32 AM, Stefan Matheis matheis.ste...@gmail.com wrote: Kuchekar (hope that's your first name?) you didn't tell us how they differ. Do you get an actual error? Or does the result contain documents you didn't expect? Or the other way round, that some are missing you'd expect to be there? - Stefan On Sunday, August 25, 2013 at 4:43 PM, Kuchekar wrote: Hi, We get a different response when we query 4.4 and 3.5 solr using the same query params. My query params are as follows: facet=true facet.mincount=1 facet.limit=25 qf=content^0.0+p_last_name^500.0+p_first_name^50.0+strong_topic^0.0+first_author_topic^0.0+last_author_topic^0.0+title_topic^0.0 wt=javabin version=2 rows=10 f.affiliation_org.facet.limit=150 fl=p_id,p_first_name,p_last_name start=0 q=Apple facet.field=affiliation_org fq=table:profile fq=num_content:[*+TO+1500] fq=name:Apple The content in both (solr 4.4 and solr 3.5) is the same. The solrconfig.xml from 3.5 and 4.4 are similarly constructed. Is there something I am missing that might have been changed in 4.4, which might be causing this issue? The qf params look the same. Looking forward to your reply. Thanks. Kuchekar, Nilesh
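For reference, Stefan's suggestion amounts to a request like this (host and core are placeholders): http://localhost:8983/solr/collection1/select?q=Apple&fl=*,score&debugQuery=true -- the explain output in the debug section then shows exactly which scoring factors (tf, idf, fieldNorm, coord) differ between the 3.5 and 4.4 results for the same document.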
Re: SolrCloud Set up
We are using Java here. Are you saying that the Solr java client would be aware of the multiple zookeepers and would thus do health / host checks on each zookeeper instance in turn until it got one that is working (assuming that you have one or more zookeepers down)? If that's the case, holy awesome. I'll probably jump in IRC when I actually tackle this set up later on today. On Wed, Aug 28, 2013 at 11:36 AM, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 11:56 AM, Jared Griffith wrote: What is the recommended way to set up Solr so it's HA and fault tolerant? I'm assuming it would be the SolrCloud set up. I'm guessing that Example C (http://wiki.apache.org/solr/SolrCloud) would be the optimum set up. If so, would one set up a load balancer (like f5 or whatever) to direct requests to the Zookeeper instances? Example C has everything on localhost. That's not really redundant. If you put example C on separate hosts, then it would very likely be redundant. You do not need (or want) a load balancer for zookeeper. If your Solr client code is not written in Java, you might want a load balancer for Solr, though. The java client (SolrJ, specifically the CloudSolrServer class) doesn't require a load balancer for HA. For a SolrCloud setup with HA, you need at least three separate physical hosts. A bare minimum setup has two capable servers that will each run one copy of Solr and one copy of Zookeeper. The third can be less capable and run zookeeper only. If you want to run Solr on all three, you certainly can. You can also add additional nodes for Solr. Additional zookeeper nodes are not required, but if you want them, be sure you have an odd number. You would download zookeeper and follow the instructions to create a three-node replicated setup: http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper For Solr, it's best if you run the latest version, currently 4.4.0. You can put your zkHost parameter (and other solrcloud parameters) in solr.xml. Your zkHost parameter should look like the following, where you use the correct port(s) and a value for the chroot (/mysolr1) that names your cloud: server1:2181,server2:2181,server3:2181/mysolr1 A note on the chroot functionality: By using a different chroot value for each one, you can use one zookeeper ensemble for more than one SolrCloud. SolrCloud doesn't put much load on zookeeper. If you have hundreds of Solr nodes that go up and down a lot, the load would be higher. It's my opinion that you should not use the numShards parameter on the commandline or in solr.xml, or use the startup options for bootstrapping a config. I think it's better to use the zkCli upconfig option to upload config sets to zookeeper, and specify the collection.configName, numShards, and replicationFactor via the Collections API CREATE action. If you want to go to the freenode IRC system (www.freenode.net) and join the #solr channel, you can get more interactive help. I have no problem sticking with the mailing list either. Thanks, Shawn -- Jared Griffith Linux Administrator, PICS Auditing, LLC P: (949) 936-4574 C: (909) 653-7814 http://www.picsauditing.com 17701 Cowan #140 | Irvine, CA | 92614 Join PICS on LinkedIn and Twitter! https://twitter.com/PICSAuditingLLC
Re: SolrCloud Set up
On 8/28/2013 1:36 PM, Jared Griffith wrote: We are using Java here. Are you saying that the Solr java client would be aware of the multiple zookeepers and would thus do health / host checks on each zookeeper instance in turn until it got one that is working (assuming that you have one or more zookeepers down)? If that's the case, holy awesome. I'll probably jump in IRC when I actually tackle this set up later on today. Yes, the Java client is completely aware of the cloud state in realtime. When you create a CloudSolrServer object, you don't tell it where Solr is, you tell it where zookeeper is - using the same (potentially multi-host and including a chroot) zkHost parameter that you give to Solr. Thanks, Shawn
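A minimal SolrJ sketch of that (the zkHost string reuses Shawn's example; the collection name is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // The client is given zookeeper, not a Solr URL; it watches the
        // cluster state and routes requests to live nodes on its own.
        CloudSolrServer solr = new CloudSolrServer("server1:2181,server2:2181,server3:2181/mysolr1");
        solr.setDefaultCollection("mycollection"); // placeholder name
        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}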
purge and optimize questions for solr 4.4.0
We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes with 500 million documents. We're using custom sharding, where we direct all documents with a specific business date to a specific shard. With Solr 3.6 we used this command to optimize documents on the master and then let replication take care of updating documents on slave1 and slave2: curl --proxy 'http://prod-solr-master.xyz.com:8983/solr/core1/update?optimize=true&waitFlush=false&maxSegments=1' How do we optimize documents for all shards in SolrCloud? Do we have to fire five different optimize commands to all five leaders? Also, it looks like optimize will be going away and might no longer be necessary - see SOLR-3141 (https://issues.apache.org/jira/browse/SOLR-3141). Is that true? With Solr 3.6 we purge millions of documents every month and then run optimize. We're planning to do the same with the SolrCloud setup. With Solr 3.6 we used the following curl command to purge documents. Now with multiple shards can we still use the same command? We will definitely experiment with our QA setup of 500 million documents. curl --proxy http://prod-solr-master.xyz.com:8983/solr/core1/update?commit=true -H 'Content-Type: text/xml' --data-binary '<delete><query>busdate_i:[* TO 20130208]</query></delete>' Thanks!
coordination factor in between query terms
How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav
RE: coordination factor in between query terms
Just boost the term you want to show up higher in your results. http://wiki.apache.org/solr/SolrRelevancyCookbook#Boosting_Ranking_Terms - Greg -----Original Message----- From: anirudh...@gmail.com [mailto:anirudh...@gmail.com] On Behalf Of Anirudha Jadhav Sent: Wednesday, August 28, 2013 3:36 PM To: solr-user@lucene.apache.org Subject: coordination factor in between query terms How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav
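For illustration, the term-boost syntax from that cookbook page looks like q=termA termB^4 (the boost value is arbitrary) -- matches on termB then count four times as much toward the score as matches on termA.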
Re: coordination factor in between query terms
I don't know which term to boost. I just need documents that contain both terms to be ranked higher, but since doc1 is shorter and has an exact match on the term, per tf-idf it is ranked higher. On Wed, Aug 28, 2013 at 4:47 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Just boost the term you want to show up higher in your results. http://wiki.apache.org/solr/SolrRelevancyCookbook#Boosting_Ranking_Terms - Greg -----Original Message----- From: anirudh...@gmail.com [mailto:anirudh...@gmail.com] On Behalf Of Anirudha Jadhav Sent: Wednesday, August 28, 2013 3:36 PM To: solr-user@lucene.apache.org Subject: coordination factor in between query terms How can I specify a coordination factor between query terms? E.g. q="termA termB" doc1 = { field: termA } doc2 = { field: termA termB termC termD } I want doc2 scored higher than doc1 -- Anirudha P. Jadhav -- Anirudha P. Jadhav
Re: coordination factor in between query terms
1) Coordination factor is controlled by the Similarity you have configured -- there is no request-time option to affect the coordination function. The default Similarity already includes a simple ratio coord factor... https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#coord%28int,%20int%29 2) your example query includes quote characters, which makes it a phrase query, not a simple boolean query, so in that case both termA and termB will be required, and must be within the default slop number of term positions away from each other. If you instead used a query param of: q=termA termB ... then you'd see the coord factor come into play 3) in addition to the coord factor is the issue of fieldNorms -- by default, text fields include a norm factor that takes into account the length of a field, so in spite of the coord factor a very short field (ie: doc1) might score higher than a long field (ie: doc2) even if the long field has more matches -- if you don't want this, just use omitNorms=true on your field. : How can I specify a coordination factor between query terms : eg. q="termA termB" : : doc1 = { field: termA } : doc2 = { field: termA termB termC termD } : : I want doc2 scored higher than doc1 : : -- : Anirudha P. Jadhav : -Hoss
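For example, disabling length normalization per Hoss's point 3 is a one-attribute schema.xml change (the field name is from the example; text_general is assumed to be the stock 4.x field type):

<field name="field" type="text_general" indexed="true" stored="true" omitNorms="true" />

Note this is an index-time setting: documents have to be reindexed before it takes effect.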
Re: ICUTokenizer class not found with Solr 4.4
Hi Shawn, I'm going to add this to your JIRA unless you think that it would be good to open another issue. The issue for me is that making a ./lib in the instanceDir is documented as working in several places and has worked in previous versions of Solr, for example Solr 4.1.0. If I make a ./lib directory in Solr Home, all works just fine. However, according to the documentation, making a ./lib directory in the instanceDir should work, and in fact in Solr 4.1.0 it works just fine. So the question for me is whether making a ./lib directory as documented in collection1/conf/solrconfig.xml and collection1/README.txt is supposed to work in Solr 4.4, but due to a bug it is not working. If it is not supposed to work, then the documentation needs fixing and some note needs to be made about upgrading from previous versions of Solr. Do you think I should open another JIRA and link it to yours, or just add this information (i.e. other scenarios where class loading is not working) to your JIRA? Details below. Tom The documentation in the collection1/conf directory is confusing. For example, the collection1/conf/solrconfig.xml file says you should put a ./lib dir in your instanceDir. (Am I correct that an instanceDir refers to the core?) On the other hand, the documentation in the collection1/README.txt is confusing about whether it is talking about the instanceDir or Solr Home. For example, in collection1/conf/solrconfig.xml there is this comment: If a ./lib directory exists in your instanceDir, all files found in it are included as if you had used the following syntax... <lib dir="./lib" /> Also in collection1/conf/README.txt it is suggested that you use ./lib, but that README.txt file needs editing, as it is very confusing about whether it is talking about Solr Home or the instance directory in the text excerpted below. I would assume that the conf and data directories have to be subdirectories of the instanceDir, since I assume they are set per core. So in the excerpt below, the discussion of the sub-directories should apply to the instanceDir, not Solr Home. Example SolrCore Instance Directory: This directory is provided as an example of what an Instance Directory should look like for a SolrCore. It's not strictly necessary that you copy all of the files in this directory when setting up a new SolrCore, but it is recommended. Basic Directory Structure: The Solr Home directory typically contains the following sub-directories: conf/ This directory is mandatory and must contain your solrconfig.xml and schema.xml. Any other optional configuration files would also be kept here. data/ This directory is the default location where Solr will keep your ... lib/ On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey s...@elyograg.org wrote: On 8/28/2013 9:34 AM, Tom Burton-West wrote: I think I am running into the same bug, but the symptoms are a bit different. I'm wondering if it makes sense to file a separate linked bug report. The workaround is to remove sharedLib from solr.xml, The solr.xml that comes out-of-the-box does not have a sharedLib. I am using Solr 4.4 out-of-the-box, with the exception that I set up a lib directory in example/solr/collection1. I did not change solr.xml from the out-of-the-box version. There is no mention of lib in the out-of-the-box example/solr/solr.xml. I did not change the out-of-the-box solrconfig.xml. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class not found error.
Should I open another bug report or comment on the existing report? I have never heard of using ${instanceDir}/lib for jars. That doesn't mean it won't work, but I have never seen it mentioned anywhere. I have only ever put the lib directory in solr.home, where solr.xml is. Did you try that? If you have seen documentation for collection1/lib, then there may be a doc bug, another dimension to the bug already filed, or a new bug. Do you see log entries saying your jars in collection/lib are loaded? If you do, then I think it's probably another dimension to the existing bug. Thanks, Shawn
Re: coordination factor in between query terms
My bad, typo there -- I meant q=termA termB (without quotes). I know omitNorms is an index-time field option; can it be applied at query time also? Are there other solutions to this kind of problem? Curious. On Wed, Aug 28, 2013 at 4:52 PM, Chris Hostetter hossman_luc...@fucit.org wrote: 1) Coordination factor is controlled by the Similarity you have configured -- there is no request-time option to affect the coordination function. The default Similarity already includes a simple ratio coord factor... https://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#coord%28int,%20int%29 2) your example query includes quote characters, which makes it a phrase query, not a simple boolean query, so in that case both termA and termB will be required, and must be within the default slop number of term positions away from each other. If you instead used a query param of: q=termA termB ... then you'd see the coord factor come into play 3) in addition to the coord factor is the issue of fieldNorms -- by default, text fields include a norm factor that takes into account the length of a field, so in spite of the coord factor a very short field (ie: doc1) might score higher than a long field (ie: doc2) even if the long field has more matches -- if you don't want this, just use omitNorms=true on your field. : How can I specify a coordination factor between query terms : eg. q="termA termB" : : doc1 = { field: termA } : doc2 = { field: termA termB termC termD } : : I want doc2 scored higher than doc1 : : -- : Anirudha P. Jadhav : -Hoss -- Anirudha P. Jadhav
Re: ICUTokenizer class not found with Solr 4.4
On 8/28/2013 2:59 PM, Tom Burton-West wrote: Do you think I should open another JIRA and link it to yours or just add this information (i.e. other scenarios where class loading not working) to your JIRA? The documentation does sound confused. My personal opinion (which may not be what ends up happening) is that ${instanceDir}/lib shouldn't continue to be supported, at least not implicitly without config, mostly because each instanceDir can be dynamically destroyed (and added, with SolrCloud) by the core and collection APIs. I am guessing that you are seeing the same issue that has already been documented. The little research I've done into this suggests that some classes (ICUTokenizer being the specific example here) don't like it when Solr replaces the classloader to add additional jars. This is probably the case no matter which part of the config (solr.xml or solrconfig.xml) tells Solr to replace the classloader. The safest thing I've found is to use the lib directory off solr.home (which gets automatically used) and don't specify any additional lib directories anywhere in the configuration. Thanks, Shawn
What does it mean when a shard is down in solr4.4?
I have a 3-node SolrCloud cluster with 3 shards for each collection/core. At times, when I rebuild the index (say on collectionA on nodeA, shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down. Observations: 1. Other collections on nodeA work. 2. collectionA on nodeB and nodeC works. 3. nodeA's solr admin is accessible too. So my questions are: 1. What does it really mean when a shard goes down? 2. How can I recover from that state? Solr cloud screenshot: http://i.imgur.com/2TgKXiC.png -- Thanks, -Utkarsh
Re: Solr show total row count in response of full import
: It would be nice if you could receive a total row count like : : str name=Total Documents10100/str : : With this information we could add another information like : : str name=Imported in Percent 62.91/str : : This would make it easier to generate a progress bar for the end user. I don't think that's possible -- DIH has no way of knowing in advance the total number of documents that the DataSources are going to produce. -Hoss
Re: Filter cache pollution during sharded edismax queries
Ken ... I'm not really sure I'm understanding what you're trying to describe. Can you give the full details of a concrete example of what you are seeing? * full requestHandler config * example of query issued by client * every request logged on each shard * contents of filterCache and queryResultCache after client's query finishes -Hoss
Re: purge and optimize questions for solr 4.4.0
: We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes with 500 : million documents. We're using custom sharding where we direct all : documents with specific business date to specific shard. ... : How do we optimize documents for all shards in Solr Cloud? Do we have to : fire five different optimize commands to all five leaders? Also, looks Commands like optimize and deleteByQuery are automatically propagated to all shards -- you only need to send that command to one node in the collection. : like optimize will be going away and might no longer be necessary - see : SOLR-3141 (https://issues.apache.org/jira/browse/SOLR-3141) Is that true? It's still up for debate, and as you can see from the comments it hasn't had much traction lately. Even if, at some point in the future, sending a command named optimize ceases to work, the underlying functionality of being able to say force merge down to N segments will always exist under some name, provided you don't go out of your way to use a MergePolicy that ignores that command. : With Solr 3.6 we used following curl command to purge documents. Now : with multiple shards can we still use the same command? We will As mentioned above, a deleteByQuery command can be sent to a single node and it will be propagated automatically. However: if you are already using custom sharding to shard by date, then a blanket deleteByQuery across all shards may not be necessary -- you may find it easier/faster/cleaner to just delete the shards you no longer need as the data in them expires ... https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-DeleteaShard -Hoss
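For the expiry case, dropping a whole date-based shard is one Collections API call, along these lines (the shard name 20130208 is made up; with a custom/implicit router the shard can be deleted directly, otherwise it must be inactive first):

curl 'http://prod-solr-master.xyz.com:8983/solr/admin/collections?action=DELETESHARD&collection=core1&shard=20130208'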
Feedback requested on design/implementation/extent of a proposed Solr configuration REST API
For mailing list participants on solr-user who aren't subscribed to the dev list: I've created a JIRA issue to discuss adding a Solr configuration REST API: https://issues.apache.org/jira/browse/SOLR-5200. I'm interested in feedback of any kind on this proposal, preferably on the above-linked JIRA issue, but here on the solr-user mailing list would also work. There are lots of details, so I don't expect quick resolution, but any input about rationale or use cases for inclusion or exclusion of configuration item runtime modifiability would be very useful. Thanks, Steve
Re: why does a node switch state ?
Hi Daniel, thank you very much for your reply. However, my zkClientTimeout in solr.xml is 30s: <cores adminPath="/admin/cores" defaultCoreName="doc" host="${host:215.lead.index.com}" hostPort="${jetty.port:9090}" hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:3}" leaderVoteWait="${leaderVoteWait:2}"> ... </cores> -- View this message in context: http://lucene.472066.n3.nabble.com/why-does-a-node-switch-state-tp4086939p4087142.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: why does a node switch state ?
Kindly stop me from solr mail chain. Thanks and regards, Veena -- Regards, Veena Rani P N Banglore. 9538440458
RE: SOLR 4.2.1 - High Resident Memory Usage
So we actually had 3 of the 6 machines automatically restart the SOLR service as memory pressure was too high; 2 were by SIGABRT and one was the Java OOM killer. I dropped a pmap on one of the Solr services before it died. Basically I need to figure out what the other direct memory references outside of the heap are (marked with the arrows in the image below). Anyone have any insight? http://lucene.472066.n3.nabble.com/file/n4087148/pmap_03.png -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-4-2-1-High-Resident-Memory-Usage-tp4086866p4087148.html Sent from the Solr - User mailing list archive at Nabble.com.
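In case it helps anyone debugging the same thing, one way to pull out the biggest resident mappings (the start.jar match is an assumption about a stock Jetty launcher with a single Solr process on the box):

pmap -x $(pgrep -f start.jar) | sort -n -k3 | tail -20

Anonymous regions beyond the Java heap are typically thread stacks, DirectByteBuffer allocations, and the JVM's own code and metadata, while mappings backed by index files (MMapDirectory, the usual 4.x default on 64-bit Linux) show their file paths in the last column.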