Re: Special character and wildcard matching
Thanks, Jack. I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154

On Tue, Feb 24, 2015 at 11:43 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
Thanks. That at least verifies that the accented e is stored in the field. I don't see anything wrong here, so it is as if the Lucene prefix query was mapping the accented characters. It's not supposed to do that, but... Go ahead and file a Jira bug. Include all of the details that you provided in this thread.
-- Jack Krupansky

On Tue, Feb 24, 2015 at 2:35 PM, Arun Rangarajan arunrangara...@gmail.com wrote:
Exact query: /select?q=raw_name:beyonce*&wt=json&fl=raw_name
Response:
{"responseHeader": {"status": 0, "QTime": 0, "params": {"fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"}}, "response": {"numFound": 2, "start": 0, "docs": [{"raw_name": "beyoncé"}, {"raw_name": "beyoncé"}]}}

On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
Please post the info I requested - the exact query, and the Solr response.
-- Jack Krupansky

On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan arunrangara...@gmail.com wrote:
In our case, the lower-casing is happening in custom Java indexer code, via Java's String.toLowerCase() method. I used the analysis tool in the Solr admin (with Jetty). I believe the raw bytes explain this. Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and beyoncé in file beyonce_with_spl_chars.JPG.
Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
So when you look at the bytes, it seems to explain why beyonce* matches beyoncé. I tried your approach with a KeywordTokenizer followed by a LowerCaseFilter, but I see the same behavior.

On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com wrote:
But how is that lowercasing occurring? I mean, solr.StrField doesn't do that. Some containers default to automatically mapping accented characters, so that the accented e would then get indexed as a normal e, and then your wildcard would match it, and an accented e in a query would get mapped as well and then match the normal e in the index. What does your query response look like? This blog post explains that problem: http://bensch.be/tomcat-solr-and-special-characters
Note that you could make your string field a text field with the keyword tokenizer and then filter it for lower case, such as when the user query might have a capital B. A string field is most appropriate when the field really is 100% raw.
-- Jack Krupansky

On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan arunrangara...@gmail.com wrote:
Yes, it is a string field and not a text field.
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<field name="raw_name" type="string" indexed="true" stored="true" />
Lower-casing is done to do case-insensitive matching.

On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky jack.krupan...@gmail.com wrote:
Is it really a string field - as opposed to a text field? Show us the field and field type. Besides, if it really were a raw name, wouldn't that be a capital B?
-- Jack Krupansky

On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan arunrangara...@gmail.com wrote:
I have a string field raw_name like this in my document: {"raw_name": "beyoncé"} (Notice that the last character is a special character.) When I issue this wildcard query: q=raw_name:beyonce* i.e. with the last character simply being the ASCII 'e', Solr returns the above document. How do I prevent this?
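The raw bytes in the dump above explain the match at the byte level: cc 81 is the UTF-8 encoding of U+0301 (COMBINING ACUTE ACCENT), so the indexed term is in decomposed (NFD) form and literally begins with the seven ASCII bytes of "beyonce". A short Python sketch (an illustration, not from the thread):

```python
import unicodedata

# Decomposed (NFD) form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfd = "beyonce\u0301"
# Composed (NFC) form: the single code point U+00E9 ('é').
nfc = unicodedata.normalize("NFC", nfd)

print(nfd.encode("utf-8").hex(" "))  # 62 65 79 6f 6e 63 65 cc 81 -- matches the dump
print(nfd.startswith("beyonce"))     # True: a byte-level prefix query matches
print(nfc.startswith("beyonce"))     # False: the composed form would not
```

This suggests the fix is to normalize (or strip accents) consistently at index and query time, e.g. with an ASCIIFoldingFilter or ICU normalization, rather than relying on the raw string bytes.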
Re: Special character and wildcard matching
On 24 February 2015 at 15:50, Jack Krupansky jack.krupan...@gmail.com wrote: It's a string field, so there shouldn't be any analysis. (read back in the thread for the field and field type.) It's a multi-term expansion. There is _some_ analysis one way or another :-) Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
RE: how to debug solr performance degradation
Tang, Rebecca [rebecca.t...@ucsf.edu] wrote:
[12-15 second response time instead of 0-3]
Solr index size: 183G
Documents in index: 14,364,201
We just have a single Solr box. It has 100G memory, 500G hard drive, 16 CPUs.

The usual culprit is memory (if you are using a spinning drive as your storage). It appears that you have enough raw memory, though. Could you check how much memory the machine has free for disk caching? If it is a relatively small amount, let's say below 50GB, then please provide a breakdown of what the memory is used for (a very large JVM heap, for example).

I want to pinpoint where the performance issue is coming from. Could I have some suggestions/help on how to benchmark/debug Solr performance issues?

Rough checking of IOWait and CPU load is a fine starting point. If it is CPU load, then you can turn on debug in the Solr admin, which should tell you where the time is spent resolving the queries. If it is IOWait, then ensure a lot of free memory for disk cache and/or improve your storage speed (SSDs instead of spinning drives, local storage instead of remote).

- Toke Eskildsen, State and University Library, Denmark.
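Toke's check ("how much memory the machine has free for disk caching") can be scripted. A sketch that parses /proc/meminfo-style text on Linux; the helper name and the sample numbers are illustrative, not from the thread:

```python
def cache_headroom_gb(meminfo_text: str) -> float:
    """Rough GB the kernel can use for the page cache, from /proc/meminfo-style text.

    Counts MemFree + Cached as memory available for caching index files.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # /proc/meminfo values are in kB
    free_kb = fields.get("MemFree", 0) + fields.get("Cached", 0)
    return free_kb / (1024 * 1024)

# Hypothetical numbers for a 100 GB box with a big JVM heap:
sample = """MemTotal:       104857600 kB
MemFree:         2097152 kB
Cached:         31457280 kB"""
print(cache_headroom_gb(sample))  # 32.0 -- well short of a 183 GB index
```

On a live box, `free -g` or reading /proc/meminfo directly gives the same picture.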
Re: Special character and wildcard matching
It's a string field, so there shouldn't be any analysis. (Read back in the thread for the field and field type.)
-- Jack Krupansky

On Tue, Feb 24, 2015 at 3:19 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
What happens if the query does not have wildcard expansion (*)? If the behavior is correct, then the issue is somehow with the MultitermQueryAnalysis (a hidden automatically generated analyzer chain): http://wiki.apache.org/solr/MultitermQueryAnalysis Which would still make it a bug, but at least the cause could be narrowed down.
Regards, Alex.
Re: Special character and wildcard matching
Exact query: /select?q=raw_name:beyonce*&wt=json&fl=raw_name
Response:
{"responseHeader": {"status": 0, "QTime": 0, "params": {"fl": "raw_name", "q": "raw_name:beyonce*", "wt": "json"}}, "response": {"numFound": 2, "start": 0, "docs": [{"raw_name": "beyoncé"}, {"raw_name": "beyoncé"}]}}

On Tue, Feb 24, 2015 at 11:01 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
Please post the info I requested - the exact query, and the Solr response.
-- Jack Krupansky
Re: performance issues with geofilt
Hi Dirk,

The RPT field type can be used for distance sorting/boosting, but it's a memory pig when used as such, so don't do it unless you have to. You only have to if you have a multi-valued point field. If you have single-valued points, use LatLonType specifically for distance sorting.

Your sample query doesn't parse correctly for multiple reasons. You can't put a query into the sort parameter as you have done it. You have to do sort=query($sortQuery) asc&sortQuery=… or a slightly different equivalent variation. Let's say you do that… still, I don't recommend this syntax when you simply want distance sort; just use geodist(), as in: sort=geodist() asc. If you want to use this syntax, such as to sort by recipDistance, then it would look like this (note the filter=false hint to the spatial query parser, which is otherwise unaware it shouldn't bother actually searching/filtering):
sort=query($sortQuery) desc&sortQuery={!geofilt score=recipDistance filter=false sfield=geometry pt=51.3,12.3 d=1.0}
If you are able to use geodist() and still find it slow, there are alternatives involving projected data and then simple Euclidean calculations with sqedist(): https://wiki.apache.org/solr/FunctionQuery#sqedist_-_Squared_Euclidean_Distance

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, Feb 24, 2015 at 6:12 AM, dirk.thalh...@bkg.bund.de wrote:
Hello, we are using Solr 4.10.1. There are two cores for different use cases, with around 20 million documents (location descriptions) per core. Each document has a geometry field which stores a point and a bbox field which stores a bounding box. Both fields are defined with:
<fieldType name="t_geometry" class="solr.SpatialRecursivePrefixTreeFieldType" spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory" geo="true" distErrPct="0.025" maxDistErr="0.9" units="degrees" />
I'm currently trying to add a location search (find all documents around a point). My intention is to add this as a filter query, so that the user is able to do an additional keyword search. These are the query parameters so far:
q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}
To sort the documents by their distance to the requested point, I added the following sort parameter:
sort={!geofilt sort=distance sfield: geometry pt=51.370570625523,12.369290471603 d=1.0} asc
Unfortunately I'm experiencing some major performance/memory problems here. The first distance query on a core takes over 10 seconds. In my first setup, the same request to the second core completely blocked the server and caused an OutOfMemoryError. I had to increase the memory to 16 GB, and now it seems to work for the geometry field. Anyhow, the first request after a server restart takes some time, and when I try it with the bbox field after a request on the geometry field in both cores, the server blocks again. Can anyone explain why the distance sort needs so much memory? Can this be optimized?
Kind regards, Dirk
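David's recommended form can be assembled as ordinary request parameters. A sketch using Dirk's values; everything else is illustrative, and whether geodist() can sort on the RPT geometry field is subject to David's memory caveat (LatLonType is the recommended type for single-valued distance sorting):

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "fq": ["typ:strasse",
           "{!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}"],
    "sort": "geodist() asc",  # plain distance sort, per David's advice
    # geodist() with no arguments reads these top-level parameters:
    "sfield": "geometry",
    "pt": "51.370570625523,12.369290471603",
    "d": "1.0",
}
# doseq=True emits the fq list as two separate fq= parameters.
query_string = urlencode(params, doseq=True)
print("/select?" + query_string)
```

Note the two fq clauses stay as filters (cacheable, no scoring) while the sort is expressed separately, instead of stuffing a {!geofilt} query into the sort parameter.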
how to debug solr performance degradation
Our solr index used to perform OK on our beta production box (anywhere between 0-3 seconds to complete any query), but today I noticed that the performance is very bad (queries take between 12-15 seconds). I haven't updated the solr index configuration (schema.xml/solrconfig.xml) lately. All that's changed is the data: every month, I rebuild the solr index from scratch and deploy it to the box. We will eventually go to incremental builds. But for now, all indexes are built from scratch.
Here are the stats:
Solr index size: 183G
Documents in index: 14,364,201
We just have a single Solr box. It has 100G memory, 500G hard drive, 16 CPUs.
I don't know when the performance degradation started. Today was the first time I noticed it, but I haven't used this box for a while. I want to pinpoint where the performance issue is coming from. Could I have some suggestions/help on how to benchmark/debug Solr performance issues?
Thank you,
Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu
Re: 8 Shards of Cloud with 4.10.3.
I guess the place to start is the Reference Guide: https://cwiki.apache.org/confluence/display/solr/SolrCloud

Generally speaking, when you start Solr with any sort of Zookeeper, you've entered cloud mode, which essentially means that Solr is now capable of organizing cores into groups that represent shards, and groups of shards are coordinated into collections. Additionally, Zookeeper allows multiple Solr installations to be coordinated together to serve these collections with high availability.

If you're just trying to gain parallelism on a single machine by using multiple cores, you don't specifically need cloud mode or collections. You can create multiple cores, distribute your documents manually to each core, and then do a distributed search, à la https://wiki.apache.org/solr/DistributedSearch. The downside here is that you're on your own in terms of distributing the documents at write time, but on the other hand, you don't have to maintain a Zookeeper ensemble or devote brain cells to understanding collections/shards/etc.

Michael Della Bitta
Senior Software Engineer
o: +1 646 532 3062
appinions inc. "The Science of Influence Marketing"
18 East 41st Street New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/

On Tue, Feb 24, 2015 at 3:21 PM, Benson Margulies bimargul...@gmail.com wrote:
On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:
Benson: Are you trying to run independent invocations of Solr for every node? Otherwise, you'd just want to create an 8-shard collection with maxShardsPerNode set to 8 (or more, I guess).
Michael Della Bitta, I don't want to run multiple invocations. I just want to exploit hardware cores with shards. Can you point me at doc for the process you are referencing here? I confess to some ongoing confusion between cores and collections.
--benson

On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com wrote:
With so much of the site shifted to 5.0, I'm having a bit of trouble finding what I need, and so I'm hoping that someone can give me a push in the right direction. On a big multi-core machine, I want to set up a configuration with 8 (or perhaps more) nodes treated as shards. I have some very particular solrconfig.xml and schema.xml that I need to use. Could some kind person point me at a relatively step-by-step layout? This is all on Linux; I'm happy to explicitly run Zookeeper.
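The manual alternative Michael describes (several local cores queried with a shards parameter) looks roughly like this; the core names and port are hypothetical:

```python
from urllib.parse import urlencode

# Eight local cores acting as shards (hypothetical names; shards entries
# use host:port/path form, without the http:// scheme).
cores = [f"localhost:8983/solr/core{i}" for i in range(8)]

params = {
    "q": "title:solr",
    "shards": ",".join(cores),  # the receiving core fans out and merges results
}
print("/solr/core0/select?" + urlencode(params))
```

At write time you would hash or round-robin documents across the eight cores yourself, which is exactly the bookkeeping SolrCloud collections would otherwise do for you.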
Re: Special character and wildcard matching
What happens if the query does not have wildcard expansion (*)? If the behavior is correct, then the issue is somehow with the MultitermQueryAnalysis (a hidden automatically generated analyzer chain): http://wiki.apache.org/solr/MultitermQueryAnalysis Which would still make it a bug, but at least the cause could be narrowed down.
Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 24 February 2015 at 14:56, Arun Rangarajan arunrangara...@gmail.com wrote:
Thanks, Jack. I have filed a ticket: https://issues.apache.org/jira/browse/SOLR-7154
Re: how to debug solr performance degradation
On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
Our solr index used to perform OK on our beta production box (anywhere between 0-3 seconds to complete any query), but today I noticed that the performance is very bad (queries take between 12-15 seconds). I haven't updated the solr index configuration (schema.xml/solrconfig.xml) lately. All that's changed is the data: every month, I rebuild the solr index from scratch and deploy it to the box. We will eventually go to incremental builds. But for now, all indexes are built from scratch. Here are the stats:
Solr index size: 183G
Documents in index: 14,364,201
We just have a single Solr box. It has 100G memory, 500G hard drive, 16 CPUs.

The bottom line on this problem, and I'm sure it's not something you're going to want to hear: you don't have enough memory available to cache your index. I'd plan on at least 192GB of RAM for an index this size, and 256GB would be better. Depending on the exact index schema, the nature of your queries, and how large your Java heap for Solr is, 100GB of RAM could be enough for good performance on an index that size ... or it might be nowhere near enough.

I would imagine that one of two things is true here, possibly both: 1) Your queries are very complex and involve accessing a very large percentage of the index data. 2) Your Java heap is enormous, leaving very little RAM for the OS to automatically cache the index.

Adding more memory to the machine, if that's possible, might fix some of the problems. You can find a discussion of the problem here: http://wiki.apache.org/solr/SolrPerformanceProblems If you have any questions after reading that wiki article, feel free to ask them.

Thanks, Shawn
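Shawn's sizing point can be made concrete with back-of-envelope arithmetic. Only the 100G/183G figures come from the thread; the heap sizes below are assumptions, since the thread never states the actual heap:

```python
def os_cache_gb(total_ram_gb: float, jvm_heap_gb: float, other_gb: float = 2.0) -> float:
    """RAM left for the OS page cache after the JVM heap and other processes."""
    return total_ram_gb - jvm_heap_gb - other_gb

index_gb = 183
for heap in (8, 32, 64):  # hypothetical heap sizes
    cache = os_cache_gb(100, heap)
    pct = 100 * cache / index_gb
    print(f"heap={heap}G -> {cache:.0f}G for page cache, ~{pct:.0f}% of the index")
```

Even in the best case here, barely half the 183G index fits in cache, which is consistent with Shawn's suggestion of 192-256GB of RAM for an index this size.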
Re: 8 Shards of Cloud with 4.10.3.
On 2/24/2015 1:21 PM, Benson Margulies wrote:
On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:
Benson: Are you trying to run independent invocations of Solr for every node? Otherwise, you'd just want to create an 8-shard collection with maxShardsPerNode set to 8 (or more, I guess).
Michael Della Bitta, I don't want to run multiple invocations. I just want to exploit hardware cores with shards. Can you point me at doc for the process you are referencing here? I confess to some ongoing confusion between cores and collections.

SolrCloud is designed around the idea that each machine runs one copy of Solr. Running multiple instances of Solr on one machine is usually a waste of resources, and can lead to problems with SolrCloud high availability (redundancy).

Here's a simple way of thinking about the terminology in SolrCloud: Collections are made up of one or more shards. Shards have one or more replicas. Each replica is a core. An important detail: for each shard, one of the replicas is elected leader. SolrCloud gets rid of the master and slave concepts.

Thanks, Shawn
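Shawn's hierarchy reduces to simple arithmetic; a tiny illustrative sketch (not from the thread):

```python
# collection -> shards -> replicas; each replica is one core.
def total_cores(num_shards: int, replication_factor: int) -> int:
    return num_shards * replication_factor

print(total_cores(8, 1))  # Benson's single-node layout: 8 cores, no redundancy
print(total_cores(8, 2))  # the same collection with replicationFactor=2: 16 cores
```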
Re: Special character and wildcard matching
Thanks. That at least verifies that the accented e is stored in the field. I don't see anything wrong here, so it is as if the Lucene prefix query was mapping the accented characters. It's not supposed to do that, but... Go ahead and file a Jira bug. Include all of the details that you provided in this thread.
-- Jack Krupansky
Re: 8 Shards of Cloud with 4.10.3.
On Tue, Feb 24, 2015 at 4:27 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Unfortunately, this is all 5.1 and instructs me to run the 'start from scratch' process.
a) check out the left nav of any ref guide webpage, which has a link to Older Versions of this Guide (PDF)
b) i'm not entirely sure i understand what you're asking, but i'm guessing you mean...
* you have a fully functional individual instance of Solr, with a single core
* you only want to run that one single instance of the Solr process
* you want that single Solr process to be a SolrCloud of one node, but replace your single core with a collection that is divided into 8 shards.
* presumably: you don't care about replication since you are only trying to run one node.
What you want to look into (in the 4.10 ref guide) is how to bootstrap a SolrCloud instance from a non-SolrCloud node -- i.e.: start up ZooKeeper, tell Solr to take the configs from your single core and upload them to ZK as a configset, and register that single core as a collection. That should give you a single instance of SolrCloud, with a single collection, consisting of one shard (your original core). Then you should be able to use the SPLITSHARD command to split your single shard into 2 shards, and then split them again, etc. (i don't think you can split directly to 8 sub-shards with a single command).
FWIW: unless you no longer have access to the original data, it would almost certainly be a lot easier to just start with a clean install of Solr in cloud mode, then create a collection with 8 shards, then re-index your data.
-Hoss
http://www.lucidworks.com/

OK, now I'm good to go. Thanks.
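Hoss's split-until-you-have-8 recipe can be sketched as a sequence of Collections API calls. This illustrates the call pattern only; the collection name is hypothetical, and the subshard naming follows Solr's shardN_0/shardN_1 convention:

```python
def splitshard_plan(collection: str, target: int) -> list:
    """Sequence of SPLITSHARD calls to grow a 1-shard collection to `target` shards.

    Each SPLITSHARD call splits one shard into two subshards.
    """
    shards, calls = ["shard1"], []
    while len(shards) < target:
        shard = shards.pop(0)
        calls.append(f"/admin/collections?action=SPLITSHARD"
                     f"&collection={collection}&shard={shard}")
        shards += [f"{shard}_0", f"{shard}_1"]
    return calls

plan = splitshard_plan("mycoll", 8)
print(len(plan))  # 7 splits in total: 1 -> 2 -> 4 -> 8
```

This also makes Hoss's closing point vivid: seven sequential split operations versus one CREATE with numShards=8 and a re-index.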
Solr Spatial search with self-intersecting polygons
Hi, I'm using Solr 4.10.3. As field type in my schema.xml, I'm using location_rpt as described in the documentation. fieldType name=location_rpt class= solr.SpatialRecursivePrefixTreeFieldType geo=true distErrPct=0.025 maxDistErr=0.09 units=degrees / Everything works fine. I'm able to index points and include a POLYGON search in my query. However, it throws an exception in some cases where the polygon in my query might be self-intersecting at some point. org.apache.solr.common.SolrException: com.spatial4j.core.exception.InvalidShapeException: Self-intersection at or near point (23.5235632532521, 77.6266564124352, NaN) How could I clean up my polygon so that Solr doesn't throw an exception? Also, a possible solution would be to include only the boundary (in case the polygon intersects itself at some point). Is it possible to do this on the Solr side? Thanks. -- Prateek Sachan Indian Institute of Technology Delhi
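Since Solr (via Spatial4j/JTS) rejects self-intersecting rings, one option is to validate polygons on the client before sending them. A rough pre-check sketch in pure Python (detects proper edge crossings only; collinear or touching cases are ignored for brevity, and this is not what Spatial4j itself does):

```python
# Rough client-side pre-check for self-intersecting rings before sending
# a WKT POLYGON to Solr. Proper crossings only; not Spatial4j's algorithm.

def ccw(a, b, c):
    """Twice the signed area of triangle abc (orientation test)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return (d1 > 0) != (d2 > 0) and (d3 > 0) != (d4 > 0)

def self_intersects(ring):
    """Check every pair of non-adjacent edges of a closed ring."""
    n = len(ring)
    edges = [(ring[i], ring[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share a vertex
            if segments_cross(*edges[i], *edges[j]):
                return True
    return False

bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]  # edges cross at (1, 1)
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
```

For actually repairing a bad ring (rather than just rejecting it), a common approach outside Solr is a zero-width buffer in a geometry library such as JTS.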
Re: 8 Shards of Cloud with 4.10.3.
On Tue, Feb 24, 2015 at 1:30 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Benson: Are you trying to run independent invocations of Solr for every node? Otherwise, you'd just want to create a 8 shard collection with maxShardsPerNode set to 8 (or more I guess). Michael Della Bitta, I don't want to run multiple invocations. I just want to exploit hardware cores with shards. Can you point me at doc for the process you are referencing here? I confess to some ongoing confusion between cores and collections. --benson Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com wrote: With so much of the site shifted to 5.0, I'm having a bit of trouble finding what I need, and so I'm hoping that someone can give me a push in the right direction. On a big multi-core machine, I want to set up a configuration with 8 (or perhaps more) nodes treated as shards. I have some very particular solrconfig.xml and schema.xml that I need to use. Could some kind person point me at a relatively step-by-step layout? This is all on Linux, I'm happy to explicitly run Zookeeper.
Re: 8 Shards of Cloud with 4.10.3.
On Tue, Feb 24, 2015 at 3:32 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: https://cwiki.apache.org/confluence/display/solr/SolrCloud Unfortunately, this is all 5.1 and instructs me to run the 'start from scratch' process. I wish that I could take my existing one-core no-cloud config and convert it into a cloud, 8-shard config.
Re: snapinstaller does not start newSearcher
Do you mean the snapinstaller (bash) script? Those are legacy scripts. It's been a long time since they were tested. The ReplicationHandler is the recommended way to setup replication. If you want to take a snapshot then the replication handler has an HTTP based API which lets you do that. In any case, do you have the full stack trace for that exception? There should be another cause nested under it. On Tue, Feb 24, 2015 at 12:47 PM, alx...@aim.com wrote: Hello, I am using latest solr (solr trunk) . I run snapinstaller, and see that it copies snapshot to index folder but changes are not picked up and logs in slave after running snapinstaller are 44302 [qtp1312571113-14] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} 44303 [qtp1312571113-14] INFO org.apache.solr.update.UpdateHandler – No uncommitted changes. Skipping IW.commit. 44304 [qtp1312571113-14] INFO org.apache.solr.core.SolrCore – SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher 44305 [qtp1312571113-14] INFO org.apache.solr.update.UpdateHandler – end_commit_flush 44305 [qtp1312571113-14] INFO org.apache.solr.update.processor.LogUpdateProcessor – [product] webapp=/solr path=/update params={} {commit=} 0 57 Restarting solr gives Error creating core [product]: Error opening new searcher org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.init(SolrCore.java:873) at org.apache.solr.core.SolrCore.init(SolrCore.java:646) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677) at org.apache.solr.core.SolrCore.init(SolrCore.java:845) ... 9 more Any idea what causes this issue. Thanks in advance. Alex. -- Regards, Shalin Shekhar Mangar.
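For reference, the HTTP snapshot API mentioned above is the ReplicationHandler's backup command. Assuming the handler is registered in solrconfig.xml and the core is named product (names here are examples from the thread), a request looks roughly like:

```
http://localhost:8983/solr/product/replication?command=backup&location=/path/to/backups&numberToKeep=2
```

Both location and numberToKeep are optional parameters of the backup command.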
Re: Geo Aggregations and Search Alerts in Solr
Hi Charlie, Thanks a lot for your response On Tue, Feb 24, 2015 at 5:08 PM, Charlie Hull char...@flax.co.uk wrote: On 24/02/2015 03:03, Richard Gibbs wrote: Hi There, I am in the process of choosing a search technology for one of my projects and I was looking into Solr and Elasticsearch. Two features that I am most interested in are geo aggregations (for map clustering) and search alerts. Elasticsearch seems to have these two features built-in. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/geo-aggs.html http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html I couldn't find relevant documentation for Solr and therefore I am not sure whether these features are readily available in Solr. Can you please let me know whether these features are available in Solr? If not, whether there are solutions to achieve the same with Solr. Hi Richard, I don't know about geo aggregations, although I know the Heliosearch guys and others have been working on various facet statistics that may impinge on this. http://heliosearch.org/solr-facet-functions/ For alerting, you're talking about storing queries and running them against any new document to see if it matches. We do this a lot for clients needing large-scale media monitoring and auto-classification - here's the Lucene-based library we released: https://github.com/flaxsearch/luwak Note that this depends on a patched Lucene currently, but I'm very happy to say that a client is funding us to merge this back to trunk and we expect Luwak to be able to work with a 5.x release of Lucene. More news very soon! There are a couple of videos on that page that will explain further. We suspect our approach is considerably faster than the Percolator, and it's on the list to benchmark the two. Cheers Charlie Thank you. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Re: highlighting the boolean query
Erick, Our default operator is AND. Both queries below parse the same: a OR (b c) OR d a OR (b AND c) OR d The parsed query: str name=parsedquery_toStringContents:a (+Contents:b +Contents:c) Contents:d/str So this part is consistent with our expectation. I'm a bit puzzled by your statement that c didn't contribute to the score. What I meant was that the term c was not hit by the scorer: the explain section does not refer to it. I'm using made-up terms here, but the concept holds. The code suggests that we could benefit from storing term offsets and positions: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470 Is that a correct assumption? On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Highlighting is such a pain... what does the parsed query look like? If the default operator is OR, then this seems correct as both 'd' and 'c' appear in the doc. So I'm a bit puzzled by your statement that c didn't contribute to the score. If the parsed query is, indeed, a +b +c d then it does look like something with the highlighter. Whether other highlighters are better for this case... no clue ;( Best, Erick On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, nope, we are using the std lucene qparser with some customizations that do not affect the boolean query parsing logic. Should we try some other highlighter? On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com wrote: Are you using edismax? On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello! In solr 4.3.1 there seems to be some inconsistency with the highlighting of the boolean query: a OR (b c) OR d This returns a proper hit, which shows that only d was included in the document score calculation. But the highlighter returns both d and c in em tags. Is this a known issue of the standard highlighter? Can it be mitigated? 
-- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
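On the question of storing term offsets and positions: with term vectors plus positions and offsets in place, Solr can use the FastVectorHighlighter, which highlights from the stored vectors instead of re-analyzing stored text and tends to track the scorer's hits more closely. A possible schema change, as a sketch (the field name comes from the thread; the field type name is an assumption):

```xml
<field name="Contents" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

With these attributes set (and the index rebuilt), hl.useFastVectorHighlighter=true selects the FastVectorHighlighter at query time.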
Re: Setting Up an External ZooKeeper Ensemble
Looks like the ZooKeeper server is either not running or not accepting connections, possibly because of some configuration issue. Can you look into the ZooKeeper logs and see if there are any exceptions? On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu chaitu...@gmail.com wrote: Hi, I did follow all the steps in [ https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble ] but still I am getting this error: Waiting to see Solr listening on port 8983 [-] Still not seeing Solr listening on 8983 after 30 seconds! WARN - 2015-02-24 05:50:19.161; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) WARN - 2015-02-24 05:50:20.262; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect Where am I going wrong? -- ckreddybh. chaitu...@gmail.com -- Regards, Shalin Shekhar Mangar.
Solr query performace
Dear all, Hi, I was wondering whether there is any performance comparison available for different Solr queries. I mean: what is the cost of different Solr queries from memory and CPU points of view? I am looking for a report that could help me in the case of having different alternatives for sending a single query to Solr. Thank you very much. Best regards. -- A.Nazemian
Re: Geo Aggregations and Search Alerts in Solr
On 24/02/2015 03:03, Richard Gibbs wrote: Hi There, I am in the process of choosing a search technology for one of my projects and I was looking into Solr and Elasticsearch. Two features that I am most interested in are geo aggregations (for map clustering) and search alerts. Elasticsearch seems to have these two features built-in. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/geo-aggs.html http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html I couldn't find relevant documentation for Solr and therefore I am not sure whether these features are readily available in Solr. Can you please let me know whether these features are available in Solr? If not, whether there are solutions to achieve the same with Solr. Hi Richard, I don't know about geo aggregations, although I know the Heliosearch guys and others have been working on various facet statistics that may impinge on this. http://heliosearch.org/solr-facet-functions/ For alerting, you're talking about storing queries and running them against any new document to see if it matches. We do this a lot for clients needing large-scale media monitoring and auto-classification - here's the Lucene-based library we released: https://github.com/flaxsearch/luwak Note that this depends on a patched Lucene currently, but I'm very happy to say that a client is funding us to merge this back to trunk and we expect Luwak to be able to work with a 5.x release of Lucene. More news very soon! There are a couple of videos on that page that will explain further. We suspect our approach is considerably faster than the Percolator, and it's on the list to benchmark the two. Cheers Charlie Thank you. -- Charlie Hull Flax - Open Source Enterprise Search tel/fax: +44 (0)8700 118334 mobile: +44 (0)7767 825828 web: www.flax.co.uk
Integration Tests with SOLR 5
Hi, I noticed that not only does SOLR not deliver a WAR file anymore, but it also advises against providing a custom WAR file for deployment, as future versions may depend on custom jetty features. Until 4.10 we were able to provide a WAR file with all the plug-ins we need for easier installs. The same WAR file was used together with a web application WAR running integration tests, to check if all application details still work. We used the cargo-maven2-plugin and different servlet containers for testing. I think this is quite a common thing to do with continuous integration. Now I wonder if anyone has a similar setup with integration tests running against SOLR 5. - No artifacts can be used, so no local repository cache is present - How do you deploy your schema.xml, stopwords, solr plug-ins etc. for testing in an isolated environment? - What does the maven boilerplate code look like? Any ideas would be appreciated. Kind regards, Thomas
Re: Is Solr best for did you mean functionality just like Google?
Solr is an IR system where spell correction is a topping, whereas Google has a team dedicated just to spell corrections. Did you mean (a more general term and much broader than basic spell correctors) or spell correctors require a plethora of skills. I will just discuss spell correctors here and not go into Did you mean. To start with: 1) Edit distances (Example: in misspelt 'cax', if 'x' is replaced by 'r' or 't' it becomes car and cat respectively, which can be probable candidates for your misspelt word, and since both are at an edit distance of 1 you can select the one that occurs more often in your solr index; however you will have to handle the cases where the misspelt word is already present in your index. Say you have the misspelt token 'cax' occurring 100 times in your index.) A good spell corrector requires a lot of features on top of the above: 2) Phonetics (sounds of words/metaphone etc.). 3) If you have natural language queries like The cax ran out of the house, here cat would be a much more suitable spelling correction for cax than car. 4) Language models play an important role. Think: what is the probability of getting an 'm' after 'e', and how does it compare with getting a 'z' after 'e'? 5) Your search/http etc. logs will be a good source to improve the spell corrector. 6) And you can list several others. You can build a physics-based model by taking into account the above features for recommending the best. However, rather than working hard doing the above there is always a smarter way out :), one example of that can be looking at terms in your solr index: the ones occurring the fewest times can be analyzed for spelling errors. Cheers, Yavar On Mon, Feb 23, 2015 at 9:53 PM, Nitin Solanki nitinml...@gmail.com wrote: Hello, I came in the worst condition. I want to do spell/query correction functionality. I have 49 GB indexed data where I have applied spellchecker. I want to do the same as Google - *did you mean*. 
*Example* - If any user types any question/query which might be misspelled or mistyped, I need to give them a suggestion like Did you mean. Is Solr best for it? Warm Regards, Nitin Solanki
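Point (1) above, candidates at minimal edit distance with ties broken by corpus frequency, can be sketched in a few lines (illustrative only; the vocabulary and frequencies are made up, and a real system such as Lucene's DirectSpellChecker uses far more efficient term enumeration than scanning the whole vocabulary):

```python
# Sketch of the edit-distance + frequency idea from the thread.
# Vocabulary/frequencies are hypothetical stand-ins for index term stats.

def edit_distance(a, b):
    """Classic Levenshtein DP: insert/delete/substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def best_correction(word, vocab_freq):
    """Minimal edit distance first, then highest index frequency."""
    return min(vocab_freq, key=lambda w: (edit_distance(word, w), -vocab_freq[w]))
```

For the thread's example, best_correction("cax", {"car": 50, "cat": 120}) picks "cat": both candidates are at distance 1, but cat occurs more often.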
Regarding behavior of docValues.
Hi, Kindly help me understand the behavior of the following field. field name=manu_exact type=string indexed=true stored=false docValues=true / For a field like the above, where indexed=true and docValues=true, is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar
Re: Regarding behavior of docValues.
Thanks for your response Mikhail. On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Both statements seem true to me. On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote: Hi, Kindly help me understand the behavior of following field. field name=manu_exact type=string indexed=true stored=false docValues=true / For a field like above where indexed=true and docValues=true, is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: How to achieve lemmatization for english words in Solr 4.10.2
If you have an English dictionary available containing words with their lemmas, you may use my patch: https://issues.apache.org/jira/browse/LUCENE-6254 This lemmatizer works with Danish, German and Norwegian dictionaries which are available for free. I'm not sure there exists a free English dictionary which is compatible, but please inform me in case there is one. There are probably useful English dictionaries for purchase. Erlend On 18.02.15 16:50, dinesh naik wrote: Hi, Is there a way to achieve lemmatization in Solr? The stemming option is not meeting the requirement.
performance issues with geofilt
Hello, we are using solr 4.10.1. There are two cores for different use cases with around 20 million documents (location descriptions) per core. Each document has a geometry field which stores a point and a bbox field which stores a bounding box. Both fields are defined with: fieldType name=t_geometry class=solr.SpatialRecursivePrefixTreeFieldType spatialContextFactory=com.spatial4j.core.context.jts.JtsSpatialContextFactory geo=true distErrPct=0.025 maxDistErr=0.9 units=degrees / I'm currently trying to add a location search (find all documents around a point). My intention is to add this as a filter query, so that the user is able to do an additional keyword search. These are the query parameters so far: q=*:*&fq=typ:strasse&fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0} To sort the documents by their distance to the requested point, I added the following sort parameter: sort={!geofilt sort=distance sfield: geometry pt=51.370570625523,12.369290471603 d=1.0} asc Unfortunately I'm experiencing some major performance/memory problems here. The first distance query on a core takes over 10 seconds. In my first setup the same request to the second core completely blocked the server and caused an OutOfMemoryError. I had to increase the memory to 16 GB and now it seems to work for the geometry field. Anyhow, the first request after a server restart takes some time, and when I try it with the bbox field after a request on the geometry field in both cores, the server blocks again. Can anyone explain why the distance needs so much memory? Can this be optimized? Kind regards, Dirk
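One thing to note: {!geofilt} inside sort is not a sort function. For an RPT field, the 4.x ref guide's pattern is to sort by the distance score of a geofilt query instead. A sketch, with the parameter values copied from the thread (treat as an assumption about the intended setup, not a verified fix):

```
q=*:*
fq=typ:strasse
fq={!geofilt sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}
distq={!geofilt score=distance filter=false sfield=geometry pt=51.370570625523,12.369290471603 d=1.0}
sort=query($distq) asc
```

Distance sorting on a 4.x RPT field loads point data into heap-resident structures on first use, which is one plausible source of both the slow first query and the memory pressure; warming queries registered in a newSearcher listener are the usual mitigation for the first-request cost.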
Re: Setting Up an External ZooKeeper Ensemble
yes chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh start JMX enabled by default Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg Starting zookeeper ... STARTED chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh status JMX enabled by default Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg Error contacting service. It is probably not running but I don't get why it is not running On Tue, Feb 24, 2015 at 1:45 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Looks like the ZooKeeper server is either not running or not accepting connection possibly because of some configuration issues. Can you look into the ZooKeeper logs and see if there are any exceptions? On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu chaitu...@gmail.com wrote: Hi, I did follow all the steps in [ https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble ] but still I am getting this error bWaiting to see Solr listening on port 8983 [-] Still not seeing Solr listening on 8983 after 30 seconds!/b WARN - 2015-02-24 05:50:19.161; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) WARN - 2015-02-24 05:50:20.262; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect Where am I going wrong? -- ckreddybh. chaitu...@gmail.com -- Regards, Shalin Shekhar Mangar. -- ckreddybh. chaitu...@gmail.com
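When zkServer.sh reports STARTED but status then says "Error contacting service", the server process often died right after launch; zookeeper.out (in the directory you started from) usually has the reason. A frequent cause is a missing or wrong dataDir/myid setup in conf/zoo.cfg. A minimal sketch of the config (all paths and hostnames are examples, not values from this thread):

```
# conf/zoo.cfg - minimal standalone settings (example values)
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# For a three-node ensemble, also add the timing limits and server list,
# and write each node's id (1, 2, 3) into <dataDir>/myid on that node:
# initLimit=5
# syncLimit=2
# server.1=zk1:2888:3888
# server.2=zk2:2888:3888
# server.3=zk3:2888:3888
```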
Re: Regarding behavior of docValues.
Both statements seem true to me. On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote: Hi, Kindly help me understand the behavior of following field. field name=manu_exact type=string indexed=true stored=false docValues=true / For a field like above where indexed=true and docValues=true, is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Override freq field from custom field in Suggestions
Hello, I have a scenario where I want to use my own custom field instead of freq in the suggestions for each term. The custom field will be an integer value with a different value than freq in the suggestion. Is it possible in Solr to use a custom field instead of freq in suggestions? Your help is appreciated. Thanks and Regards, Nitin Solanki.
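If you are on a Solr version with the SuggestComponent (4.7+), the DocumentDictionaryFactory supports a weightField that ranks suggestions by a stored numeric field instead of term frequency. A hedged solrconfig.xml sketch (suggester, field, and type names here are examples, not from the thread):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <!-- weightField replaces term frequency as the ranking weight -->
    <str name="weightField">popularity</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```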
Re: Sorting on multi-valued field
Hi Peri, You cannot sort on a multi-valued field; multiValued should be set to false for sorting to work. On Tue, Feb 24, 2015 at 8:07 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: All, Is there a way sorting can work on a multi-valued field, or does it always have to be “false” for it to work? Thanks -Peri *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global Services to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose.
Re: [ANNOUNCE] Luke 4.10.3 released
Hi Dmitry, Thank you for the detailed clarification! Recently, I've created a few patches for the Pivot version (LUCENE-2562), so I'd like to do some more work and keep it up to date. If you would like to work on the Pivot version, may I suggest you fork the github version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Yes, I love the idea of having a common code base. I've looked at the code of both the github (thinlet) and Pivot versions; the Pivot version has a very different structure from the github one (I think that is mainly due to the UI framework's requirements). So it seems difficult to directly fork the github version to develop the Pivot version..., but I think I (or any other developers) could catch up with changes in the github version. There's a long way to go for the Pivot version; of course, I'd also like to make pull requests to enhance the github version if I can. Thanks, Tomoko 2015-02-24 23:34 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hi, Tomoko! Thanks for being a fan of luke! The current status of github's luke (https://github.com/DmitryKey/luke) is that it has releases for all the major lucene versions since 4.3.0, excluding 4.4.0 (luke 4.5.0 should be able to open indices of 4.4.0) and the latest -- 5.0.0. Porting the github luke to an ALv2-compliant framework (GWT or Pivot) is a long-standing goal. With GWT I had issues related to listing and reading the index directory, so this effort has been parked. Most recently I have been approaching Pivot. Mark Miller has done an initial port, which I took as the basis. I'm hoping to continue on this track as time permits. If you would like to work on the Pivot version, may I suggest you fork the github version? The ultimate goal is to donate this to Apache, but at least we will have the common plate. :) Thanks, Dmitry On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi, I'm a user / fan of Luke, so I deeply appreciate your work. 
I've carefully read the readme and noticed one of the project's goals: To port the thinlet UI to an ASL compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. Has there been a GWT-based, ASL-compliant Luke supporting the latest Lucene? I've recently gotten involved with LUCENE-2562. Currently, an Apache Pivot-based port is in progress. But I do not know so much about Luke's long (and maybe slightly complex) history, so I would be grateful if anybody could clarify the association between the Luke project (now on Github) and the Jira issue. Or, they can be independent of each other. https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, just want to understand the current status and avoid duplicate work. Apologies for a bit of an annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released. Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against a solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Re: Sorting on multi-valued field
FWIW, there's an open Solr JIRA: https://issues.apache.org/jira/browse/SOLR-2522 - please vote. However, everything seems done at the Lucene level: https://issues.apache.org/jira/browse/LUCENE-5454 On Tue, Feb 24, 2015 at 6:11 PM, Nitin Solanki nitinml...@gmail.com wrote: Hi Peri, You cannot do sort on multi-valued field. It should be set to false. On Tue, Feb 24, 2015 at 8:07 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: All, Is there a way sorting can work on a multi-valued field or does it always have to be “false” for it to work. Thanks -Peri -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: snapinstaller does not start newSearcher
Hello, We cannot use replication with the current architecture, so decided to use snapshotter with snapinstaller. Here is the full stack trace 8937 [coreLoadExecutor-5-thread-3] INFO org.apache.solr.core.CachingDirectoryFactory – Closing directory: /home/solr/solr-4.10.1/solr/example/solr/product/data 8938 [coreLoadExecutor-5-thread-3] ERROR org.apache.solr.core.CoreContainer – Error creating core [product]: Error opening new searcher org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.init(SolrCore.java:873) at org.apache.solr.core.SolrCore.init(SolrCore.java:646) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:491) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:255) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:249) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1677) at org.apache.solr.core.SolrCore.init(SolrCore.java:845) ... 
9 more Caused by: java.nio.file.NoSuchFileException: /home/solr/solr-4.10.1/solr/example/solr/product/data/index/segments_4 at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:176) at java.nio.channels.FileChannel.open(FileChannel.java:287) at java.nio.channels.FileChannel.open(FileChannel.java:334) at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196) at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:198) at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:113) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:341) at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:792) at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:77) at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64) at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:279) at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:111) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1528) ... 11 more 8943 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – user.dir=/home/solr/solr-4.10.1/solr/example 8943 [main] INFO org.apache.solr.servlet.SolrDispatchFilter – SolrDispatchFilter.init() done 8982 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started SocketConnector@0.0.0.0:8983 Thanks. Alex. 
-----Original Message-----
From: Shalin Shekhar Mangar shalinman...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Feb 24, 2015 12:13 am
Subject: Re: snapinstaller does not start newSearcher

Do you mean the snapinstaller (bash) script? Those are legacy scripts; it's been a long time since they were tested. The ReplicationHandler is the recommended way to set up replication. If you want to take a snapshot, the replication handler has an HTTP-based API which lets you do that. In any case, do you have the full stack trace for that exception? There should be another cause nested under it.

On Tue, Feb 24, 2015 at 12:47 PM, alx...@aim.com wrote: Hello, I am using the latest solr (solr trunk). I run snapinstaller and see that it copies the snapshot to the index folder, but the changes are not picked up, and the logs in the slave after running snapinstaller are:
44302 [qtp1312571113-14] INFO org.apache.solr.update.UpdateHandler – start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
44303 [qtp1312571113-14] INFO org.apache.solr.update.UpdateHandler – No uncommitted changes. Skipping IW.commit.
44304 [qtp1312571113-14] INFO org.apache.solr.core.SolrCore – SolrIndexSearcher has not
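For reference, the HTTP snapshot API Shalin mentions is the replication handler's backup command; a sketch of what the request might look like (the host, port, and core name here are illustrative, not taken from the poster's setup):

```
http://localhost:8983/solr/product/replication?command=backup
```

This asks the ReplicationHandler to write a snapshot of the current index, which avoids the legacy snapshooter/snapinstaller shell scripts entirely.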
Re: Basic Multilingual search capability
Given the limited needs, I would probably do something like this:

1) Put a language identifier in the UpdateRequestProcessor chain during indexing and route at least the known problematic languages, such as Chinese, Japanese, Arabic, into individual fields.
2) Put everything else together into one field with ICUTokenizer, maybe also ICUFoldingFilter.
3) At the very end of that joint filter chain, stick in a LengthFilter with some high number, e.g. 25 characters max. This will ensure that super-long words from non-space languages and edge conditions do not break the rest of your system.

Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 23 February 2015 at 23:14, Walter Underwood wun...@wunderwood.org wrote: I understand relevancy, stemming etc becomes extremely complicated with multilingual support, but our first goal is to be able to tokenize and provide basic search capability for any language. Ex: When the document contains hello or здравствуйте, the analyzer creates tokens and provides exact match search results.
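A field type along these lines might look like the following sketch (illustrative only: the type name is made up, the 25-character cap follows the suggestion above, and the ICU factories require the analysis-extras contrib jars on the classpath):

```xml
<!-- Sketch of a catch-all multilingual field type; "text_multi" is a made-up name. -->
<fieldType name="text_multi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode-aware tokenization across scripts -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Case folding plus accent folding -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Drop pathological super-long tokens from non-space languages -->
    <filter class="solr.LengthFilterFactory" min="1" max="25"/>
  </analyzer>
</fieldType>
```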
Re: Integration Tests with SOLR 5
Hi Thomas, I just downloaded the solr 5.0.0 tgz and found this in the directory structure:

solr-5.0.0/server/webapps$ ls
solr.war

- How to deploy your schema.xml, stopwords, solr plug-ins etc. for testing in an isolated environment

The cores, for example, are created in:

solr-5.0.0/server/solr/core0$ ls
conf  core.properties  data

Same native solr directory layout for the core. If you need custom libraries (plugins etc.), put them into the lib directory. The conf/ directory is what you are used to from before: it contains the schema.xml, solrconfig.xml etc. HTH, Dmitry

On Tue, Feb 24, 2015 at 10:16 AM, Thomas Scheffler thomas.scheff...@uni-jena.de wrote: Hi, I noticed that not only does SOLR not deliver a WAR file anymore, it also advises against providing a custom WAR file for deployment, as future versions may depend on custom jetty features. Until 4.10 we were able to provide a WAR file with all the plug-ins we need for easier installs. The same WAR file was used together with a web application WAR running integration tests to check that all application details still work. We used the cargo-maven2-plugin and different servlet containers for testing. I think this is quite a common thing to do with continuous integration. Now I wonder if anyone has a similar setup with integration tests running against SOLR 5. - No artifacts can be used, so no local repository cache is present - How to deploy your schema.xml, stopwords, solr plug-ins etc. for testing in an isolated environment - What does a maven boilerplate code look like? Any ideas would be appreciated. Kind regards, Thomas

-- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Re: Integration Tests with SOLR 5
On 2/24/2015 1:16 AM, Thomas Scheffler wrote: I noticed that not only does SOLR not deliver a WAR file anymore, it also advises against providing a custom WAR file for deployment, as future versions may depend on custom jetty features. Until 4.10 we were able to provide a WAR file with all the plug-ins we need for easier installs. The same WAR file was used together with a web application WAR running integration tests to check that all application details still work. We used the cargo-maven2-plugin and different servlet containers for testing. I think this is quite a common thing to do with continuous integration. Now I wonder if anyone has a similar setup with integration tests running against SOLR 5. - No artifacts can be used, so no local repository cache is present - How to deploy your schema.xml, stopwords, solr plug-ins etc. for testing in an isolated environment - What does a maven boilerplate code look like?

I don't know anything at all about Maven. For now, Solr 5.x *is* still deployed as a .war file, which you can find in the download in the server/webapps directory ... but the plan is to eventually create a standalone application for Solr instead of running Jetty. Those plans are expected to happen during the 5.x timeframe, which is why the documentation advises against relying on the .war file. It eventually *will* disappear, and it's important to prepare the userbase for that well in advance of the actual implementation. I have a custom install based on the jetty included with Solr 4.x. When I upgrade, I will continue to use the war as long as it is available, and when the standalone app appears, I will either reconfigure my init script or see about switching over to the script included with the 5.x download. Thanks, Shawn
Re: apache solr - dovecot - some search fields works some dont
What specifically do you mean by stall? Very slow but comes back? Never comes back? Throws an error? What is your field definition for body? How big is the content in it? Do you change the fields returned if you search body and if you search just headers? How many rows do you request back? One hypothesis: You are storing (stored=true) your body, it is very large and the stall happens not during search but during reading very large amount of text from disk to reconstitute the body to send it back. Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 24 February 2015 at 02:06, Kevin Laurie superinterstel...@gmail.com wrote: For example if I were to search To and From apache solr would process it in its log and give me an output, however if I were to search something in the Body it would stall and no output.
How make Searching fast in spell checking
Hello all, I have 49 GB of indexed data and am doing spell checking. I have applied a ShingleFilter on both the index and query side, and I take 25 suggestions for each word in the query; I am not using collations. When I check a phrase of 5-6 words (e.g. "barack obama is president of America") it takes 2 to 3 seconds to process, while checking a single term (e.g. "barack") takes only 0.23 seconds, which is good. Why does phrase checking take so much longer? Am I doing something wrong? Any help on this?
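For context, a shingle-based spellcheck analyzer of the kind described typically looks something like this (a sketch: the type name and shingle sizes are illustrative, not the poster's actual configuration):

```xml
<fieldType name="text_spell_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit word n-grams alongside the single words -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Note that with 25 suggestions requested per term, a 5-6 word phrase fans out into many more dictionary lookups than a single term, which likely accounts for much of the 2-3 s vs 0.23 s gap.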
Re: [ANNOUNCE] Luke 4.10.3 released
Hi, Tomoko! Thanks for being a fan of luke! The current status of github's luke (https://github.com/DmitryKey/luke) is that it has releases for all the major lucene versions since 4.3.0, excluding 4.4.0 (luke 4.5.0 should be able to open indices of 4.4.0) and the latest -- 5.0.0. Porting github's luke to an ALv2-compliant framework (GWT or Pivot) is a long-standing goal. With GWT I had issues related to listing and reading the index directory, so that effort has been parked. Most recently I have been looking at Pivot. Mark Miller has done an initial port, which I took as the basis. I'm hoping to continue on this track as time permits. If you would like to work on the Pivot version, may I suggest you fork the github version? The ultimate goal is to donate this to Apache, but at least we will have a common base. :) Thanks, Dmitry On Tue, Feb 24, 2015 at 4:02 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi, I'm an user / fan of Luke, so deeply appreciate your work. I've carefully read the readme, noticed the (one of) project's goal: To port the thinlet UI to an ASL compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. There has been GWT based, ASL compliant Luke supporting the latest Lucene ? I've recently got in with LUCENE-2562. Currently, Apache Pivot based port is going. But I do not know so much about Luke's long (and may be slightly complex) history, so I would grateful if anybody clear the association of the Luke project (now on Github) and the Jira issue. Or, they can be independent of each other. https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, just want to understand current status and avoid duplicate works. Apologize for a bit annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released.
Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against the solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Sorting on multi-valued field
All, Is there a way sorting can work on a multi-valued field, or does multiValued always have to be "false" for sorting to work? Thanks -Peri *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global Services to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose.
Re: [ANNOUNCE] Luke 4.10.3 released
Hi, I'm a user and fan of Luke, so I deeply appreciate your work. I've carefully read the readme and noticed one of the project's goals: to port the thinlet UI to an ASL-compliant license framework so that it can be contributed back to Apache Lucene. Current work is done with GWT 2.5.1. Has there been a GWT-based, ASL-compliant Luke supporting the latest Lucene? I've recently gotten involved with LUCENE-2562. Currently, an Apache Pivot based port is underway. But I do not know much about Luke's long (and maybe slightly complex) history, so I would be grateful if anybody could clarify the relationship between the Luke project (now on Github) and the Jira issue. Or they may be independent of each other. https://issues.apache.org/jira/browse/LUCENE-2562 I don't have any opinions, I just want to understand the current status and avoid duplicate work. Apologies for a somewhat annoying post. Many thanks, Tomoko 2015-02-24 0:00 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hello, Luke 4.10.3 has been released. Download it here: https://github.com/DmitryKey/luke/releases/tag/luke-4.10.3 The release has been tested against the solr-4.10.3 based index. Issues fixed in this release: #13 https://github.com/DmitryKey/luke/pull/13 Apache License 2.0 abbreviation changed from ASL 2.0 to ALv2 Thanks to respective contributors! P.S. waiting for lucene 5.0 artifacts to hit public maven repositories for the next major release of luke. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Dynamic boosting on a document for Solr4.10.2
Hi, We are looking for an option to boost a document while indexing based on the values of certain fields. For example: say we have 10 documents with fields such as name, acc no, status, age, address etc. For documents with status 'Active' we want to boost by a value of 1000, and if status is 'Closed' we want a negative boost, say -100. Also, if age is between 20 and 50 we want to boost by 2000, etc. Please let us know how we can achieve this. -- Best Regards, Dinesh Naik
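One common alternative to a true index-time boost is query-time function boosting. A hedged sketch of what the request parameters might look like (the field names, terms, and weights here are illustrative; termfreq() assumes the status field is indexed):

```
defType=edismax
bf=if(termfreq(status,'Active'),1000,0)
bf=if(termfreq(status,'Closed'),-100,0)
bf=map(age,20,50,2000,0)
```

Each bf result is added to the document score; map(age,20,50,2000,0) contributes 2000 when age falls in [20,50] and 0 otherwise. Query-time boosting has the advantage that the weights can be tuned without reindexing.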
Re: Sorting on multi-valued field
The usual strategy is to have an UpdateRequestProcessor chain that will copy the field and keep only one value from it, specifically for sorting. There is a whole collection of URPs to help you choose which value to keep, as well as how to provide a default. You can see the full list at: http://www.solr-start.com/info/update-request-processors/#FieldValueSubsetUpdateProcessorFactory Also, if you are on a recent Solr, consider enabling docValues on that target single-valued field; it's better for sorting. You can have the other flags (stored, indexed) set to false, as you will not be using the field for anything else. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 February 2015 at 09:37, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: All, Is there a way sorting can work on a multi-valued field or does it always have to be “false” for it to work. Thanks -Peri
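The chain Alex describes might be sketched in solrconfig.xml roughly as follows (illustrative: the chain and field names are made up, and FirstFieldValueUpdateProcessorFactory is just one of the subset URPs you could pick):

```xml
<!-- Hypothetical chain: clone multi-valued "authors" into "author_sort",
     then keep only the first value so the target field is single-valued. -->
<updateRequestProcessorChain name="author-sort-copy">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">authors</str>
    <str name="dest">author_sort</str>
  </processor>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">author_sort</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The target field would then be declared single-valued with docValues="true" (and stored/indexed false) in schema.xml, as suggested above.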
Re: Special character and wildcard matching
Please post the info I requested - the exact query, and the Solr response. -- Jack Krupansky On Tue, Feb 24, 2015 at 12:45 PM, Arun Rangarajan arunrangara...@gmail.com wrote: In our case, the lower-casing is happening in a custom Java indexer code, via Java's String.toLowerCase() method. I used the analysis tool in Solr admin (with Jetty). I believe the raw bytes explain this. Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and beyoncé in file beyonce_with_spl_chars.JPG. Raw bytes for beyonce: [62 65 79 6f 6e 63 65] Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81] So when you look at the bytes, it seems to explain why beyonce* matches beyoncé. I tried your approach with a KeywordTokenizer followed by a LowerCaseFilter, but I see the same behavior. On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com wrote: But how is that lowercasing occurring? I mean, solr.StrField doesn't do that. Some containers default to automatically mapping accented characters, so that the accented e would then get indexed as a normal e, and then your wildcard would match it, and an accented e in a query would get mapped as well and then match the normal e in the index. What does your query response look like? This blog post explains that problem: http://bensch.be/tomcat-solr-and-special-characters Note that you could make your string field a text field with the keyword tokenizer and then filter it for lower case, such as when the user query might have a capital B. String field is most appropriate when the field really is 100% raw. -- Jack Krupansky On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan arunrangara...@gmail.com wrote: Yes, it is a string field and not a text field. <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> <field name="raw_name" type="string" indexed="true" stored="true"/> Lower-casing done to do case-insensitive matching.
On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Is it really a string field - as opposed to a text field? Show us the field and field type. Besides, if it really were a raw name, wouldn't that be a capital B? -- Jack Krupansky On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan arunrangara...@gmail.com wrote: I have a string field raw_name like this in my document: {raw_name: beyoncé} (Notice that the last character is a special character.) When I issue this wildcard query: q=raw_name:beyonce* i.e. with the last character simply being the ASCII 'e', Solr returns me the above document. How do I prevent this?
Re: highlighting the boolean query
BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
  for (BooleanClause clause : clauses) {
    if (clause.isProhibited() == false) {
      clause.getQuery().extractTerms(terms);
    }
  }
}

That’s generally the method called by the Highlighter for what terms should be highlighted. So even if a term didn’t match the document, the query that the term was in matched the document, and it just blindly highlights all the terms (minus prohibited ones). That at least explains the behavior you’re seeing, but it’s not ideal. I’ve seen specialized highlighters that convert to spans, which are accurate to the exact matches within the document. It's been a while since I dug into the HighlightComponent, so maybe there are some other options available out of the box? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com

On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, Our default operator is AND. Both queries below parse the same: a OR (b c) OR d a OR (b AND c) OR d The parsed query: <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c) Contents:d</str> So this part is consistent with our expectation. "I'm a bit puzzled by your statement that c didn't contribute to the score" - what I meant was that the term c was not hit by the scorer: the explain section does not refer to it. I'm using made-up terms here, but the concept holds. The code suggests that we could benefit from storing term offsets and positions: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470 Is that a correct assumption? On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Highlighting is such a pain... what does the parsed query look like? If the default operator is OR, then this seems correct as both 'd' and 'c' appear in the doc. So I'm a bit puzzled by your statement that c didn't contribute to the score.
If the parsed query is, indeed a +b +c d then it does look like something with the highlighter. Whether other highlighters are better for this case.. no clue ;( Best, Erick On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, nope, we are using std lucene qparser with some customizations, that do not affect the boolean query parsing logic. Should we try some other highlighter? On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com wrote: Are you using edismax? On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello! In solr 4.3.1 there seem to be some inconsistency with the highlighting of the boolean query: a OR (b c) OR d This returns a proper hit, which shows that only d was included into the document score calculation. But the highlighter returns both d and c in em tags. Is this a known issue of the standard highlighter? Can it be mitigated? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Re: apache solr - dovecot - some search fields works some dont
Dear Alex, Nothing comes back when I do a body search. It shows a searching process on the client but then it just stops and no result comes up. I am wondering if this is a schema-related problem. When I search a subject on the mail client I get output as below:

8025 [main] INFO org.eclipse.jetty.server.AbstractConnector – Started SocketConnector@0.0.0.0:8983
9001 [searcherExecutor-6-thread-1] INFO org.apache.solr.core.SolrCore – [collection1] Registered new searcher Searcher@7dfcb28[collection1] main{StandardDirectoryReader(segments_4g:789:nrt _6z(4.10.2):C16672 _44(4.10.2):C6996 _56(4.10.2):C3672 _64(4.10.2):C4000 _8y(4.10.2):C3143 _7v(4.10.2):C673 _7b(4.10.2):C830 _85(4.10.2):C3754 _7k(4.10.2):C3975 _8f(4.10.2):C1516 _7n(4.10.2):C67 _9a(4.10.2):C677 _8o(4.10.2):C38 _8v(4.10.2):C40 _9l(4.10.2):C2705 _8x(4.10.2):C43 _90(4.10.2):C16 _9b(4.10.2):C22 _9d(4.10.2):C44 _9f(4.10.2):C84 _9h(4.10.2):C83 _9i(4.10.2):C356 _9j(4.10.2):C84 _9k(4.10.2):C296 _9m(4.10.2):C83 _9n(4.10.2):C57)}
155092 [qtp433527567-13] INFO org.apache.solr.core.SolrCore – [collection1] webapp=/solr path=/select params={sort=uid+asc&fl=uid,score&q=subject:price&fq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.net&rows=107178} hits=1237 status=0 QTime=1918

The content is quite large, 27,000 emails. Could you advise what this problem could be, and how we correct and fix it? I might have the wrong schema installed, so the body search is not working. Could this be it? I might post this on the dovecot list to see if someone can answer. Kindly advise if you have any idea on this. P.S. How do I check the body definition? Thanks Kevin

On Tue, Feb 24, 2015 at 9:36 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: What specifically do you mean by stall? Very slow but comes back? Never comes back? Throws an error? What is your field definition for body? How big is the content in it? Do you change the fields returned if you search body and if you search just headers?
How many rows do you request back? One hypothesis: You are storing (stored=true) your body, it is very large and the stall happens not during search but during reading very large amount of text from disk to reconstitute the body to send it back. Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 24 February 2015 at 02:06, Kevin Laurie superinterstel...@gmail.com wrote: For example if I were to search To and From apache solr would process it in its log and give me an output, however if I were to search something in the Body it would stall and no output.
Re: apache solr - dovecot - some search fields works some dont
Look for the line like this in your log with the search matching the body. Maybe put a nonsense string and look for that. This should tell you what the Solr-side search looks like. The thing that worries me here is: rows=107178 - that's most probably what's blowing up Solr. You should be paging, not getting everything. And that number being like that, it may mean your client makes two requests, once to get the result count and once to get the rows themselves. It's the second request that is most probably blowing up. Once you get the request, you should be able to tell what fields are being searched and check those fields in schema.xml for field type and then field type's definition. Which is what I asked for in the previous email. Regards, Alex. On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com wrote: 155092 [qtp433527567-13] INFO org.apache.solr.core.SolrCore ? [collection1] webapp=/solr path=/select params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.netrows=107178} hits=1237 status=0 QTime=1918 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
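The paging Alex recommends looks like this on the request side (an illustrative sketch; the query and page size are made up, not taken from the poster's client):

```
/solr/collection1/select?q=body:price&rows=50&start=0    first page of 50
/solr/collection1/select?q=body:price&rows=50&start=50   second page of 50
```

Requesting rows=107178 forces Solr to materialize every matching document in one response, whereas stepping start forward by the page size fetches the same results in small, cheap chunks.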
Re: apache solr - dovecot - some search fields works some dont
Dear Alex, I checked the log. When searching the fields From , To, Subject. It records it When searching Body, there is no log showing. I am assuming it is a problem in the schema. Will post schema.xml output in next mail. On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Look for the line like this in your log with the search matching the body. Maybe put a nonsense string and look for that. This should tell you what the Solr-side search looks like. The thing that worries me here is: rows=107178 - that's most probably what's blowing up Solr. You should be paging, not getting everything. And that number being like that, it may mean your client makes two requests, once to get the result count and once to get the rows themselves. It's the second request that is most probably blowing up. Once you get the request, you should be able to tell what fields are being searched and check those fields in schema.xml for field type and then field type's definition. Which is what I asked for in the previous email. Regards, Alex. On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com wrote: 155092 [qtp433527567-13] INFO org.apache.solr.core.SolrCore ? [collection1] webapp=/solr path=/select params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser: u...@domain.netrows=107178} hits=1237 status=0 QTime=1918 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
Re: apache solr - dovecot - some search fields works some dont
Hi Alex, Sorry for such noobness question. But where does the schema file go in Solr? Is the directory below correct? /opt/solr/solr/collection1/data Correct? Thanks Kevin On Wed, Feb 25, 2015 at 1:21 AM, Kevin Laurie superinterstel...@gmail.com wrote: Dear Alex, I checked the log. When searching the fields From , To, Subject. It records it When searching Body, there is no log showing. I am assuming it is a problem in the schema. Will post schema.xml output in next mail. On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Look for the line like this in your log with the search matching the body. Maybe put a nonsense string and look for that. This should tell you what the Solr-side search looks like. The thing that worries me here is: rows=107178 - that's most probably what's blowing up Solr. You should be paging, not getting everything. And that number being like that, it may mean your client makes two requests, once to get the result count and once to get the rows themselves. It's the second request that is most probably blowing up. Once you get the request, you should be able to tell what fields are being searched and check those fields in schema.xml for field type and then field type's definition. Which is what I asked for in the previous email. Regards, Alex. On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com wrote: 155092 [qtp433527567-13] INFO org.apache.solr.core.SolrCore ? [collection1] webapp=/solr path=/select params={sort=uid+ascfl=uid,scoreq=subject:pricefq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.netrows=107178} hits=1237 status=0 QTime=1918 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
Re: highlighting the boolean query
Hmmm, not quite sure what to say. Offsets and positions help, particularly with FastVectorHighlighter, but the highlighting is usually re-analyzed anyway so it _shouldn't_ matter. But what I don't know about highlighting could fill volumes ;).. Sorry I can't be more help here. Erick On Tue, Feb 24, 2015 at 12:16 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, Our default operator is AND. Both queries below parse the same: a OR (b c) OR d a OR (b AND c) OR d The parsed query: str name=parsedquery_toStringContents:a (+Contents:b +Contents:c) Contents:d/str So this part is consistent with our expectation. I'm a bit puzzled by your statement that c didn't contribute to the score. what I meant was that the term c was not hit by the scorerer: the explain section does not refer to it. I'm using the made up terms here, but the concept holds. The code suggests that we could benefit from storing term offsets and positions: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470 Is it correct assumption? On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Highlighting is such a pain... what does the parsed query look like? If the default operator is OR, then this seems correct as both 'd' and 'c' appear in the doc. So I'm a bit puzzled by your statement that c didn't contribute to the score. If the parsed query is, indeed a +b +c d then it does look like something with the highlighter. Whether other highlighters are better for this case.. no clue ;( Best, Erick On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, nope, we are using std lucene qparser with some customizations, that do not affect the boolean query parsing logic. Should we try some other highlighter? On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com wrote: Are you using edismax? On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello! 
In solr 4.3.1 there seem to be some inconsistency with the highlighting of the boolean query: a OR (b c) OR d This returns a proper hit, which shows that only d was included into the document score calculation. But the highlighter returns both d and c in em tags. Is this a known issue of the standard highlighter? Can it be mitigated? -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Re: Query: no result returned if use AND OR operators
You're probably hitting different request handlers. From the fragment you posted, the one that returns 8 is going to the /browse handler (see solrconfig.xml). The admin UI goes to either /select or /query. These are configured totally differently in terms of what fields are searched etc. Attach debug=query to the URL and look in Velocity for the debug info, and you'll see that the parsed query is significantly different. At least that's my guess. Best, Erick

On Mon, Feb 23, 2015 at 11:01 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, My Solr is 4.10.2. When I use the web UI to run a simple query: 1+AND+2

1) From the log, I can see hits=8:
7629109 [qtp1702388274-16] INFO org.apache.solr.core.SolrCore – [infocast] webapp=/solr path=/clustering params={q=1+AND+2&wt=velocity&v.template=cluster_results} hits=8 status=0 QTime=21

2) However, the query page returns: 0 results found in 5 ms Page 0 of 0 0 results found. Page 0 of 0

3) If I use the Admin page to run the query, I can get 3 back:
{
  "responseHeader": {
    "status": 0,
    "QTime": 5,
    "params": {
      "indent": "true",
      "q": "\"1\" AND \"2\"",
      "_": "1424761089223",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      { "title": [ ….

Very strange to me, please help! Regards
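The kind of handler divergence Erick describes typically comes from solrconfig.xml defaults like the following (an illustrative sketch, not the poster's actual configuration):

```xml
<!-- /select: plain lucene parser, one default search field -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="df">text</str>
  </lst>
</requestHandler>

<!-- /browse: edismax across several fields, Velocity output -->
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^10 text</str>
    <str name="wt">velocity</str>
  </lst>
</requestHandler>
```

With defaults like these, q=1+AND+2 parses into entirely different queries on the two handlers, which is exactly what debug=query would reveal.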
Re: apache solr - dovecot - some search fields works some dont
Hi Alex, Below is where my schema is stored:

/opt/solr/solr/collection1/conf# File name: schema.xml

Below is the output for the body fields:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="uid" type="slong" indexed="true" stored="true" required="true"/>
<field name="box" type="string" indexed="true" stored="true" required="true"/>
<field name="user" type="string" indexed="true" stored="true" required="true"/>
<field name="hdr" type="text" indexed="true" stored="false"/>
<field name="body" type="text" indexed="true" stored="false"/>
<field name="from" type="text" indexed="true" stored="false"/>
<field name="to" type="text" indexed="true" stored="false"/>
<field name="cc" type="text" indexed="true" stored="false"/>
<field name="bcc" type="text" indexed="true" stored="false"/>
<field name="subject" type="text" indexed="true" stored="false"/>
</fields>

Anything you see that I should be concerned about?

On Wed, Feb 25, 2015 at 1:27 AM, Kevin Laurie superinterstel...@gmail.com wrote: Hi Alex, Sorry for such noobness question. But where does the schema file go in Solr? Is the directory below correct? /opt/solr/solr/collection1/data Correct? Thanks Kevin On Wed, Feb 25, 2015 at 1:21 AM, Kevin Laurie superinterstel...@gmail.com wrote: Dear Alex, I checked the log. When searching the fields From , To, Subject. It records it When searching Body, there is no log showing. I am assuming it is a problem in the schema. Will post schema.xml output in next mail. On Wed, Feb 25, 2015 at 1:09 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Look for the line like this in your log with the search matching the body. Maybe put a nonsense string and look for that. This should tell you what the Solr-side search looks like. The thing that worries me here is: rows=107178 - that's most probably what's blowing up Solr. You should be paging, not getting everything. And that number being like that, it may mean your client makes two requests, once to get the result count and once to get the rows themselves. It's the second request that is most probably blowing up.
Once you get the request, you should be able to tell what fields are being searched, and then check those fields in schema.xml for the field type and the field type's definition, which is what I asked for in the previous email. Regards, Alex. On 24 February 2015 at 11:55, Kevin Laurie superinterstel...@gmail.com wrote: 155092 [qtp433527567-13] INFO org.apache.solr.core.SolrCore - [collection1] webapp=/solr path=/select params={sort=uid+asc&fl=uid,score&q=subject:price&fq=%2Bbox:ac553604f7314b54e6233555fc1a+%2Buser:u...@domain.net&rows=107178} hits=1237 status=0 QTime=1918 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
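As a sketch of the paging Alex is suggesting (start/rows are standard Solr select parameters; the query and sort are taken from the log line above, the page size of 100 is just an illustration):

```
# page 1
/solr/select?q=subject:price&sort=uid+asc&fl=uid,score&start=0&rows=100
# page 2
/solr/select?q=subject:price&sort=uid+asc&fl=uid,score&start=100&rows=100
```

Each request then returns at most 100 rows instead of 107178, and the client walks `start` forward until `numFound` is exhausted.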
Re: Special character and wildcard matching
In our case, the lower-casing is happening in custom Java indexer code, via Java's String.toLowerCase() method. I used the analysis tool in the Solr admin (with Jetty). I believe the raw bytes explain this. Attached are the results for beyonce in file beyonce_no_spl_chars.JPG and beyoncé in file beyonce_with_spl_chars.JPG. Raw bytes for beyonce: [62 65 79 6f 6e 63 65] Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81] So when you look at the bytes, it seems to explain why beyonce* matches beyoncé. I tried your approach with a KeywordTokenizer followed by a LowerCaseFilter, but I see the same behavior. On Mon, Feb 23, 2015 at 5:16 PM, Jack Krupansky jack.krupan...@gmail.com wrote: But how is that lowercasing occurring? I mean, solr.StrField doesn't do that. Some containers default to automatically mapping accented characters, so that the accented e would then get indexed as a normal e, and then your wildcard would match it, and an accented e in a query would get mapped as well and then match the normal e in the index. What does your query response look like? This blog post explains that problem: http://bensch.be/tomcat-solr-and-special-characters Note that you could make your string field a text field with the keyword tokenizer and then filter it for lower case, such as when the user query might have a capital B. A string field is most appropriate when the field really is 100% raw. -- Jack Krupansky On Mon, Feb 23, 2015 at 7:37 PM, Arun Rangarajan arunrangara...@gmail.com wrote: Yes, it is a string field and not a text field. <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> <field name="raw_name" type="string" indexed="true" stored="true" /> Lower-casing is done for case-insensitive matching. On Mon, Feb 23, 2015 at 4:01 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Is it really a string field - as opposed to a text field? Show us the field and the field type. Besides, if it really were a raw name, wouldn't that be a capital B?
-- Jack Krupansky On Mon, Feb 23, 2015 at 6:52 PM, Arun Rangarajan arunrangara...@gmail.com wrote: I have a string field raw_name like this in my document: {raw_name: beyoncé} (Notice that the last character is a special character.) When I issue this wildcard query: q=raw_name:beyonce* i.e. with the last character simply being the ASCII 'e', Solr returns me the above document. How do I prevent this?
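For what it's worth, the byte dump above can be reproduced outside Solr. A small Python sketch (not the original Java indexer; purely illustrative) shows that the decomposed (NFD) form of "beyoncé" literally begins with the ASCII string "beyonce", which is why a prefix match succeeds, while the composed (NFC) form does not:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, matching the raw bytes
# from the analysis tool: 62 65 79 6f 6e 63 65 cc 81
nfd = "beyonce\u0301"
print(nfd.encode("utf-8").hex())   # 6265796f6e6365cc81

# The decomposed string really does start with plain ASCII "beyonce",
# so a codepoint-level prefix match on "beyonce" succeeds
print(nfd.startswith("beyonce"))   # True

# NFC composes e + accent into U+00E9; the ASCII prefix no longer matches
nfc = unicodedata.normalize("NFC", nfd)
print(nfc.encode("utf-8").hex())   # 6265796f6e63c3a9
print(nfc.startswith("beyonce"))   # False
```

If accent-folded matching were actually desired, a filter such as solr.ASCIIFoldingFilterFactory in the field's analysis chain would make it intentional; here the indexed term matches a plain-ASCII prefix simply by construction of its decomposed form.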
Re: Regarding behavior of docValues.
Hmmm, that's not my understanding. docValues are simply a different layout for storing the _indexed_ values that facilitates rapid loading of the field from disk, essentially putting the uninverted field value in a conveniently-loadable form. So AFAIK, the field is stored only once and used for all three: sorting, faceting and searching. Best, Erick On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com wrote: Thanks for your response Mikhail. On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Both statements seem true to me. On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote: Hi, Kindly help me understand the behavior of the following field: <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" /> For a field like the above, where indexed="true" and docValues="true", is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: rankquery usage bug?
Ticket filed, thanks! https://issues.apache.org/jira/browse/SOLR-7152 On Fri, Feb 20, 2015 at 9:29 PM, Joel Bernstein joels...@gmail.com wrote: Ryan, This looks like a good jira ticket to me. Joel Bernstein Search Engineer at Heliosearch On Fri, Feb 20, 2015 at 6:40 PM, Ryan Josal rjo...@gmail.com wrote: Hey guys, I put a rq in defaults but I can't figure out how to override it with no rankquery. Looks like one option might be checking for empty string before trying to use it in QueryComponent? I can work around it in the prep method of an earlier searchcomponent for now. Ryan
Re: 8 Shards of Cloud with 4.10.3.
Benson: Are you trying to run independent invocations of Solr for every node? Otherwise, you'd just want to create an 8-shard collection with maxShardsPerNode set to 8 (or more, I guess). Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Tue, Feb 24, 2015 at 1:27 PM, Benson Margulies bimargul...@gmail.com wrote: With so much of the site shifted to 5.0, I'm having a bit of trouble finding what I need, and so I'm hoping that someone can give me a push in the right direction. On a big multi-core machine, I want to set up a configuration with 8 (or perhaps more) nodes treated as shards. I have some very particular solrconfig.xml and schema.xml that I need to use. Could some kind person point me at a relatively step-by-step layout? This is all on Linux, I'm happy to explicitly run Zookeeper.
Re: highlighting the boolean query
There is also PostingsHighlighter -- I recommend it, if only for the performance improvement, which is substantial, but I'm not completely sure how it handles this issue. The one drawback I *am* aware of is that it is insensitive to positions (so words from phrases get highlighted even in isolation) -Mike On 02/24/2015 12:46 PM, Erik Hatcher wrote: BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
  for (BooleanClause clause : clauses) {
    if (clause.isProhibited() == false) {
      clause.getQuery().extractTerms(terms);
    }
  }
}

That’s generally the method called by the Highlighter for what terms should be highlighted. So even if a term didn’t match the document, the query that the term was in matched the document and it just blindly highlights all the terms (minus prohibited ones). That at least explains the behavior you’re seeing, but it’s not ideal. I’ve seen specialized highlighters that convert to spans, which are accurate to the exact matches within the document. Been a while since I dug into the HighlightComponent, so maybe there are some other options available out of the box? — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com http://www.lucidworks.com/ On Feb 24, 2015, at 3:16 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, Our default operator is AND. Both queries below parse the same: a OR (b c) OR d a OR (b AND c) OR d The parsed query: <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c) Contents:d</str> So this part is consistent with our expectation. I'm a bit puzzled by your statement that c didn't contribute to the score. What I meant was that the term c was not hit by the scorer: the explain section does not refer to it. I'm using made-up terms here, but the concept holds.
The code suggests that we could benefit from storing term offsets and positions: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470 Is it correct assumption? On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson erickerick...@gmail.com wrote: Highlighting is such a pain... what does the parsed query look like? If the default operator is OR, then this seems correct as both 'd' and 'c' appear in the doc. So I'm a bit puzzled by your statement that c didn't contribute to the score. If the parsed query is, indeed a +b +c d then it does look like something with the highlighter. Whether other highlighters are better for this case.. no clue ;( Best, Erick On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan solrexp...@gmail.com wrote: Erick, nope, we are using std lucene qparser with some customizations, that do not affect the boolean query parsing logic. Should we try some other highlighter? On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson erickerick...@gmail.com wrote: Are you using edismax? On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan solrexp...@gmail.com wrote: Hello! In solr 4.3.1 there seem to be some inconsistency with the highlighting of the boolean query: a OR (b c) OR d This returns a proper hit, which shows that only d was included into the document score calculation. But the highlighter returns both d and c in em tags. Is this a known issue of the standard highlighter? Can it be mitigated? 
-- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
8 Shards of Cloud with 4.10.3.
With so much of the site shifted to 5.0, I'm having a bit of trouble finding what I need, and so I'm hoping that someone can give me a push in the right direction. On a big multi-core machine, I want to set up a configuration with 8 (or perhaps more) nodes treated as shards. I have some very particular solrconfig.xml and schema.xml that I need to use. Could some kind person point me at a relatively step-by-step layout? This is all on Linux, I'm happy to explicitly run Zookeeper.
Re: apache solr - dovecot - some search fields works some dont
The field definition looks fine. It's not storing any content (stored="false") but it is indexed, so you should find the records but not see the body in them. Not seeing a log entry is more of a worry. Are you sure the request even made it to Solr? Can you see anything in Dovecot's logs? Or in Solr's access logs (actually Jetty/Tomcat's access logs, which may need to be enabled first)? At this point, you don't have enough information to fix anything. You need to understand what's different between a request against subject vs. a request against body. I would break the communication into three stages: 1) what Dovecot sent, 2) what Solr received, 3) what Solr sent back. I don't know your skill level or your system setup to advise specifically, but a network tracer (e.g. Wireshark) is good for 1), logs are good for 2), and using the query from 1) and manually running it against Solr is good for 3). Hope this helps, Alex. On 24 February 2015 at 12:35, Kevin Laurie superinterstel...@gmail.com wrote: <field name="body" type="text" indexed="true" stored="false" /> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
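A sketch of stage 3), manually replaying the search by hand (host and port assume a default single-node install; the sort and field list are from the earlier log line, and the nonsense query term is just the "put in a nonsense string" trick from this thread):

```
curl 'http://localhost:8983/solr/select?q=body:xyzzynonsense&sort=uid+asc&fl=uid,score&rows=10&wt=json'
```

If this returns a normal zero-hit response while the Dovecot-driven search produces nothing in the log, the problem is on the Dovecot-to-Solr side rather than in the schema.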
Re: Sorting on multi-valued field
How about creating two fields for the multi-valued field. First, grab the higher and lower values of the multi-valued field by using natural sort order. Then use the first field to store the highest order value. Use second field to store lowest order value. Both these fields are single valued. Now based on the sort order of the original field, override the sort field in the handler side before executing the query. Thanks Shyamsunder Sent from my iPhone On Feb 24, 2015, at 11:28 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: The usual strategy is to have an UpdateRequestProcessor chain that will copy the field and keep only one value from it, specifically for sort. There is a whole collection of URPs to help you choose which value to keep, as well as how to provide a default. You can see the full list at: http://www.solr-start.com/info/update-request-processors/#FieldValueSubsetUpdateProcessorFactory Also, if you are on the recent Solr, consider enabling docValues on that target single-value field, it's better for sorting. You can have other flags (stored,indexed) set to false, as you will not be using the field for anything else. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 February 2015 at 09:37, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: All, Is there a way sorting can work on a multi-valued field or does it always have to be “false” for it to work. Thanks -Peri *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global Services to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose.
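The URP approach described above can be sketched concretely. This is an illustrative chain only (the factory class names are real Solr processors; the field names price/price_min_sort are made up, and the target field would be declared single-valued with docValues in the schema):

```xml
<updateRequestProcessorChain name="sort-field">
  <!-- copy the multi-valued field into a dedicated sort field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">price</str>
    <str name="dest">price_min_sort</str>
  </processor>
  <!-- keep only the smallest value, making the copy single-valued -->
  <processor class="solr.MinFieldValueUpdateProcessorFactory">
    <str name="fieldName">price_min_sort</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

MaxFieldValueUpdateProcessorFactory is the analogous choice when you want to sort by the highest value, which matches Shyamsunder's two-field suggestion above.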
Re: Solrcloud performance issues
Why do you use 15 replicas? More replicas make updates slower. -- View this message in context: http://lucene.472066.n3.nabble.com/Solrcloud-performance-issues-tp4186035p4188738.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Regarding behavior of docValues.
You're making it too complicated. Both a docValues field and an indexed (not docValues) field will give you the same functionality. For rapidly changing indexes, docValues will load more quickly when a new searcher is opened. Your question below ("Can it be <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" /> or two fields, one each for sorting+faceting and for searching, like <field name="manu_exact" type="string" indexed="true" stored="false" /> <field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />") is not really relevant. You simply cannot sort, search, or facet on any field for which indexed="false". You can do all three on any field where indexed="true" (assuming it's not multiValued and only has one token, since sorting only really makes sense for single-valued fields). It doesn't matter whether the field is docValues="true" or not. So if you want a rule of thumb, make it a docValues field if you're updating your index rapidly. Otherwise whether a field is docValues or not is largely irrelevant. Best, Erick On Tue, Feb 24, 2015 at 9:09 PM, Modassar Ather modather1...@gmail.com wrote: So for a requirement where I have a field which is used for sorting, faceting and searching, what should the better field definition be? Can it be <field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" /> or two fields, one each for sorting+faceting and for searching, like the following: <field name="manu_exact" type="string" indexed="true" stored="false" /> <field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" /> Kindly note that it will be better if I can use the existing field for sorting and faceting and add searching on it, as in the first example above. Regards, Modassar On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, that's not my understanding.
docValues are simply a different layout for storing the _indexed_ values that facilitates rapid loading of the field from disk, essentially putting the uninverted field value in a conveniently-loadable form. So AFAIK, the field is stored only once and used for all three, sorting, faceting and searching. Best, Erick On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com wrote: Thanks for your response Mikhail. On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Both statements seem true to me. On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote: Hi, Kindly help me understand the behavior of following field. field name=manu_exact type=string indexed=true stored=false docValues=true / For a field like above where indexed=true and docValues=true, is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
New leader/replica solution for HDFS
We use HDFS as our Solr index storage, and we have a really heavy update load. We have run into many problems with the current leader/replica solution: there is duplicate index computation on the replica side, and the data sync between leader and replica is always a problem. Since HDFS already provides data replication at the data layer, could Solr provide just service-layer replication? My thought is that the leader and the replica would both bind to the same index data directory. The leader would build the index for new requests, and the replica would just keep its index version up to date with the leader (e.g. via a periodic soft commit?). If the leader is lost, the replica would take over immediately. Thanks for any suggestions on this idea. -- View this message in context: http://lucene.472066.n3.nabble.com/New-leader-replica-solution-for-HDFS-tp4188735.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Regarding behavior of docValues.
Thanks Erick for your detailed response. Sorry! I missed to put that I was trying to understand it in context of Solr-5.0.0 where fieldcache is no more available. Regards, Modassar On Wed, Feb 25, 2015 at 11:26 AM, Erick Erickson erickerick...@gmail.com wrote: You're making it too complicated. Both a docValues field and an indexed (not docValues) field will give you the same functionality. For rapidly changing indexes, docValues will load more quickly when a new searcher is opened. Your question below is not really relevant. Can it be *field name=manu_exact type=string indexed=true stored=false docValues=true /* or Two fields each for sorting+faceting and for searching like following. *field name=manu_exact type=string indexed=true stored=false /field name=manu_exact_sort type=string indexed=false stored=false docValues=true /* * You simply cannot sort, search, or facet on any field for which indexed=false. You can do all three on any field where indexed=true (assuming it's not multiValued and only has one token since sorting only really makes sense for single-valued fields). It doesn't matter whether the field is docValues=true or not. So if you want a rule of thumb, make it a docValues field if you're updating your index rapidly. Otherwise whether a field is docValues or not is largely irrelevant. Best, Erick On Tue, Feb 24, 2015 at 9:09 PM, Modassar Ather modather1...@gmail.com wrote: So for a requirement where I have a field which is used for sorting, faceting and searching what should be the better field definition. Can it be *field name=manu_exact type=string indexed=true stored=false docValues=true /* or Two fields each for sorting+faceting and for searching like following. *field name=manu_exact type=string indexed=true stored=false /field name=manu_exact_sort type=string indexed=false stored=false docValues=true /* Kindly note that it will be better if can use existing field for sorting, faceting and add searching on it like in example one above. 
Regards, Modassar On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, that's not my understanding. docValues are simply a different layout for storing the _indexed_ values that facilitates rapid loading of the field from disk, essentially putting the uninverted field value in a conveniently-loadable form. So AFAIK, the field is stored only once and used for all three, sorting, faceting and searching. Best, Erick On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com wrote: Thanks for your response Mikhail. On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Both statements seem true to me. On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote: Hi, Kindly help me understand the behavior of following field. field name=manu_exact type=string indexed=true stored=false docValues=true / For a field like above where indexed=true and docValues=true, is it that: 1) For sorting/faceting on *manu_exact* the docValues will be used. 2) For querying on *manu_exact* the inverted index will be used. Thanks, Modassar -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Setting Up an External ZooKeeper Ensemble
update: everything is working fine after I downloaded a new ZooKeeper and set the same configuration. Thanks for helping. On Tue, Feb 24, 2015 at 5:13 PM, CKReddy Bhimavarapu chaitu...@gmail.com wrote: yes chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh start JMX enabled by default Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg Starting zookeeper ... STARTED chaitanya@imart-desktop:~/solr/zookeeper-3.4.6/bin$ ./zkServer.sh status JMX enabled by default Using config: /home/chaitanya/solr/zookeeper-3.4.6/bin/../conf/zoo.cfg Error contacting service. It is probably not running but I don't get why it is not running On Tue, Feb 24, 2015 at 1:45 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Looks like the ZooKeeper server is either not running or not accepting connections, possibly because of some configuration issue. Can you look into the ZooKeeper logs and see if there are any exceptions? On Tue, Feb 24, 2015 at 11:30 AM, CKReddy Bhimavarapu chaitu...@gmail.com wrote: Hi, I did follow all the steps in [https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble] but still I am getting this error: "Waiting to see Solr listening on port 8983 [-] Still not seeing Solr listening on 8983 after 30 seconds!" WARN - 2015-02-24 05:50:19.161; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) WARN - 2015-02-24 05:50:20.262; org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect Where am I going
wrong? -- ckreddybh. chaitu...@gmail.com -- Regards, Shalin Shekhar Mangar. -- ckreddybh. chaitu...@gmail.com -- ckreddybh. chaitu...@gmail.com
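For reference, a minimal standalone zoo.cfg of the kind that walkthrough describes (the paths, port, and host names here are illustrative assumptions, not taken from this thread):

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
# For a 3-node ensemble, also list the servers and put a matching
# "myid" file (containing 1, 2 or 3) in each node's dataDir:
# server.1=zk1.example.com:2888:3888
# server.2=zk2.example.com:2888:3888
# server.3=zk3.example.com:2888:3888
```

With a working configuration, `./zkServer.sh status` reports a Mode line (standalone, or leader/follower for an ensemble) instead of "Error contacting service".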
Solr Document expiration with TTL
Hello, We are trying to add documents in Solr with a TTL defined (the document expiration feature), which are expected to expire at the specified time, but they do not. Following are the settings we have defined in solrconfig.xml and managed-schema. Solr version: 5.0.0

solrconfig.xml:

<updateRequestProcessorChain default="true">
  <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
    <int name="autoDeletePeriodSeconds">30</int>
    <str name="ttlFieldName">time_to_live_s</str>
    <str name="expirationFieldName">expire_at_dt</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

managed-schema:

<field name="id" type="string" indexed="true" stored="true" multiValued="false" />
<field name="time_to_live_s" type="string" stored="true" multiValued="false" />
<field name="expire_at_dt" type="date" stored="true" multiValued="false" />

Solr query: The following query posts a document and sets expire_at_dt explicitly. That works perfectly, and the document expires at the defined time.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true' -d '[{"id":"10seconds","expire_at_dt":"NOW+10SECONDS"}]'

But when posting with a TTL (the following query), the document does not expire after the given time.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/collection1/update?commit=true' -d '[{"id":"10seconds","time_to_live_s":"+10SECONDS"}]'

Any help would be appreciated. Thanks, Makailol
Re: how to debug solr performance degradation
On 2/24/2015 5:45 PM, Tang, Rebecca wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? I would like to know what memory numbers in which program you are looking at, and why you believe those numbers are a problem. The JVM has a very different view of memory than the operating system. Numbers in top mean different things than numbers on the dashboard of the admin UI, or the numbers in jconsole. If you're on Windows, then replace top with task manager, process explorer, resource monitor, etc. Please provide as many details as you can about the things you are looking at. Thanks, Shawn
Re: [ANNOUNCE] Apache Solr 5.0.0 and Reference Guide for Solr 5.0 released
Awesome news. Thanks. *Sebastián Ramírez* Diseñador de Algoritmos http://www.senseta.com Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo https://twitter.com/tiangolo Email: sebastian.rami...@senseta.com www.senseta.com On Fri, Feb 20, 2015 at 3:55 PM, Anshum Gupta ans...@anshumgupta.net wrote: 20 February 2015, Apache Solr™ 5.0.0 and Reference Guide for Solr 5.0 available The Lucene PMC is pleased to announce the release of Apache Solr 5.0.0 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 5.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html See the CHANGES.txt file included with the release for a full list of details. Solr 5.0 Release Highlights: * Usability improvements that include improved bin scripts and new and restructured examples. * Scripts to support installing and running Solr as a service on Linux. * Distributed IDF is now supported and can be enabled via the config. Currently, there are four supported implementations for the same: * LocalStatsCache: Local document stats. * ExactStatsCache: One time use aggregation * ExactSharedStatsCache: Stats shared across requests * LRUStatsCache: Stats shared in an LRU cache across requests * Solr will no longer ship a war file and instead be a downloadable application. * SolrJ now has first class support for Collections API. * Implicit registration of replication,get and admin handlers. 
* Config API that supports paramsets for easily configuring solr parameters and configuring fields. This API also supports managing of pre-existing request handlers and editing common solrconfig.xml via overlay. * API for managing blobs allows uploading request handler jars and registering them via config API. * BALANCESHARDUNIQUE Collection API that allows for even distribution of custom replica properties. * There's now an option to not shuffle the nodeSet provided during collection creation. * Option to configure bandwidth usage by Replication handler to prevent it from using up all the bandwidth. * Splitting of clusterstate to per-collection enables scalability improvement in SolrCloud. This is also the default format for new Collections that would be created going forward. * timeAllowed is now used to prematurely terminate requests during query expansion and SolrClient request retry. * pivot.facet results can now include nested stats.field results constrained by those pivots. * stats.field can be used to generate stats over the results of arbitrary numeric functions. It also allows for requesting for statistics for pivot facets using tags. * A new DateRangeField has been added for indexing date ranges, especially multi-valued ones. * Spatial fields that used to require units=degrees now take distanceUnits=degrees|kilometers|miles instead. * MoreLikeThis query parser allows requesting for documents similar to an existing document and also works in SolrCloud mode. * Logging improvements: * Transaction log replay status is now logged * Optional logging of slow requests. Solr 5.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release. Detailed change log: http://lucene.apache.org/solr/5_0_0/changes/Changes.html Also available is the *Solr Reference Guide for Solr 5.0*. This 535 page PDF serves as the definitive user's manual for Solr 5.0.
It can be downloaded from the Apache mirror network: https://s.apache.org/Solr-Ref-Guide-PDF Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. -- Anshum Gupta http://about.me/anshumgupta -- ** *This e-mail transmission, including any attachments, is intended only for the named recipient(s) and may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you have received this transmission in error, or are not the named recipient(s), please notify Senseta immediately by return e-mail and permanently delete
Re: how to debug solr performance degradation
Be careful what you think is being used by Solr since Lucene uses MMapDirectories under the covers, and this means you might be seeing virtual memory. See Uwe's excellent blog here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Best, Erick On Tue, Feb 24, 2015 at 5:02 PM, Walter Underwood wun...@wunderwood.org wrote: The other memory is used by the OS as file buffers. All the important parts of the on-disk search index are buffered in memory. When the Solr process wants a block, it is already right there, no delays for disk access. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Feb 24, 2015, at 4:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote: We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to not grab all the physical memory (grabbing so little of the physical memory)? Rebecca Tang Applications Developer, UCSF CKM Industry Documents Digital Libraries E: rebecca.t...@ucsf.edu On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote: On 2/24/2015 1:09 PM, Tang, Rebecca wrote: Our solr index used to perform OK on our beta production box (anywhere between 0-3 seconds to complete any query), but today I noticed that the performance is very bad (queries take between 12-15 seconds). I haven't updated the solr index configuration (schema.xml/solrconfig.xml) lately. All that's changed is the data - every month, I rebuild the solr index from scratch and deploy it to the box. We will eventually go to incremental builds. But for now, all indexes are built from scratch.
Here are the stats: Solr index size 183G Documents in index 14364201 We just have single solr box It has 100G memory 500G Harddrive 16 cpus The bottom line on this problem, and I'm sure it's not something you're going to want to hear: You don't have enough memory available to cache your index. I'd plan on at least 192GB of RAM for an index this size, and 256GB would be better. Depending on the exact index schema, the nature of your queries, and how large your Java heap for Solr is, 100GB of RAM could be enough for good performance on an index that size ... or it might be nowhere near enough. I would imagine that one of two things is true here, possibly both: 1) Your queries are very complex and involve accessing a very large percentage of the index data. 2) Your Java heap is enormous, leaving very little RAM for the OS to automatically cache the index. Adding more memory to the machine, if that's possible, might fix some of the problems. You can find a discussion of the problem here: http://wiki.apache.org/solr/SolrPerformanceProblems If you have any questions after reading that wiki article, feel free to ask them. Thanks, Shawn
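Erick's and Walter's points (MMapDirectory, OS file buffers) can be demonstrated outside of Solr: memory-mapping a file reserves virtual address space while the data itself lives in the OS page cache, not on the process heap. A minimal illustrative sketch in Python (not Lucene code; the file contents here are made up):

```python
import mmap
import os
import tempfile

# Illustrative sketch, not Solr code: Lucene's MMapDirectory maps index
# files into virtual address space, so OS tools report a large *virtual*
# size for the process while the data is actually held in the OS page
# cache rather than the Java heap.
def mapped_read(path):
    with open(path, "rb") as f:
        # Mapping the whole file reserves address space; it does not copy
        # the file into process heap memory.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[:4]  # touching pages pulls them into the page cache

fd, path = tempfile.mkstemp()
os.write(fd, b"lucene-segment-data")
os.close(fd)
first = mapped_read(path)
os.remove(path)
print(first)  # b'luce'
```

This is consistent with the observation above: once the index data sits in the page cache, most tools report it as OS file-buffer memory rather than as memory used by the Solr process itself.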
Re: how to debug solr performance degradation
Rebecca, you don’t want to give all the memory to the JVM. You want to give it just enough for it to work optimally and leave the rest of the memory for the OS to use for caching data. Giving the JVM too much memory can result in worse performance because of GC. There is no magic formula for figuring out the memory allocation for the JVM; it is very dependent on the workload. In your case I would start with 5GB, and increment by 5GB with each run. I also use these settings for the JVM:

-XX:+UseG1GC -Xms1G -Xmx1G -XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200

I got them from this list so can’t take credit for them, but they work for me. Cheers, François
Re: how to debug solr performance degradation
Rebecca, I would suggest making sure you have some GC logging configured so you have some visibility into the JVM, especially if you don't already have JMX or an sFlow agent configured to give you external visibility of those internal metrics. The options below just print the GC activity to a log:

-Xloggc:gc.log -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintClassHistogram -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:+PrintAdaptiveSizePolicy -XX:+PrintTLAB -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10m

On the memory tuning side of things, as has already been mentioned, try to leave as much memory (outside the JVM) available to your OS to cache as much of the actual index as possible. In your case, you have a lot of RAM, so I would suggest starting with the GC logging options above, plus these very basic JVM memory settings:

-XX:+UseG1GC -Xms2G -Xmx4G -XX:+UseAdaptiveSizePolicy -XX:MaxGCPauseMillis=1000 -XX:GCTimeRatio=19

In short, start by letting the JVM tune itself ;) then start looking at the actual GC behavior (this will be visible in the GC logs).

On the OS performance monitoring side, a few real-time tools I like to use on Linux: nmon, dstat, htop. For trending, start with the basics (sysstat/sar) and build from there (hsflowd is super easy to install and gets data pushing up to a central console like ganglia). You can add to that by adding the sFlow JVM agent to your Solr environment. Enabling the JMX interface on Jetty will let you use tools like jconsole or jvisualvm.
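Once GC logging is enabled with flags like those above, the resulting log can be scanned for pause durations. A minimal sketch, assuming the typical HotSpot ", N secs]" line suffix (the exact format varies across JVM versions and flag combinations):

```python
import re

# Hedged sketch: a minimal parser for pause times in a GC log produced by
# -verbose:gc / -XX:+PrintGCDetails style flags. The ", N secs]" suffix
# matched here is an assumption based on typical HotSpot output.
PAUSE_RE = re.compile(r"([0-9]+\.[0-9]+) secs\]")

def pause_seconds(log_lines):
    """Return the GC pause durations (in seconds) found in the log lines."""
    pauses = []
    for line in log_lines:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append(float(m.group(1)))
    return pauses

# Hypothetical log lines, shaped like common HotSpot GC output:
sample = [
    "12.345: [GC pause (G1 Evacuation Pause) (young) 512M->128M(4096M), 0.0456789 secs]",
    "15.010: [Full GC (Allocation Failure) 3900M->800M(4096M), 2.3100000 secs]",
]
print(max(pause_seconds(sample)))  # 2.31 (longest observed pause)
```

Long or frequent pauses in that list are the signal to revisit heap size and collector settings.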
Re: how to debug solr performance degradation
Meant to type "JMX or sFlow agent". Also, I should have mentioned that you want to be running a very recent JDK.
Re: Regarding behavior of docValues.
So for a requirement where I have a field which is used for sorting, faceting and searching, what would be the better field definition? Can it be

<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />

or two fields, one for sorting+faceting and one for searching, like the following:

<field name="manu_exact" type="string" indexed="true" stored="false" />
<field name="manu_exact_sort" type="string" indexed="false" stored="false" docValues="true" />

Kindly note that it would be better if I can use the existing field for sorting and faceting and add searching on it, as in the first example above. Regards, Modassar

On Tue, Feb 24, 2015 at 11:15 PM, Erick Erickson erickerick...@gmail.com wrote:
Hmmm, that's not my understanding. docValues are simply a different layout for storing the _indexed_ values that facilitates rapid loading of the field from disk, essentially putting the uninverted field value in a conveniently loadable form. So AFAIK, the field is stored only once and used for all three: sorting, faceting and searching. Best, Erick

On Tue, Feb 24, 2015 at 4:13 AM, Modassar Ather modather1...@gmail.com wrote:
Thanks for your response, Mikhail.

On Tue, Feb 24, 2015 at 5:35 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
Both statements seem true to me.

On Tue, Feb 24, 2015 at 2:49 PM, Modassar Ather modather1...@gmail.com wrote:
Hi, kindly help me understand the behavior of the following field:

<field name="manu_exact" type="string" indexed="true" stored="false" docValues="true" />

For a field like the above, where indexed=true and docValues=true, is it that: 1) For sorting/faceting on manu_exact the docValues will be used. 2) For querying on manu_exact the inverted index will be used. Thanks, Modassar

-- Sincerely yours, Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
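Erick's description (one set of indexed values serving search via the inverted index, and sorting/faceting via a docValues-style column) can be sketched conceptually in plain Python; this is an illustration of the two layouts with made-up data, not Lucene's actual data structures:

```python
from collections import defaultdict

# Conceptual sketch: the same indexed values for a field like manu_exact
# can back both searching (inverted index: term -> doc ids) and
# sorting/faceting (docValues-style column: doc id -> value).
docs = {0: "canon", 1: "sony", 2: "canon"}

# Inverted index: answers queries like manu_exact:canon
inverted = defaultdict(list)
for doc_id, value in docs.items():
    inverted[value].append(doc_id)

# docValues-style column: a doc-ordered array, cheap to load for
# sorting and faceting without un-inverting at query time
doc_values = [docs[i] for i in range(len(docs))]

print(inverted["canon"])  # [0, 2] -- docs matching the term
print(sorted(range(len(docs)), key=lambda d: doc_values[d]))  # [0, 2, 1]
facets = {term: len(ids) for term, ids in inverted.items()}
print(facets)  # {'canon': 2, 'sony': 1}
```

This matches Erick's point: a single field definition with indexed="true" and docValues="true" gives both access paths, so a separate *_sort field is not needed for this use case.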