Re: Solr using a ridiculous amount of memory
> That was strange. As you are using a multi-valued field with the new setup, they should appear there.

Yes, the new field we use for faceting is a multi-valued field.

> Can you find the facet fields in any of the other caches?

Yes, here it is, in the field cache: http://screencast.com/t/mAwEnA21yL

> I hope you are not calling the facets with facet.method=enum? Could you paste a typical facet-enabled search request?

Here is a typical example (I added newlines for readability):

http://172.22.51.111:8000/solr/default1_Danish/search
?defType=edismax
q=*%3a*
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_7+key%3ditemvariantoptions_int_mv_7%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_9+key%3ditemvariantoptions_int_mv_9%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_8+key%3ditemvariantoptions_int_mv_8%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_2+key%3ditemvariantoptions_int_mv_2%7ditemvariantoptions_int_mv
fq=site_guid%3a(10217)
fq=item_type%3a(PRODUCT)
fq=language_guid%3a(1)
fq=item_group_1522_combination%3a(*)
fq=is_searchable%3a(True)
sort=item_group_1522_name_int+asc, variant_of_item_guid+asc
querytype=Technical
fl=feed_item_serialized
facet=true
group=true
group.facet=true
group.ngroups=true
group.field=groupby_variant_of_item_guid
group.sort=name+asc
rows=0

> Are you warming all the sort- and facet-fields?

I'm sorry, I don't know. I have the field value cache commented out in my config, so... whatever is default?

Removing the custom sort fields is unfortunately quite a bit more difficult than my other facet modification. The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name.
To solve it in the same way I solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that is possible. I am quite stumped on how to fix this.

On Wed, Apr 17, 2013 at 3:06 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

> John Nielsen [j...@mcb.dk]: I never seriously looked at my fieldValueCache. It never seemed to get used: http://screencast.com/t/YtKw7UQfU
>
> That was strange. As you are using a multi-valued field with the new setup, they should appear there. Can you find the facet fields in any of the other caches? ...I hope you are not calling the facets with facet.method=enum? Could you paste a typical facet-enabled search request?
>
> Yep. We still do a lot of sorting on dynamic field names, so the field cache has a lot of entries. (9.411 entries as we speak. This is considerably lower than before.) You mentioned in an earlier mail that faceting on a field shared between all facet queries would bring down the memory needed. Does the same thing go for sorting?
>
> More or less. Sorting stores the raw string representations (UTF-8) in memory, so the number of unique values has more to say than it does for faceting. Just as with faceting, a list of pointers from documents to values (1 value/document as we are sorting) is maintained, so the overhead is something like
>
>   #documents*log2(#unique_terms*average_term_length) + #unique_terms*average_term_length
>
> (where average_term_length is in bits)
>
> Caveat: This is with the index-wide sorting structure. I am fairly confident that this is what Solr uses, but I have not looked at it lately, so it is possible that some memory-saving segment-based trickery has been implemented.
>
> Do those 9411 entries duplicate data between them? Sorry, I do not know. SOLR- discusses the problems with the field cache and duplication of data, but I cannot infer if it has been solved or not.
>
> I am not familiar with the stat breakdown of the fieldCache, but it _seems_ to me that there are 2 or 3 entries for each segment for each sort field. Guesstimating further, let's say you have 30 segments in your index. Going with the guesswork, that would bring the number of sort fields to 9411/3/30 ~= 100. Looks like you use a custom sort field for each client?
>
> Extrapolating from 1.4M documents and 180 clients, let's say that there are 1.4M/180/5 unique terms for each sort field and that their average length is 10. We thus have 1.4M*log2(1500*10*8) + 1500*10*8 bit ~= 23MB per sort field, or about 4GB for all 180 fields. With this few unique values, the doc-value structure is by far the biggest, just as with facets. As opposed to the faceting structure, this is fairly close to the actual memory usage. Switching to a single sort field would reduce the memory usage from 4GB to about 55MB.
>
> I do commit a bit more often than I should. I get these in my log file from time to time: PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> So 1 active searcher and 2 warming searchers. Ignoring that one of the warming searchers is highly likely to
Solr 4.2 fl issue
We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6 is fine. Something like: fl=098765-765-788558-7654_userid stored as a string. The issue occurs when the GUID begins with a numeric character followed by a minus. This is a bug. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Lucene Sorting
Hi,

We are facing a sorting issue with data indexed using Solr. Below is the sample code. The problem is that the data returned by the code below is not properly sorted, i.e. there is no ordering of the data. Can anyone assist me with this?

TopDocs topDocs = null;
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
Sort column = new Sort(new SortField(sortColumn, SortField.STRING, reverse));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
queryParser = new QueryParser(Version.LUCENE_36, fieldName, analyzer);
queryParser.setAllowLeadingWildcard(true);
queryParser.setDefaultOperator(Operator.AND);
topDocs = searcher.search(queryParser.parse(queryStr), filter, maxHits, column);

Thanks!

Regards,
Pankaj
Re: facet.method enum vs fc
On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote: I am doing faceting on an index of 120M documents, on the field of url[...] I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr? - Toke Eskildsen
SolrCloud loadbalancing, replication, and failover
Step 1: distribute processing

We have 2 servers on which we'll run 2 SolrCloud instances. We'll define 2 shards so that both servers are busy for each request (improving response time of the request).

Step 2: failover

We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), the other will be able to take over automatically. So we define 2 shards with a replication factor of 2. So we have:

. Server 1: Shard 1, Replica 2
. Server 2: Shard 2, Replica 1

Question: But in SolrCloud, replicas are active, right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request? After all, it has both shards due to the replication, right?
Re: Select Queries While Merging Indexes
Thanks for the explanations. I should read up on the lifecycle of Searcher objects. Should I read about them in a Lucene book, or is there Solr documentation or a book that covers it?

2013/4/18 Jack Krupansky j...@basetechnology.com

> merging indexes
>
> The proper terminology is merging segments. Until the new, merged segment is complete, the existing segments remain untouched and readable.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Furkan KAMACI Sent: Wednesday, April 17, 2013 6:28 PM To: solr-user@lucene.apache.org Subject: Select Queries While Merging Indexes
>
> I see that while merging indexes (I mean optimizing via the admin GUI), my Solr instance can still respond to select queries (as well). How does that querying mechanism work (merging is not finished yet, but my Solr instance can still return a consistent response)?
Re: Max http connections in CloudSolrServer
Thanks for this. The reason I asked this was: when I fire 30 queries simultaneously from 30 threads using the same CloudSolrServer instance, some queries get fired after a delay.. sometimes the delay is 30-50 seconds... In the Solr logs I can see 20+ queries get fired almost immediately... but some of them get fired late.. I increased the connections per host from 32 to 200.. still no respite...

./zahoor

On 18-Apr-2013, at 12:20 AM, Shawn Heisey s...@elyograg.org wrote:

> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
> params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
> HttpClient client = HttpClientUtil.createClient(params);
> LBHttpSolrServer lbServer = new LBHttpSolrServer(client, "http://localhost/solr");
> lbServer.removeSolrServer("http://localhost/solr");
> SolrServer server = new CloudSolrServer(zkHost, lbServer);
Re: Solr using a ridiculous amount of memory
On Thu, 2013-04-18 at 08:34 +0200, John Nielsen wrote:

> [Toke: Can you find the facet fields in any of the other caches?]
> Yes, here it is, in the field cache: http://screencast.com/t/mAwEnA21yL

Ah yes, mystery solved, my mistake.

> http://172.22.51.111:8000/solr/default1_Danish/search
> [...]
> fq=site_guid%3a(10217)

This constrains hits to a specific customer, right? Any search will only be in a single customer's data?

> [Toke: Are you warming all the sort- and facet-fields?]
> I'm sorry, I don't know. I have the field value cache commented out in my config, so... Whatever is default?

(a bit shaky here) I would say not warming. You could check simply by starting Solr and looking at the caches before you issue any searches.

This fits the description of your searchers gradually eating memory until your JVM OOMs. Each time a new field is faceted or sorted upon, it is added to the cache. As your index is relatively small and the number of values in the single fields is small, the initialization time for a field is so short that it is not a performance problem. Memory-wise it is death by a thousand cuts.

If you did explicit warming of all the possible fields for sorting and faceting, you would allocate it all up front and would be sure that there would be enough memory available. But it would take much longer than your current setup. You might want to try it out (no need to fiddle with the Solr setup, just make a script and fire wgets, as this has the same effect).

> The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name. To solve it in the same way i solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that it's possible.

That is correct. Three suggestions off the bat:

1) Reduce the number of sort fields by mapping names.

Count the maximum number of unique sort fields for any given customer. That will be the total number of sort fields in the index. For each group number for a customer, map that number to one of the index-wide sort fields. This only works if the maximum number of unique fields is low (let's say a single field takes 50MB, so 20 fields should be okay).

2) Create a custom sorter for Solr.

Create a field with all the sort values, prefixed by group ID. Create a structure (or reuse the one from Lucene) with a doc-terms map holding all the terms in memory. When sorting, extract the relevant compare-string for a document by iterating all the terms for the document and selecting the one with the right prefix. Memory-wise this scales linearly with the number of terms instead of the number of fields, but it would require quite some coding.

3) Switch to a layout where each customer has a dedicated core.

The basic overhead is a lot larger than for a shared index, but it would make your setup largely immune to the adverse effect of many documents coupled with many facet- and sort-fields.

- Toke Eskildsen, State and University Library, Denmark
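The "script firing wgets" idea above can be sketched as a dry run. This is a hypothetical example: the Solr URL and the two field names are placeholders standing in for your own sort- and facet-fields; it only prints the warm-up URLs, and the comment shows how to actually fire them with wget.

```shell
#!/bin/sh
# Dry-run warm-up sketch: emit one facet query and one sort query per
# field. Issuing these against Solr would populate the relevant caches
# up front. SOLR_URL and FIELDS are placeholders for your own setup.
SOLR_URL="http://localhost:8983/solr/default1_Danish/search"
FIELDS="itemvariantoptions_int_mv item_group_1522_name_int"
for f in $FIELDS; do
  echo "$SOLR_URL?q=*:*&rows=0&facet=true&facet.field=$f"
  echo "$SOLR_URL?q=*:*&rows=0&sort=$f%20asc"
done
# To actually warm, pipe the URLs into wget:
#   ./warm.sh | while read -r url; do wget -q -O /dev/null "$url"; done
```

Running this after every searcher open (e.g. from a postCommit hook or cron) gives the same effect as configured warming, without touching the Solr config.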
Re: SolrCloud vs Solr master-slave replication
Thank you again for your answer, Shawn. The network card seems to work fine, but we've found segmentation faults, so now our hosting provider is going to run a full hardware check. Hopefully they'll replace the server and the problem will be solved.

Regards, Victor

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-vs-Solr-master-slave-replication-tp4055541p4056925.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud vs Solr master-slave replication
Also, I forgot to say... the same error started to happen again.. the index is again corrupted :( -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-vs-Solr-master-slave-replication-tp4055541p4056926.html
Re: Solr using a ridiculous amount of memory
> http://172.22.51.111:8000/solr/default1_Danish/search
> [...]
> fq=site_guid%3a(10217)
>
> This constrains hits to a specific customer, right? Any search will only be in a single customer's data?

Yes, that's right. No search from any given client ever returns anything from another client.

> [Toke: Are you warming all the sort- and facet-fields?]
> I'm sorry, I don't know. I have the field value cache commented out in my config, so... Whatever is default?
>
> (a bit shaky here) I would say not warming. You could check simply by starting Solr and looking at the caches before you issue any searches.

The field cache shows 0 entries at startup. On the running server, forcing a commit (and thus opening a new searcher) does not change the number of entries.

> The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name. To solve it in the same way i solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that it's possible.
>
> That is correct. Three suggestions off the bat:
>
> 1) Reduce the number of sort fields by mapping names. Count the maximum number of unique sort fields for any given customer. That will be the total number of sort fields in the index. For each group number for a customer, map that number to one of the index-wide sort fields. This only works if the maximum number of unique fields is low (let's say a single field takes 50MB, so 20 fields should be okay).

I just checked our DB. Our worst-case client has over a thousand groups for sorting. Granted, it may be, probably is, an error in the data. It is an interesting idea though, and I will look into this possibility.

> 3) Switch to a layout where each customer has a dedicated core. The basic overhead is a lot larger than for a shared index, but it would make your setup largely immune to the adverse effect of many documents coupled with many facet- and sort-fields.

Now this is where my brain melts down. If I understand the fieldCache mechanism correctly (which I can see that I don't), the data used for faceting and sorting is saved in the fieldCache using a key comprised of the fields used for said faceting/sorting. That data only contains the data which is actually used for the operation. This is what the fq queries are for. So if I generate a core for each client, I would have a client-specific fieldCache containing the data from that client. Wouldn't I just split up the same data into several cores? I'm afraid I don't understand how this would help.

-- Med venlig hilsen / Best regards

*John Nielsen* Programmer

*MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
On Thu, 2013-04-18 at 11:59 +0200, John Nielsen wrote:

> Yes, that's right. No search from any given client ever returns anything from another client.

Great. That makes the 1 core/client solution feasible.

[No sort/facet warmup is performed]
[Suggestion 1: Reduce the number of sort fields by mapping]
[Suggestion 3: 1 core/customer]

> If I understand the fieldCache mechanism correctly (which i can see that I don't), the data used for faceting and sorting is saved in the fieldCache using a key comprised of the fields used for said faceting/sorting. That data only contains the data which is actually used for the operation. This is what the fq queries are for.

You are missing an essential part: Both the facet and the sort structures need to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields.

It might help to visualize the structures as arrays of values with docID as index:

String[] myValues = new String[1400000];

takes up 1.4M * 32 bit (or more on a 64-bit machine) = 5.6MB, even when it is empty. Note: Neither String objects nor Java references are used for the real facet- and sort-structures, but the principle is quite the same.

> So if i generate a core for each client, I would have a client specific fieldCache containing the data from that client. Wouldn't I just split up the same data into several cores?

The same terms, yes, but not the same references. Let's say your customer has 10K documents in the index and that there are 100 unique values, each 10 bytes long, in each group. As each group holds its own separate structure, we use the old formula to get the memory overhead:

#documents*log2(#unique_terms*average_term_length) + #unique_terms*average_term_length

1.4M*log2(100*(10*8)) + 100*(10*8) bit = 1.2MB + 1KB.

Note how the values themselves are just 1KB, while the nearly empty reference list takes 1.2MB.

Compare this to a dedicated core with just the 10K documents:

10K*log2(100*(10*8)) + 100*(10*8) bit = 8.5KB + 1KB.

The terms take up exactly the same space, but the heap requirement for the references is reduced by 99%.

Now, 25GB for 180 clients means 140MB/client with your current setup. I do not know the memory overhead of running a core, but since Solr can run fine with 32MB for small indexes, it should be smaller than that. You will of course have to experiment and measure.

- Toke Eskildsen, State and University Library, Denmark
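The shared-vs-dedicated comparison above can be put into a small calculator. This is a sketch, not Solr's actual code, and it uses a deliberately simplified model (an assumption of mine, not stated this way in the thread): each per-document pointer costs roughly log2(#unique_terms) bits, which lands close to the ballpark figures quoted in the discussion.

```java
public class FieldCachePointerEstimate {
    // Simplified model (assumption, not Solr internals): every sort/facet
    // structure keeps one pointer per document in the whole index, each
    // pointer needing roughly log2(#unique_terms) bits, plus the terms
    // themselves, which cost the same regardless of index size.
    static long pointerBytes(long docsInIndex, long uniqueTerms) {
        double bitsPerDoc = Math.log(uniqueTerms) / Math.log(2);
        return Math.round(docsInIndex * bitsPerDoc / 8);
    }

    public static void main(String[] args) {
        long termBytes = 100 * 10; // 100 unique values of 10 bytes each
        // One customer group in the shared 1.4M-doc index vs a 10K-doc core
        long shared = pointerBytes(1_400_000, 100);
        long dedicated = pointerBytes(10_000, 100);
        System.out.println("shared index pointers:   ~" + shared / 1024 + " KB");
        System.out.println("dedicated core pointers: ~" + dedicated / 1024 + " KB");
        System.out.println("terms either way:        " + termBytes + " B");
    }
}
```

The pointer array shrinks with the document count of the core while the term data stays constant, which is exactly why the dedicated-core layout wins here.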
TooManyClauses: maxClauseCount is set to 1024
This error is quite confusing. I had a situation where I had to turn on highlighting. In some cases, even though the number of docs found for a particular query was, for example, 2, highlighting came back for only 1. I did some checks and found that the text searched for was in the bigger document, towards the end of it. So I increased hl.maxAnalyzedChars from the default value of 51200 to a bigger value, say 500. And then it started working; the highlighting worked properly.

Now I have encountered one more problem with the same error. There is a document which returns the maxClauseCount error when I search on content:*. The document is quite big in size, and hl.maxAnalyzedChars was the default, i.e. 51200. I tried decreasing that and found that the error comes at exactly 31375 chars (I found this by setting hl.maxAnalyzedChars to 31375; it worked fine up to 31374). Solutions are most welcome as I am in great need of this. A sample query is below:

http://localhost:8983/solr/test/select/?q=content:* AND obs_date:[2010-01-01T00:00:00Z%20TO%202011-12-31T23:59:59Z]&fl=content&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=1500&hl.requireFieldMatch=true&hl.alternateField=content&hl.maxAlternateFieldLength=1500&hl.maxAnalyzedChars=31375&facet.limit=200&facet.mincount=1&start=64&rows=1&sort=obs_date%20desc

Regards, Sawan

-- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965.html
Re: TooManyClauses: maxClauseCount is set to 1024
Just increase the value of maxClauseCount in your solrconfig.xml. Keep it large enough.

Best
Pravesh

-- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4056966.html
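For reference, the knob being discussed lives in the query section of solrconfig.xml; the value below is an example, not a recommendation, since each clause of a rewritten wildcard or range query costs memory and CPU:

```xml
<!-- solrconfig.xml, inside the <query> section: raises the
     BooleanQuery clause limit from the default of 1024. -->
<maxBooleanClauses>4096</maxBooleanClauses>
```

Note that in a multi-core setup the highest value across all loaded cores wins, since the limit is a JVM-wide Lucene setting.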
Re: TooManyClauses: maxClauseCount is set to 1024
Can you provide a full stack trace of the exception? There's a maxClauseCount in solrconfig.xml that you can increase to work around the issue.

-Yonik
http://lucidworks.com

On Thu, Apr 18, 2013 at 7:31 AM, sawanverma sawan.ve...@glassbeam.com wrote:

> Its quite confusing about this error. I had a situation where i have to turn on the highlighting. In some cases though the number of docs found for a particular query was for example say 2, the highlighting was coming only for 1. I did some checks and found that that particular text searched was in the bigger document and was towards the end of the document. So i increased the hl.maxAnalyzedChar from default value of 51200 to a bigger value say 500. And then it started working, i mean now the highlighting was working properly.
>
> Now, i have encountered one more problem with the same error. There is a document which is returning the maxClauseCount error when i do search on content:*. The document is quite big in size and the hl.maxAnalyzedChars was default i.e. 51200. I tried decreasing that i found that the error is coming exactly at 31375 char (this is did my setting the hl.maxAnalyzedChars to 31375. it worked fine till 31374). Solutions are most welcome as i am in great need of this. Sample query is as below
>
> http://localhost:8983/solr/test/select/?q=content:* AND obs_date:[2010-01-01T00:00:00Z%20TO%202011-12-31T23:59:59Z]fl=contenthl=truehl.fl=contenthl.snippets=1hl.fragsize=1500hl.requireFieldMatch=truehl.alternateField=contenthl.maxAlternateFieldLength=1500hl.maxAnalyzedChars=31375facet.limit=200facet.mincount=1start=64rows=1sort=obs_date%20desc
>
> Regards, Sawan
RE: TooManyClauses: maxClauseCount is set to 1024
Thanks Pravesh. But won't that hit query performance? And what would the ideal value be? Could this error come back even if we increase the value from 1024 to, say, 5120? I have tried increasing the value, and it hit performance.

Regards, Sawan

From: pravesh [via Lucene] [mailto:ml-node+s472066n4056966...@n3.nabble.com]
Sent: Thursday, April 18, 2013 5:06 PM
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

> Just increase the value of maxClauseCount in your solrconfig.xml. Keep it large enough.
>
> Best
> Pravesh

-- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4056968.html
Re: TooManyClauses: maxClauseCount is set to 1024
Update: Also remove your range queries from the main query and specify them as filter queries instead.

Best
Pravesh

-- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4056969.html
RE: TooManyClauses: maxClauseCount is set to 1024
Hi Yonik,

Thanks for your reply. I tried increasing maxClauseCount to a bigger value. But what would the ideal value be, and won't that hit performance? What are the chances that we will not face this issue again after increasing the value? As you asked, the full trace of the error is below:

Problem accessing /solr/ar/select/. Reason: maxClauseCount is set to 1024

org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:55)
at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:312)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:391)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Regards, Sawan

From: Yonik Seeley-4 [via Lucene] [mailto:ml-node+s472066n4056967...@n3.nabble.com]
Sent: Thursday, April 18, 2013 5:09 PM
To: Sawan Verma
Subject: Re: TooManyClauses: maxClauseCount is set to 1024

> Can you provide a full stack trace of the exception? There's a maxClauseCount in solrconfig.xml that you can increase to work around the issue.
>
> -Yonik
> http://lucidworks.com
>
> On Thu, Apr 18, 2013 at 7:31 AM, sawanverma [hidden email] wrote:
>
>> Its quite confusing about this error. I had a situation where i have to turn on the highlighting. In some cases though the number of docs found for a particular query was for example say 2, the highlighting was coming only for 1. I did some checks and found that that particular text searched was in the bigger document and was towards the end of the document. So i increased the hl.maxAnalyzedChar from default value of 51200 to a
RE: TooManyClauses: maxClauseCount is set to 1024
Yonik,

When I remove the sort part from the query below, it works fine; with sort it throws the exception.

http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc -- throws the exception

http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1 -- works fine

From the above it is clear that sort is causing the problem. Any idea why this is happening and how to fix it?

Regards, sawan

-- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4056974.html
Re: Solr using a ridiculous amount of memory
> You are missing an essential part: Both the facet and the sort structures need to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields.

Wow, thank you for this awesome explanation! This is where the penny dropped for me. I will definitely move to a multi-core setup. It will take some time and a lot of re-coding. As soon as I know the result, I will let you know!

-- Med venlig hilsen / Best regards

*John Nielsen* Programmer

*MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
stats.facet not working for timestamp field
Hi,

I am using Solr 4.1 with 6 shards. I want to find out some price stats for all the days in my index. I ended up using the stats component, like stats=true&stats.field=price&stats.facet=timestamp, but it throws an error like:

<str name="msg">Invalid Date String:' #1;#0;#0;#0;'[my(#0;'</str>

My question is: is timestamp supported as a stats.facet?

./zahoor
Re: zkState changes too often
On 16-Apr-2013, at 11:16 PM, Mark Miller markrmil...@gmail.com wrote:

> Are you using the concurrent low-pause garbage collector or perhaps G1?

I use the default one which comes with JDK 1.7.

> Are you able to use something like visualvm to pinpoint what the bottleneck might be?

Unfortunately, it is a prod machine and I could not replicate it locally.

> Otherwise, keep raising the timeout.

That's what I did now.. will see if it comes up in the next run..

./zahoor
Re: Max http connections in CloudSolrServer
I don't yet know if this is the reason... I am checking whether jetty has some limit on accepting connections..

./zahoor

On 18-Apr-2013, at 12:52 PM, J Mohamed Zahoor zah...@indix.com wrote:

> Thanks for this. The reason i asked this was.. when i fire 30 queries simultaneously from 30 threads using the same CloudSolrServer instance, some queries gets fired after a delay.. sometime the delay is 30-50 seconds... In solr logs i can see.. 20+ queries get fired almost immediately... but some of them gets fired late.. i increased the connections per host from 32 to 200.. still no respite...
>
> ./zahoor
>
> On 18-Apr-2013, at 12:20 AM, Shawn Heisey s...@elyograg.org wrote:
>
>> ModifiableSolrParams params = new ModifiableSolrParams();
>> params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 1000);
>> params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 200);
>> HttpClient client = HttpClientUtil.createClient(params);
>> LBHttpSolrServer lbServer = new LBHttpSolrServer(client, "http://localhost/solr");
>> lbServer.removeSolrServer("http://localhost/solr");
>> SolrServer server = new CloudSolrServer(zkHost, lbServer);
Re: zkState changes too often
On Apr 18, 2013, at 8:40 AM, jmozah jmo...@gmail.com wrote: On 16-Apr-2013, at 11:16 PM, Mark Miller markrmil...@gmail.com wrote: Are you using the concurrent low-pause garbage collector, or perhaps G1? I use the default one which comes with JDK 1.7. It varies by platform, but 99% that means you are using the throughput collector and you should try the CMS collector instead. - Mark Are you able to use something like VisualVM to pinpoint what the bottleneck might be? Unfortunately, it is a prod machine and I could not replicate it locally. Otherwise, keep raising the timeout. That's what I did now.. will see if it comes up in the next run.. ./zahoor
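For anyone following along, switching from the default throughput collector to the CMS collector that Mark suggests is done with JVM startup flags along these lines (a sketch for the bundled Jetty launcher; the occupancy values are illustrative starting points, not tuned recommendations):

```
# Replace the default throughput collector with CMS (concurrent low-pause).
# Occupancy settings below are illustrative, not tuned values.
java -XX:+UseConcMarkSweepGC \
     -XX:+CMSParallelRemarkEnabled \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar start.jar
```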
Re: Solr 4.2 fl issue
Hi, What is the issue though? :) Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 2:53 AM, William Bell billnb...@gmail.com wrote: We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6 is fine. Something like: fl=098765-765-788558-7654_userid as a stored string. The issue is when the GUID begins with a digit and then a minus. This is a bug. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: SolrCloud loadbalancing, replication, and failover
Correct. This is what you want if server 2 goes down. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote: Step 1: distribute processing We have 2 servers in which we'll run 2 SolrCloud instances on. We'll define 2 shards so that both servers are busy for each request (improving response time of the request). Step 2: Failover We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), that the other will be able to take over automatically. So we define 2 shards with a replication factor of 2. So we have: . Server 1: Shard 1, Replica 2 . Server 2: Shard 2, Replica 1 Question: But in SolrCloud, replicas are active right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request, after all, it has both shards due to the replication, right?
Re: Select Queries While Merging Indexes
If you understand the underlying Lucene searcher it will be easy to understand what's happening at the Solr level. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:22 AM, Furkan KAMACI furkankam...@gmail.com wrote: Thanks for the explanations. I should read up on the lifecycle of Searcher objects. Should I read about them in a Lucene book, or is there any Solr documentation or book that covers it? 2013/4/18 Jack Krupansky j...@basetechnology.com merging indexes The proper terminology is merging segments. Until the new, merged segment is complete, the existing segments remain untouched and readable. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 17, 2013 6:28 PM To: solr-user@lucene.apache.org Subject: Select Queries While Merging Indexes I see that while merging indexes (I mean optimizing via the admin GUI), my Solr instance can still respond to select queries (as well). How does that querying mechanism work (merging is not finished yet, but my Solr instance can still return a consistent response)?
RE: Tokenize on paragraphs and sentences
Thanks, Jack. Sorry, took me a while to reply :) It sounds like sentence/paragraph level searches won't be easy. Warm regards, Alex -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: 15 April 2013 5:09 PM To: solr-user@lucene.apache.org Subject: Re: Tokenize on paragraphs and sentences Technically, yes, but you would have to do a lot of work yourself. Like, a sentence/paragraph recognizer that inserted sentence and paragraph markers, and a query parser that allows you to do SpanNear and SpanNot (to selectively exclude sentence or paragraph marks based on your granularity of search.) The LucidWorks Search query parser has SpanNot support (or at least did at one point in time), but no sentence/paragraph marking. You could come up with some heuristic regular expressions for sentence and paragraph marks, like consecutive newlines for a paragraph and dot followed by white space for sentence (with some more heuristics for abbreviations.) Or you could have an update processor do the marking. -- Jack Krupansky -Original Message- From: Alex Cougarman Sent: Monday, April 15, 2013 9:48 AM To: solr-user@lucene.apache.org Subject: Tokenize on paragraphs and sentences Hi. Is it possible to search within paragraphs or sentences in Solr? The PatternTokenizerFactory uses regular expressions, but how can this be done with plain ASCII docs that don't have p tags (HTML), yet they're broken into paragraphs? Thanks. Warm regards, Alex
RE: SolrCloud loadbalancing, replication, and failover
But my concern is this, when we have just 2 servers: - I want 1 to be able to take over in case the other fails, as you point out. - But when *both* servers are up I don't want the SolrCloud load balancer to have Shard1 and Replica2 do the work (as they would both reside on the same physical server). Does that make sense? I want *both* server1 and server2 sharing the processing of every request, *and* I want the failover capability. I'm probably missing some bit of logic here, but I want to be sure I understand the architecture. Dave -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Thursday, April 18, 2013 8:13 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Correct. This is what you want if server 2 goes down. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote: Step 1: distribute processing We have 2 servers on which we'll run 2 SolrCloud instances. We'll define 2 shards so that both servers are busy for each request (improving response time of the request). Step 2: Failover We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), the other will be able to take over automatically. So we define 2 shards with a replication factor of 2. So we have: . Server 1: Shard 1, Replica 2 . Server 2: Shard 2, Replica 1 Question: But in SolrCloud, replicas are active right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request, after all, it has both shards due to the replication, right?
Re: SolrCloud loadbalancing, replication, and failover
Hi Dave, This sounds more like a budget / deployment issue vs. anything architectural. You want 2 shards with replication so you either need sufficient capacity on each of your 2 servers to host 2 Solr instances or you need 4 servers. You need to avoid starving Solr of necessary RAM, disk performance, and CPU regardless of how you lay out the cluster otherwise performance will suffer. My guess is if each Solr had sufficient resources, you wouldn't actually notice much difference in query performance. Tim On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote: But my concern is this, when we have just 2 servers: - I want 1 to be able to take over in case the other fails, as you point out. - But when *both* servers are up I don't want the SolrCloud load balancer to have Shard1 and Replica2 do the work (as they would both reside on the same physical server). Does that make sense? I want *both* server1 server2 sharing the processing of every request, *and* I want the failover capability. I'm probably missing some bit of logic here, but I want to be sure I understand the architecture. Dave -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Thursday, April 18, 2013 8:13 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Correct. This is what you want if server 2 goes down. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote: Step 1: distribute processing We have 2 servers in which we'll run 2 SolrCloud instances on. We'll define 2 shards so that both servers are busy for each request (improving response time of the request). Step 2: Failover We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), that the other will be able to take over automatically. So we define 2 shards with a replication factor of 2. So we have: . Server 1: Shard 1, Replica 2 . 
Server 2: Shard 2, Replica 1 Question: But in SolrCloud, replicas are active right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request, after all, it has both shards due to the replication, right?
more results when adding more criterias
Hi, I have a field which has data like this: letters, letters numbers, letters numbers letters numbers, where letters can be strings of 1 to 10 letters and numbers can have up to 4 digits. It is defined like this:

<field name="myField" type="myFieldType" indexed="true" stored="true" multiValued="true"/>
<fieldType name="myFieldType" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

When the user enters foo, I search for foo directly or something that starts with foo . I don't want to find fool or foop or anything like this. I also allow users to enter terms that they don't want to find. So a query for foo NOT foo 123 is converted to this: parsedquery_toString: +(+(myField:foo myField:foo *) +(-myField:foo 123 -myField:foo 123 * +*:*)), My problem is that this finds more entries than just foo, which converts to this: parsedquery_toString: +(myField:foo myField:foo *), I have read a bit about the internal Solr logic, using MUST, SHOULD and MUST_NOT, but still don't understand. When I look at parsedquery_toString: +(+(myField:foo myField:foo *) +(-myField:foo 123 -myField:foo 123 * +*:*)), then I see two criteria A and B, and both MUST be satisfied. Criteria A is the same as parsedquery_toString: +(myField:foo myField:foo *), so the number of results MUST be identical here. Since the final results must match both A and B, the number must be equal or lower than just A, right? Where is my thinking wrong? Thanks, Kai
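As a sanity check on that reasoning, the MUST / MUST_NOT combination can be sketched with plain sets (a deliberate simplification of Lucene's BooleanQuery that ignores scoring and analysis): subtracting the negated clause from a match-all can only shrink the positive result, so one would indeed expect the combined query to never return more hits than the positive clause alone — unless the two foo clauses are not actually matching the same set of documents.

```java
import java.util.*;

public class BooleanSketch {
    // Documents matching the positive clause, minus those matching the negated clause.
    // Mirrors +(positive) +(-negated +*:*): the second group is "all docs except negated".
    static Set<String> combine(Set<String> positive, Set<String> negated) {
        Set<String> result = new HashSet<>(positive);
        result.removeAll(negated);
        return result;
    }

    public static void main(String[] args) {
        Set<String> matchesFoo = new HashSet<>(Arrays.asList("doc1", "doc2", "doc3"));
        Set<String> matchesFoo123 = new HashSet<>(Collections.singletonList("doc2"));
        Set<String> combined = combine(matchesFoo, matchesFoo123);
        // The combined result can only be a subset of the positive clause's result.
        System.out.println(combined.size() <= matchesFoo.size()); // true
    }
}
```

If the real query returns *more* hits than the positive clause alone, the likely explanation is that the two queries are not matching identical documents at some other level (e.g. the wildcard term behaving differently), not that MUST_NOT adds results.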
solr4 : disable updateLog
Hi, if I disable (comment out) the updateLog block, will this affect indexing results? -- View this message in context: http://lucene.472066.n3.nabble.com/solr4-disable-updateLog-tp4056998.html Sent from the Solr - User mailing list archive at Nabble.com.
Paging and sorting in Solr
I have done paging using Solr's rows and start query attributes. But now it shows me results that are sorted page-wise. I mean, if I have the following scenario: rows=25&start=0&sort=manufacturer asc It gives me the first 25 matching results and then sorts only those. I want it to sort all the results first and then apply rows and start. How can I do that? -- View this message in context: http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Paging and sorting in Solr
I am sure it does the sorting first (since I have always done it that way). On 04/18/2013 02:49 PM, hassancrowdc wrote: I have done paging using Solr's rows and start query attributes. But now it shows me results that are sorted page-wise. I mean, if I have the following scenario: rows=25&start=0&sort=manufacturer asc It gives me the first 25 matching results and then sorts only those. I want it to sort all the results first and then apply rows and start. How can I do that? -- View this message in context: http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000.html Sent from the Solr - User mailing list archive at Nabble.com. -- Oussama Jilal
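For what it's worth, Solr applies the sort across the full result set before start/rows slice out a page; the order of operations is equivalent to this sketch (plain-Java model of the behavior, not Solr code):

```java
import java.util.*;

public class PagingSketch {
    // Sort the complete result list first, then slice out one page (start, rows),
    // which is what Solr's start/rows parameters do after sorting.
    static List<String> page(List<String> results, int start, int rows) {
        List<String> sorted = new ArrayList<>(results);
        Collections.sort(sorted);
        int from = Math.min(start, sorted.size());
        int to = Math.min(start + rows, sorted.size());
        return sorted.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> manufacturers = Arrays.asList("zeta", "acme", "nova", "beta");
        // Page 1 of size 2 comes from the globally sorted list, not a pre-sliced one.
        System.out.println(page(manufacturers, 0, 2)); // [acme, beta]
    }
}
```

So if results look "sorted page-wise", the cause is usually the field's type (see the copyField advice later in this thread), not the paging itself.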
Re: Solr 4.2 fl issue
When using a field name that doesn't follow conventions (basically like Java identifiers), try this: fl=field(098765-765-788558-7654_userid) Or enclose it in quotes if it's really a whacky field name: fl=field("098765-765-788558-7654_userid") -Yonik http://lucidworks.com On Thu, Apr 18, 2013 at 2:52 AM, William Bell billnb...@gmail.com wrote: We are getting an issue when using a GUID for a field in Solr 4.2. Solr 3.6 is fine. Something like: fl=098765-765-788558-7654_userid as a stored string. The issue is when the GUID begins with a digit and then a minus. This is a bug. -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Paging and sorting in Solr
Hi, I double checked. It is the field. If I sort by the manufacturer field it sorts, but if I sort by name it does not sort. Both fields have everything the same. Is there any difference in sorting alphabetically, or by the size of the word? -- View this message in context: http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000p4057013.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr indexing
Solr is not showing the dates I have in the database. Any help? Is Solr following any specific timezone? In my database my date is 2013-04-18 11:29:33 but Solr shows me 2013-04-18T15:29:33Z. Any help? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-tp4057017.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing
On Apr 18, 2013, at 10:49 AM, hassancrowdc hassancrowdc...@gmail.com wrote: Solr is not showing the dates I have in the database. Any help? Is Solr following any specific timezone? In my database my date is 2013-04-18 11:29:33 but Solr shows me 2013-04-18T15:29:33Z. Any help? Solr knows nothing of timezones. Solr expects everything to be in UTC. If you want time zone support, you'll have to convert local time to UTC before importing, and then convert back to local time from UTC when you read from Solr. xoa -- Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
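A minimal sketch of the conversion Andy describes, using java.time (Java 8+; the zone America/New_York is an assumption that happens to match the 4-hour offset in the example, not something stated by the poster):

```java
import java.time.*;
import java.time.format.DateTimeFormatter;

public class UtcConvert {
    // Convert a local database timestamp to the UTC form Solr expects
    // (ISO-8601 with a trailing Z).
    static String toSolrUtc(String local, ZoneId zone) {
        LocalDateTime ldt = LocalDateTime.parse(
                local, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        return ldt.atZone(zone)
                  .withZoneSameInstant(ZoneOffset.UTC)
                  .format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'"));
    }

    public static void main(String[] args) {
        // 2013-04-18 11:29:33 local time at UTC-4 (EDT) -> 2013-04-18T15:29:33Z
        System.out.println(toSolrUtc("2013-04-18 11:29:33", ZoneId.of("America/New_York")));
    }
}
```

The reverse trip (UTC back to local) is the same call with the zones swapped.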
Re: Paging and sorting in Solr
Maybe you have your name field as text rather than string. Don't try sorting text fields - make a copy (copyField) to a string field and sort the string field. So, for example, have name as text for keyword search, and name_s as string for sorting (and faceting.) -- Jack Krupansky -Original Message- From: hassancrowdc Sent: Thursday, April 18, 2013 11:35 AM To: solr-user@lucene.apache.org Subject: Re: Paging and sorting in Solr Hi, I double checked. It is the field. if i sort through manufacturer field it sorts but if i sort through name it does not sort. both the field has everything same. Is there any difference in sorting alphabetically or size of the word? -- View this message in context: http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000p4057013.html Sent from the Solr - User mailing list archive at Nabble.com.
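The copyField arrangement Jack describes would look something like this in schema.xml (the field and type names here are illustrative, not taken from the poster's schema):

```xml
<!-- text field for keyword search; string copy for sorting and faceting -->
<field name="name"   type="text_general" indexed="true" stored="true"/>
<field name="name_s" type="string"       indexed="true" stored="false"/>
<copyField source="name" dest="name_s"/>
```

Queries then search on name but sort with sort=name_s asc.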
Re: facet.method enum vs fc
20G is allocated to Solr already. Ming On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote: On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote: I am doing faceting on an index of 120M documents, on the field of url[...] I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr? - Toke Eskildsen
Re: Solr indexing
Solr dates are always Z, GMT. -- Jack Krupansky -Original Message- From: hassancrowdc Sent: Thursday, April 18, 2013 11:49 AM To: solr-user@lucene.apache.org Subject: Solr indexing Solr is not showing the dates i have in database. any help? is solr following any specific timezone? On my database my date is 2013-04-18 11:29:33 but solr shows me 2013-04-18T15:29:33Z. Any help -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-tp4057017.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: TooManyClauses: maxClauseCount is set to 1024
On 4/18/2013 6:02 AM, sawanverma wrote: Hi Yonik, Thanks for your reply. I tried increasing the maxClauseCount to a bigger value. But what could be the ideal value and will not that hit the performance? What are the chances that if we increase the value we will not face this issue again? Changing the maxBooleanClauses value does not affect performance. It's just an arbitrary limit on query complexity. You can make it as big as you want and Solr's performance will not change. For most people, 1024 is plenty. For others, we have no idea how many clauses are needed. The queries themselves with large numbers of clauses are what affects performance, and the only way to improve it is to decrease the query complexity. Chances are good that you are already experiencing the performance hit associated with large queries. Adding more clauses to a query will reduce performance. If you find yourself in a situation where you continually need more boolean clauses, you may need to start over and create a better design. The maxBooleanClauses value is just a safety net, created long ago when Lucene worked differently than it does now. There is a discussion currently happening among committers about whether that limit even needs to exist. Very likely the limit in Solr will be removed in the near future. Thanks, Shawn
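For reference, the safety-net limit Shawn mentions is configured in solrconfig.xml; raising it looks like this (4096 is just an example value, not a recommendation):

```xml
<!-- Upper bound on the number of clauses in a BooleanQuery; purely a safety net -->
<maxBooleanClauses>4096</maxBooleanClauses>
```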
Re: Max http connections in CloudSolrServer
On 4/18/2013 6:42 AM, J Mohamed Zahoor wrote: I don't yet know if this is the reason... I am looking at whether Jetty has some limit on accepting connections.. Are you using the Jetty included with Solr, or a Jetty installed separately? The Jetty included with Solr has a maxThreads value of 10000 in its config. The default would be closer to 200, and a single request from a Cloud client likely uses multiple Jetty threads. Thanks, Shawn
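For reference, in the Jetty bundled with the Solr example the thread limit is set in etc/jetty.xml roughly like this (recalled from the 4.x example config, so check your own copy):

```xml
<!-- thread pool configuration in etc/jetty.xml of the Solr example -->
<Set name="maxThreads">10000</Set>
```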
Re: solr 3.5 core rename issue
Yeah, I realize using ${solr.core.name} for dataDir must be the cause of the issue we see... it is fair to say the SWAP and RENAME just create an alias that still points to the old dataDir. If they cannot fix it then it is not a bug :-) at least we understand exactly what is going on there. Thanks so much for your help! Jie -- View this message in context: http://lucene.472066.n3.nabble.com/solr-3-5-core-rename-issue-tp4056425p4057037.html Sent from the Solr - User mailing list archive at Nabble.com.
shard query return 500 on large data set
Hi - when I execute a shard query like: [myhost]:8080/solr/mycore/select?q=type:message&rows=14...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore everything works fine until I query against a large set of data (> 100k documents), when the number of rows returned exceeds about 50k. By the way, I am using the HttpClient GET method to send the Solr shard query over. In the above scenario, the query fails with a 500 server error as the returned status code. I am using Solr 3.5. I encountered a 404 before: when one of the shard servers does not have the core (404), the whole shard query will return 404 to me; so I expect that if one of the servers encounters a timeout (408?), the shard query should return a timeout status code? I guess I am not sure what the shard query results will be in various error scenarios... I guess I could look into the Solr code, but if you have any input, it will be appreciated. Thanks, Renee -- View this message in context: http://lucene.472066.n3.nabble.com/shard-query-return-500-on-large-data-set-tp4057038.html Sent from the Solr - User mailing list archive at Nabble.com.
Sorting on alias fields
Hi all, I am trying to sort results based on multiple fields aliased as one. Is that possible? While Solr does not complain (no error, results OK, etc.) it fails to sort the hits appropriately. I've attached the query, relevant schema part and result. I am very curious to know if that is a feature that is currently supported (sorting on aliases). Cheers, _Stephane

schema:
<dynamicField name="*_sort" required="false" type="date" indexed="true" stored="true" multiValued="false"/>
<field name="last_modified" type="date" indexed="true" stored="true" multiValued="false"/>
<field name="modified" type="date" indexed="true" stored="true" multiValued="false"/>

query: http://localhost:8983/select?q.alt=*%3A*&q.op=OR&rows=10&start=0&qt=%2Fselect&q=&qf=title_search%5E1.0&pf=title_search%5E1.0&fl=last_modified%2Cmodified%2Cmodified_sort%3Alast_modified%2Cmodified_sort%3Amodified&debugOther=1&debug=1&debugQuery=true&sort=modified_sort+desc

result (truncated):
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="sort">modified_sort desc</str>
      <str name="qf">title_search^1.0</str>
      <str name="q.alt">*:*</str>
      <str name="debugOther">1</str>
      <str name="debug">1</str>
      <str name="rows">10</str>
      <str name="pf">title_search^1.0</str>
      <str name="fl">last_modified,modified,modified_sort:last_modified,modified_sort:modified</str>
      <str name="debugQuery">true</str>
      <str name="start">0</str>
      <str name="q"/>
      <str name="q.op">OR</str>
      <str name="qt">/select</str>
    </lst>
  </lst>
  <result name="response" numFound="17" start="0">
    <doc>
      <date name="last_modified">2013-04-12T00:00:00Z</date>
      <date name="modified_sort">2013-04-12T00:00:00Z</date>
    </doc>
    <doc>
      <date name="last_modified">2007-10-18T00:00:00Z</date>
      <date name="modified">2007-10-18T00:00:00Z</date>
      <date name="modified_sort">2007-10-18T00:00:00Z</date>
    </doc>
    <doc>
      <date name="last_modified">2013-04-12T00:00:00Z</date>
      <date name="modified_sort">2013-04-12T00:00:00Z</date>
    </doc>
    <doc>
Re: SolrCloud vs Solr master-slave replication
Run checksums on all files in both master and slave, and verify that they are the same. TCP/IP has a checksum algorithm that was state-of-the-art in 1969. On 04/18/2013 02:10 AM, Victor Ruiz wrote: Also, I forgot to say... the same error started to happen again.. the index is again corrupted :( -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-vs-Solr-master-slave-replication-tp4055541p4056926.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: TooManyClauses: maxClauseCount is set to 1024
Shawn, Thanks a lot for your reply. But I am confused again as to whether the following query is complex. http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc Is that because of content:*? The only unusual thing is the size of the content field. In this particular case the content field has enormously big data, since this problem comes only when we do a search on * for the content field. Is there a way that we can split the doc size? Regards, Sawan From: Shawn Heisey-4 [via Lucene] [mailto:ml-node+s472066n4057027...@n3.nabble.com] Sent: 18 April 2013 PM 09:38 To: Sawan Verma Subject: Re: TooManyClauses: maxClauseCount is set to 1024 On 4/18/2013 6:02 AM, sawanverma wrote: Hi Yonik, Thanks for your reply. I tried increasing the maxClauseCount to a bigger value. But what could be the ideal value and will not that hit the performance? What are the chances that if we increase the value we will not face this issue again? Changing the maxBooleanClauses value does not affect performance. It's just an arbitrary limit on query complexity. You can make it as big as you want and Solr's performance will not change. For most people, 1024 is plenty. For others, we have no idea how many clauses are needed. The queries themselves with large numbers of clauses are what affects performance, and the only way to improve it is to decrease the query complexity. Chances are good that you are already experiencing the performance hit associated with large queries. Adding more clauses to a query will reduce performance. If you find yourself in a situation where you continually need more boolean clauses, you may need to start over and create a better design. The maxBooleanClauses value is just a safety net, created long ago when Lucene worked differently than it does now. There is a discussion currently happening among committers about whether that limit even needs to exist.
Very likely the limit in Solr will be removed in the near future. Thanks, Shawn -- View this message in context: http://lucene.472066.n3.nabble.com/TooManyClauses-maxClauseCount-is-set-to-1024-tp4056965p4057060.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Elevation Component
I want to elevate certain documents differently depending on a certain fq parameter in the request. I've read of somebody coding Solr to do this but no code was shared. Where would I start looking to implement this feature myself? -- View this message in context: http://lucene.472066.n3.nabble.com/Query-Elevation-Component-tp4056856p4057065.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: TooManyClauses: maxClauseCount is set to 1024
On 4/18/2013 11:53 AM, sawanverma wrote: Shawn, Thanks a lot for your reply. But I am confused again as to whether the following query is complex. http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc I hardly know anything about highlighting, so nothing that I say here may have any relevance to your situation at all. A query of content:* strikes me as an invalid query. If you are shooting for all documents where content exists and excluding those where it doesn't exist, I would think that 'q=content:[* TO *]' (the TO must be uppercase) would be a better option. Exactly how your query gets expanded into something that exceeds maxBooleanClauses is a complete mystery to me, and probably does have something to do with the highlighting. Thanks, Shawn
updating documents unintentionally adds extra values to certain fields
Hi, I am using Solr 4.2 and have set up the spatial search config as below: http://wiki.apache.org/solr/SpatialSearch#Schema_Configuration But every time I make an update to a document, http://wiki.apache.org/solr/UpdateJSON#Updating_a_Solr_Index_with_JSON more values of the *_coordinates fields get inserted, even though the field was not set to multiValued. This behavior doesn't happen to any of the other fields. Any ideas how to avoid adding extra values to the *_coordinates fields on updates?
Making fields unavailable for return to specific end points.
We have a few internal fields that we would like to restrict from being returned in result sets. I have seen how fl is used to specify fields that you do want returned; I am kind of looking for the opposite. There are just a few fields that don't make sense to return to our clients. Is there any functionality for a blocked fl? Thank you! -- Andrew NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
RE: Making fields unavailable for return to specific end points.
Hmm... Just found this JIRA: https://issues.apache.org/jira/browse/SOLR-3191 I think I have answered my question. -Original Message- From: Andrew Lundgren [mailto:lundg...@familysearch.org] Sent: Thursday, April 18, 2013 1:21 PM To: solr-user@lucene.apache.org Subject: Making fields unavailable for return to specific end points. We have a few internal fields that we would like to restrict from being returned in result sets. I have seen how fl is used to specify fields that you do want returned; I am kind of looking for the opposite. There are just a few fields that don't make sense to return to our clients. Is there any functionality for a blocked fl? Thank you! -- Andrew NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
Change the response of delta import
Is there any way I can change the response XML from the delta import query: localhost:8080/solr/devices/dataimport?command=delta-import&commit=true I want to change the response. -- View this message in context: http://lucene.472066.n3.nabble.com/Change-the-response-of-delta-import-tp4057093.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Change the response of delta import
On 4/18/2013 1:59 PM, hassancrowdc wrote: Is there any way I can change the response XML from the delta import query: localhost:8080/solr/devices/dataimport?command=delta-import&commit=true I want to change the response. The response is created by the DataImportHandler source code. It's a contrib module included with Solr. You can change that code and recompile, then replace your dataimporthandler jar with the new one. Thanks, Shawn
Re: Paging and sorting in Solr
thnx -- View this message in context: http://lucene.472066.n3.nabble.com/Paging-and-sorting-in-Solr-tp4057000p4057098.html Sent from the Solr - User mailing list archive at Nabble.com.
PositionLengthAttribute - Does it do anything at all?
I've been playing around with the PositionLengthAttribute for a few days, and it doesn't seem to have any effect at all. I'm aware that position length is not stored in the index, as explained in this blog post. http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html However, even when used at query time it doesn't seem to do anything. Let's take the following token stream as an example.

text: he      posInc: 1  posLen: 1
text: cannot  posInc: 1  posLen: 2
text: can     posInc: 0  posLen: 1
text: not     posInc: 1  posLen: 1
text: help    posInc: 1  posLen: 1

If we were to construct this graph of tokens, it should match the phrases "he can not help" and "he cannot help". According to my testing, it will match the phrases "he can not help" and "he cannot not help", because the position length is entirely ignored and treated as if it is always 1. Am I misunderstanding how these attributes work? - Hayden
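One way to sanity-check the intended semantics: treat each token as an edge from position p to p + posLen, so a phrase matches if its words form a contiguous path through the graph. Under that model the stream above admits both "he cannot help" and "he can not help" but not "he cannot not help" (this sketches what posLen is *supposed* to mean, not necessarily what the query-time code actually does):

```java
import java.util.*;

public class TokenGraph {
    static class Token {
        final String text; final int start; final int len;
        Token(String text, int start, int len) {
            this.text = text; this.start = start; this.len = len;
        }
    }

    // A phrase matches if its words chain start-to-end: each word's token
    // starts where the previous word's token ended (start + posLen).
    static boolean matches(List<Token> graph, String[] phrase) {
        return search(graph, phrase, 0, -1);
    }

    private static boolean search(List<Token> graph, String[] phrase, int i, int atPos) {
        if (i == phrase.length) return true;
        for (Token t : graph) {
            if (t.text.equals(phrase[i]) && (atPos == -1 || t.start == atPos)
                    && search(graph, phrase, i + 1, t.start + t.len)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // he@0(len 1), cannot@1(len 2), can@1(len 1), not@2(len 1), help@3(len 1)
        List<Token> g = Arrays.asList(
            new Token("he", 0, 1), new Token("cannot", 1, 2),
            new Token("can", 1, 1), new Token("not", 2, 1), new Token("help", 3, 1));
        System.out.println(matches(g, new String[]{"he", "cannot", "help"}));      // true
        System.out.println(matches(g, new String[]{"he", "can", "not", "help"}));  // true
        System.out.println(matches(g, new String[]{"he", "cannot", "not", "help"})); // false
    }
}
```

If the observed matching is "he cannot not help" instead, that is consistent with posLen being flattened to 1 everywhere, as Hayden describes.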
Re: What are the pros and cons Having More Replica at SolrCloud
re: more replicas - pro: you can scale your query processing workload because you have more nodes available to service queries, eg 1,000 QPS sent to Solr with 5 replicas, then each is only processing roughly 200 QPS. If you need to scale up to 10K QPS, then add more replicas to distribute the increased workload con: additional overhead (mostly network I/O) when indexing, shard leader has to send N additional requests per update where N is the number of replicas per shard. This seems minor unless you have many replicas per shard. I can't think of any cons of having more replicas on the query side As for your other question, when the leader receives an update request, it forwards to all replicas in the active or recovering state in parallel and waits for their response before responding to the client. All replicas must accept the update for it to be considered successful, i.e. all replicas and the leader must be in agreement on the status of a request. This is why you hear people referring to Solr as favoring consistency over write-availability. If you have 10 active replicas for a shard, then all 10 must accept the update or it fails, there's no concept of tunable consistency on a write in Solr. Failed / offline replicas are obviously ignored and they will sync up with the leader once they are back online. Cheers, Tim On Thu, Apr 18, 2013 at 4:48 PM, Furkan KAMACI furkankam...@gmail.comwrote: What are the pros and cons Having More Replica at SolrCloud? Also there is a point that I want to learn. When a request come to a leader. Does it forwards it to a replica. And if forwards it to replica, does replica works parallel to build up the index with other replicas of its same leader?
Updating clusterstate from the zookeeper
Hello, After creating a distributed collection on several different servers, I sometimes have to deal with failing servers (cores appear not available = grey) or failing cores (down / unable to recover = brown / red). If I wish to delete this erroneous collection (through the collection API), only the green nodes get erased, leaving a meaningless, unavailable collection in the clusterstate.json. Is there any way to edit the clusterstate.json explicitly? If not, how do I update it so that the collection above gets deleted? Cheers, Manu
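For Solr 4.x, one way to overwrite clusterstate.json directly is the zkcli script that ships in Solr's cloud-scripts directory (the ZooKeeper host and file paths below are assumptions for illustration; editing cluster state by hand is risky, so keep a backup of the original file):

```shell
# Download the current cluster state from ZooKeeper (host/port assumed).
./zkcli.sh -zkhost localhost:2181 -cmd getfile /clusterstate.json clusterstate.json

# Edit clusterstate.json locally to remove the dead collection,
# then push the corrected file back into ZooKeeper.
./zkcli.sh -zkhost localhost:2181 -cmd putfile /clusterstate.json clusterstate.json
```

Restarting the nodes (or reloading collections) afterwards makes them pick up the corrected state.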
Re: What are the pros and cons Having More Replica at SolrCloud
On the query side, another downside I see would be that for a given memory pool, you'd have to share it with more cores, because every replica uses its own cache. True for the inner Solr caching (JVM's heap) and OS caching as well. Adding a replicated core creates a new data set (index) that will be accessed while queried. If your replication adds a core of shard1 on a server that holds only shard2, the OS caching and Solr caching would have to share the RAM between totally different memory regions (as files and query results for different shards are different), so that case is clear. In the second case, if you add a replicated core to a server that already contains shard1, I'm not sure. There might be benefits if the JVM handles its caches per shard and not per core, but the OS cache would differentiate between the different replications of the same index and try to keep both sets of index files in memory. So if you're short on memory, or queries are alike (have a high hit ratio), you may make better use of your RAM than by splitting it across many replications. Cheers, Manu On Fri, Apr 19, 2013 at 3:08 AM, Timothy Potter thelabd...@gmail.comwrote: re: more replicas - pro: you can scale your query processing workload because you have more nodes available to service queries, eg 1,000 QPS sent to Solr with 5 replicas, then each is only processing roughly 200 QPS. If you need to scale up to 10K QPS, then add more replicas to distribute the increased workload con: additional overhead (mostly network I/O) when indexing, shard leader has to send N additional requests per update where N is the number of replicas per shard. This seems minor unless you have many replicas per shard. I can't think of any cons of having more replicas on the query side As for your other question, when the leader receives an update request, it forwards to all replicas in the active or recovering state in parallel and waits for their response before responding to the client.
Re: Solr system and numbers
If I want to search on subsets of a number, what can I do? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-system-and-numbers-tp482519p4057134.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr system and numbers
Do you mean a range (e.g. [4 TO 17]) or a prefix (e.g. 10*)? For a range you need to index it as a number. For a prefix, string is probably better. Then, just use standard query parameters. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Thu, Apr 18, 2013 at 9:29 PM, uohzoaix johncho...@gmail.com wrote: if i wanna search on subsets of number, what can i do?
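To illustrate why the field type matters for ranges, here is a small sketch (plain Python, illustrative only) of how a lexicographic range over stringified numbers diverges from a true numeric range:

```python
# Illustrative only: a lexicographic range over stringified numbers vs. a
# numeric range. This is why range search needs a numeric field type.
vals = [2, 9, 10, 17]

numeric_range = [v for v in vals if 4 <= v <= 17]               # [9, 10, 17]
string_range = [s for s in map(str, vals) if "4" <= s <= "17"]  # [] -- "9" > "17" as strings
```

As strings, "9" sorts after "17" and "10" sorts before "4", so a string-typed [4 TO 17] range silently returns nothing useful.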
Re: Solr indexing
You can just change the date field type to string.
RE: SolrCloud loadbalancing, replication, and failover
I think I still don't understand something here. My concern right now is that query times are very slow for a 120GB index (14s on avg), and I've seen a lot of disk activity when running queries. I'm hoping that distributing that query across 2 servers is going to improve the query time; specifically, I'm hoping that we can distribute that disk activity, because we don't have great disks on there (yet). So, with disk IO being a factor in mind, running the query on one box vs. across 2 *should* be a concern, right? Admittedly, this is the first step in what will probably be many to try to work our query times down from 14s to what I want to be around 1s. Dave -Original Message- From: Timothy Potter [mailto:thelabd...@gmail.com] Sent: Thursday, April 18, 2013 9:16 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Hi Dave, This sounds more like a budget / deployment issue vs. anything architectural. You want 2 shards with replication, so you either need sufficient capacity on each of your 2 servers to host 2 Solr instances or you need 4 servers. You need to avoid starving Solr of necessary RAM, disk performance, and CPU regardless of how you lay out the cluster; otherwise performance will suffer. My guess is if each Solr had sufficient resources, you wouldn't actually notice much difference in query performance. Tim On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote: But my concern is this, when we have just 2 servers: - I want 1 to be able to take over in case the other fails, as you point out. - But when *both* servers are up, I don't want the SolrCloud load balancer to have Shard1 and Replica2 do the work (as they would both reside on the same physical server). Does that make sense? I want *both* server1 and server2 sharing the processing of every request, *and* I want the failover capability. I'm probably missing some bit of logic here, but I want to be sure I understand the architecture.
Dave -Original Message- From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] Sent: Thursday, April 18, 2013 8:13 PM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover Correct. This is what you want if server 2 goes down. Otis Solr ElasticSearch Support http://sematext.com/ On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote: Step 1: distribute processing We have 2 servers in which we'll run 2 SolrCloud instances on. We'll define 2 shards so that both servers are busy for each request (improving response time of the request). Step 2: Failover We would now like to ensure that if either of the servers goes down (we're very unlucky with disks), that the other will be able to take over automatically. So we define 2 shards with a replication factor of 2. So we have: . Server 1: Shard 1, Replica 2 . Server 2: Shard 2, Replica 1 Question: But in SolrCloud, replicas are active right? So isn't it now possible that the load balancer will have Server 1 process *both* parts of a request, after all, it has both shards due to the replication, right?
DirectSolrSpellChecker : vastly varying spellcheck QTime times.
Hi! I am using SOLR 4.2.1. My solrconfig.xml contains the following:

<searchComponent name="MySpellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">MySpellchecker</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">3</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="df">id</str>
    <str name="spellcheck.dictionary">MySpellchecker</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">10</str>
    <str name="spellcheck.maxResultsForSuggest">35</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">false</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">1</str>
    <str name="spellcheck.collateParam.q.op">AND</str>
  </lst>
  <arr name="last-components">
    <str>MySpellcheck</str>
  </arr>
</requestHandler>

My schema.xml, with the spell field, looks like:

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spell"/>
<copyField source="artist" dest="spell"/>

My query:

http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate factory", which exists in my spell field index. I see a QTime from the above query of somewhere between 350-400ms. I run a similar query, replacing the spellcheck terms with "pursut hapyness" (where "pursuit happyness" actually exists in my spell field), and I see a QTime of 15-17ms. Both queries produce collations correctly, but there is an order of magnitude difference in QTime. There is one edit per term in both cases, or 2 edits in each query. The lengths of the words in both these queries seem identical. I'd like to understand why there is this vast difference in QTime. I would appreciate any help with this, since I am not sure how I can get any meaningful performance numbers and attribute the slowness to anything in particular. I also see a vast difference in QTime in another case. Replace the search terms in the above query with "over cuckoo's nest", "over cuccoo's nst", etc. "over cuckoo's nest" exists in my indexed spell field, so it should find it almost immediately. Yet this query fails to produce any collation and takes 10 seconds, while the second query, "over cuccoo's nst", corrects the phrase and returns in 24ms. Something does not sound right here. I would appreciate help with these. Thanks in advance. Regards, -- Sandeep
Re: SolrCloud loadbalancing, replication, and failover
On 4/18/2013 8:12 PM, David Parks wrote: I think I still don't understand something here. My concern right now is that query times are very slow for 120GB index (14s on avg), I've seen a lot of disk activity when running queries. I'm hoping that distributing that query across 2 servers is going to improve the query time, specifically I'm hoping that we can distribute that disk activity because we don't have great disks on there (yet). So, with disk IO being a factor in mind, running the query on one box, vs. across 2 *should* be a concern right? Admittedly, this is the first step in what will probably be many to try to work our query times down from 14s to what I want to be around 1s. I went through my mailing list archive to see what all you've said about your setup. One thing that I can't seem to find is a mention of how much total RAM is in each of your servers. I apologize if it was actually there and I overlooked it. In one email thread, you wanted to know whether Solr is CPU-bound or IO-bound. Solr is heavily reliant on the index on disk, and disk I/O is the slowest piece of the puzzle. The way to get good performance out of Solr is to have enough memory that you can take the disk mostly out of the equation by having the operating system cache the index in RAM. If you don't have enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy in iowait, unable to do much real work. If you DO have enough RAM to cache all (or most) of your index, then Solr will be CPU-bound. With 120GB of total index data on each server, you would want at least 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and that Solr is the only thing running on the machine. If you have more servers and shards, you can reduce the per-server memory requirement because the amount of index data on each server would go down. I am aware of the cost associated with this kind of requirement - each of my Solr servers has 64GB. 
If you are sharing the server with another program, then you want to have enough RAM available for Solr's heap, Solr's data, the other program's heap, and the other program's data. Some programs (like MySQL) completely skip the OS disk cache and instead do that caching themselves with heap memory that's actually allocated to the program. If you're using a program like that, then you wouldn't need to count its data. Using SSDs for storage can speed things up dramatically and may reduce the total memory requirement to some degree, but even an SSD is slower than RAM. The transfer speed of RAM is faster, and from what I understand, the latency is at least an order of magnitude quicker - nanoseconds vs microseconds. In another thread, you asked about how Google gets such good response times. Although Google's software probably works differently than Solr/Lucene, when it comes right down to it, all search engines do similar jobs and have similar requirements. I would imagine that Google gets incredible response time because they have incredible amounts of RAM at their disposal that keep the important bits of their index instantly available. They have thousands of servers in each data center. I once got a look at the extent of Google's hardware in one data center - it was HUGE. I couldn't get in to examine things closely, they keep that stuff very locked down. Thanks, Shawn
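The sizing advice above can be condensed into a rough heuristic (my own condensation of Shawn's rule of thumb, not an official formula):

```python
# Assumed heuristic, condensed from the advice above (not an official
# formula): the OS page cache should ideally hold the whole index
# alongside the Solr heap and any co-located programs' memory.
def recommended_ram_gb(index_size_gb, solr_heap_gb, other_procs_gb=0):
    return index_size_gb + solr_heap_gb + other_procs_gb

# A 120GB index with an 8GB Solr heap suggests roughly 128GB per server.
ram = recommended_ram_gb(index_size_gb=120, solr_heap_gb=8)
```

Sharding reduces the index_size_gb term per server, which is why adding servers lowers the per-server RAM requirement.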
RE: TooManyClauses: maxClauseCount is set to 1024
Shawn, Giving content:[* TO *] gives the same error, but when I give content:[a TO z] it works fine. Can you please explain what it means when I give content:[a TO z]? Can I use this as a workaround? The datatype of the content field is text_en. Thanks again for your replies and suggestions. Regards, Sawan From: Shawn Heisey-4 [via Lucene] [mailto:ml-node+s472066n4057074...@n3.nabble.com] Sent: Friday, April 19, 2013 12:33 AM To: Sawan Verma Subject: Re: TooManyClauses: maxClauseCount is set to 1024 On 4/18/2013 11:53 AM, sawanverma wrote: Shawn, Thanks a lot for your reply. But I am confused again if the following query is complex. http://localhost:8983/solr/test/select/?q=content:*&fl=content&hl=true&hl.fl=content&hl.maxAnalyzedChars=31375&start=64&rows=1&sort=obs_date%20desc I hardly know anything about highlighting, so nothing that I say here may have any relevance to your situation at all. A query of content:* strikes me as an invalid query. If you are shooting for all documents where content exists and excluding those where it doesn't exist, I would think that 'q=content:[* TO *]' (the TO must be uppercase) would be a better option. Exactly how your query gets expanded into something that exceeds maxBooleanClauses is a complete mystery to me, and probably does have something to do with the highlighting. Thanks, Shawn
Re: TooManyClauses: maxClauseCount is set to 1024
On 4/18/2013 11:02 PM, sawanverma wrote: Giving content:[* TO *] gives the same error but when I give content:[a TO z] it works fine. Can you please explain what does it mean when I give content:[a TO z]? Can I use this as workaround? The datatype of content field is text_en. That syntax is a range query. The [* TO *] basically means that you are requesting all documents where the content field exists (has a value). It's not very likely that [a TO z] will include all possible documents - it would not include a value like zap for instance, because alphabetically, that is after z. I am a little bit confused - why would you want to do highlighting on a query that matches all documents that contain the content field, or even all documents? The point of highlighting is to show the parts of the text that matched your query text, but you don't have any query text. I think it may be time to back up and tell us what you want to actually accomplish, rather than trying to deal directly with the error message. Because it has to do with highlighting, I may not be able to help, but there are plenty of very smart people here who do understand highlighting. Thanks, Shawn
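A quick sketch (plain Python, illustrative only) of the lexicographic comparison behind [a TO z]:

```python
# "zap" sorts after "z" lexicographically, so [a TO z] excludes it even
# though it starts with "z"; [* TO *] (field exists) has no such gap.
terms = ["apple", "z", "zap"]
in_range = [t for t in terms if "a" <= t <= "z"]  # ["apple", "z"]
```

This is why [a TO z] "works" only by accident: it silently drops any term that sorts after the single character "z".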
RE: SolrCloud loadbalancing, replication, and failover
Wow! That was the most pointed, concise discussion of hardware requirements I've seen to date, and it's fabulously helpful. Thank you, Shawn! We currently have 2 servers on which I can dedicate about 12GB of RAM to Solr (we're moving to these 2 servers now). I can upgrade further if it's justified, and your discussion helps me justify that such an upgrade is the right thing to do. So... if I move to 3 servers with 50GB of RAM each, using 3 shards, I should be in the free and clear then, right? This seems reasonable and doable. In this more extreme example, the failover properties of SolrCloud become more clear. I couldn't possibly run a replica shard without doubling the memory, so really replication isn't reasonable until I have double the hardware; then the load balancing scheme makes perfect sense. With 3 servers, 50GB of RAM each, and a 120GB index, I should just back up the index directory, I think. My previous thought to run replication just for failover would have actually resulted in LOWER performance, because I would have halved the memory available to the master replica. So the previous question is answered as well now. Question: if I had 1 server with 60GB of memory and a 120GB index, would Solr make full use of the 60GB of memory, thus trimming disk access in half? Or is it an all-or-nothing thing? In a dev environment, I didn't notice Solr consuming the full 5GB of RAM assigned to it with a 120GB index. Dave -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Friday, April 19, 2013 11:51 AM To: solr-user@lucene.apache.org Subject: Re: SolrCloud loadbalancing, replication, and failover On 4/18/2013 8:12 PM, David Parks wrote: I think I still don't understand something here. My concern right now is that query times are very slow for 120GB index (14s on avg), I've seen a lot of disk activity when running queries.