Re: Solr using a ridiculous amount of memory
John: If you'd like to add your experience to the Wiki, create an ID and let us know what it is, and we'll add you to the contributors list. Unfortunately we had problems with spam pages, so we added this step. Make sure you include your logon in the request.

Thanks,
Erick
Re: Solr using a ridiculous amount of memory
It was interesting to read this post. I had a similar issue on Solr v4.2.1. Our documents have huge multiValued fields, and we were able to knock out our server in about 30 minutes. We then found a bug, LUCENE-4995, which was causing all the problems. Applying the patch has helped a lot. I am not sure it is related, but you might want to check that out. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-using-a-ridiculous-amount-of-memory-tp4050840p4070803.html
Re: Solr using a ridiculous amount of memory
Yeah, this is yet another anti-pattern we need to be discouraging - large multivalued fields. They indicate that the data model is not well balanced and aligned with the strengths of Solr and Lucene. -- Jack Krupansky -Original Message- From: adityab Sent: Sunday, June 16, 2013 9:36 AM To: solr-user@lucene.apache.org Subject: Re: Solr using a ridiculous amount of memory
Re: Solr using a ridiculous amount of memory
Sorry for not getting back to the list sooner. It seems like I finally solved the memory problems by following Toke's instruction of splitting the cores up into smaller chunks. After some major refactoring, our 15 cores have now turned into ~500 cores and our memory consumption has dropped dramatically. Running 200 webshops now actually uses less memory than our 24 test shops did before.

Thank you to everyone who helped, and especially to Toke.

I looked at the wiki, but could not find any reference to this unintuitive way of using memory. Did I miss it somewhere?

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
On Fri, 2013-06-14 at 14:55 +0200, John Nielsen wrote:
> Sorry for not getting back to the list sooner.

Time not important, only feedback important (apologies to Fifth Element).

> After some major refactoring, our 15 cores have now turned into ~500 cores and our memory consumption has dropped dramatically. Running 200 webshops now actually uses less memory than our 24 test shops did before.

That's great to hear. One core/shop also sounds like a cleaner setup.

> I looked at the wiki, but could not find any reference to this unintuitive way of using memory. Did I miss it somewhere?

I am not aware of a wikified explanation, but a section on "Why does Solr use so much memory?" with some suggestions for changes to setup would seem appropriate. You are not the first to have these kinds of problems.

Thank you for closing the issue,
Toke Eskildsen
Re: Solr using a ridiculous amount of memory
Hmmm. There has been quite a bit of work lately to support a couple of things that might be of interest (4.3, which Simon cut today, probably available to all mid next week at the latest). Basically, you can choose to pre-define all the cores in solr.xml (so-called old style) _or_ use the new-style solr.xml which uses auto-discover mode to walk the indicated directory and find all the cores (indicated by the presence of a 'core.properties' file). Don't know if this would make your particular case easier, and I should warn you that this is relatively new code (although there are some reasonable unit tests). You also have the option to only load the cores when they are referenced, and only keep N cores open at a time (loadOnStartup and transient properties). See: http://wiki.apache.org/solr/CoreAdmin#Configuration and http://wiki.apache.org/solr/Solr.xml%204.3%20and%20beyond Note, the docs are somewhat sketchy, so if you try to go down this route let us know anything that should be improved (or you can be added to the list of wiki page contributors and help out!) Best Erick On Thu, Apr 18, 2013 at 8:31 AM, John Nielsen j...@mcb.dk wrote: You are missing an essential part: Both the facet and the sort structures needs to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields. Wow, thank you for this awesome explanation! This is where the penny dropped for me. I will definetely move to a multi-core setup. It will take some time and a lot of re-coding. As soon as I know the result, I will let you know! -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
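Erick's description of discovery mode with lazily loaded cores can be sketched roughly as follows. This is only an illustration, assuming Solr 4.3+ core discovery: one directory per core, each marked by a core.properties file that sets loadOnStartup and transient. The shop_<id> naming, the directory layout and the helper name are hypothetical, and the limit on how many transient cores stay open is configured separately in solr.xml and is not shown here.

```python
import tempfile
from pathlib import Path


def write_core_properties(solr_home, shop_ids):
    """Create one core directory per shop, each with a lazily loaded core.properties."""
    written = []
    for shop_id in shop_ids:
        core_dir = Path(solr_home) / f"shop_{shop_id}"  # hypothetical naming scheme
        core_dir.mkdir(parents=True, exist_ok=True)
        props = core_dir / "core.properties"
        props.write_text(
            f"name=shop_{shop_id}\n"
            "loadOnStartup=false\n"  # do not open the core until it is referenced
            "transient=true\n"       # core may be closed again when too many are open
        )
        written.append(props)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_core_properties(tmp, [10217, 10218]):
            print(path.relative_to(tmp), "->", path.read_text().split())
```

With a layout like this, Solr would find the cores by walking the directory tree, so adding a shop means adding a directory rather than editing solr.xml.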
Re: Solr using a ridiculous amount of memory
> That was strange. As you are using a multi-valued field with the new setup, they should appear there.

Yes, the new field we use for faceting is a multi valued field.

> Can you find the facet fields in any of the other caches?

Yes, here it is, in the field cache: http://screencast.com/t/mAwEnA21yL

> I hope you are not calling the facets with facet.method=enum? Could you paste a typical facet-enabled search request?

Here is a typical example (I added newlines for readability):

http://172.22.51.111:8000/solr/default1_Danish/search
?defType=edismax
q=*%3a*
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_7+key%3ditemvariantoptions_int_mv_7%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_9+key%3ditemvariantoptions_int_mv_9%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_8+key%3ditemvariantoptions_int_mv_8%7ditemvariantoptions_int_mv
facet.field=%7b!ex%3dtagitemvariantoptions_int_mv_2+key%3ditemvariantoptions_int_mv_2%7ditemvariantoptions_int_mv
fq=site_guid%3a(10217)
fq=item_type%3a(PRODUCT)
fq=language_guid%3a(1)
fq=item_group_1522_combination%3a(*)
fq=is_searchable%3a(True)
sort=item_group_1522_name_int+asc, variant_of_item_guid+asc
querytype=Technical
fl=feed_item_serialized
facet=true
group=true
group.facet=true
group.ngroups=true
group.field=groupby_variant_of_item_guid
group.sort=name+asc
rows=0

> Are you warming all the sort- and facet-fields?

I'm sorry, I don't know. I have the field value cache commented out in my config, so... Whatever is default?

Removing the custom sort fields is unfortunately quite a bit more difficult than my other facet modification. The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name.
To solve it in the same way I solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that it's possible. I am quite stumped on how to fix this.

On Wed, Apr 17, 2013 at 3:06 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

> John Nielsen [j...@mcb.dk]:
>> I never seriously looked at my fieldValueCache. It never seemed to get used: http://screencast.com/t/YtKw7UQfU
>
> That was strange. As you are using a multi-valued field with the new setup, they should appear there. Can you find the facet fields in any of the other caches?
>
> ...I hope you are not calling the facets with facet.method=enum? Could you paste a typical facet-enabled search request?
>
>> Yep. We still do a lot of sorting on dynamic field names, so the field cache has a lot of entries. (9.411 entries as we speak. This is considerably lower than before.) You mentioned in an earlier mail that faceting on a field shared between all facet queries would bring down the memory needed. Does the same thing go for sorting?
>
> More or less. Sorting stores the raw string representations (utf-8) in memory, so the number of unique values has more to say than it does for faceting. Just as with faceting, a list of pointers from documents to values (1 value/document as we are sorting) is maintained, so the overhead is something like
>
> #documents*log2(#unique_terms*average_term_length) + #unique_terms*average_term_length
>
> (where average_term_length is in bits)
>
> Caveat: This is with the index-wide sorting structure. I am fairly confident that this is what Solr uses, but I have not looked at it lately, so it is possible that some memory-saving segment-based trickery has been implemented.
>
>> Does those 9411 entries duplicate data between them?
>
> Sorry, I do not know. SOLR- discusses the problems with the field cache and duplication of data, but I cannot infer if it has been solved or not. I am not familiar with the stat breakdown of the fieldCache, but it _seems_ to me that there are 2 or 3 entries for each segment for each sort field. Guesstimating further, let's say you have 30 segments in your index. Going with the guesswork, that would bring the number of sort fields to 9411/3/30 ~= 100. Looks like you use a custom sort field for each client?
>
> Extrapolating from 1.4M documents and 180 clients, let's say that there are 1.4M/180/5 unique terms for each sort-field and that their average length is 10. We thus have
>
> 1.4M*log2(1500*10*8) + 1500*10*8 bit ~= 23MB
>
> per sort field or about 4GB for all the 180 fields. With this few unique values, the doc-value structure is by far the biggest, just as with facets. As opposed to the faceting structure, this is fairly close to the actual memory usage. Switching to a single sort field would reduce the memory usage from 4GB to about 55MB.
>
>> I do commit a bit more often than I should. I get these in my log file from time to time: PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>
> So 1 active searcher and 2 warming searchers. Ignoring that one of the warming searchers is highly likely to
Re: Solr using a ridiculous amount of memory
On Thu, 2013-04-18 at 08:34 +0200, John Nielsen wrote:
> [Toke: Can you find the facet fields in any of the other caches?]
>
> Yes, here it is, in the field cache: http://screencast.com/t/mAwEnA21yL

Ah yes, mystery solved, my mistake.

> http://172.22.51.111:8000/solr/default1_Danish/search
> [...]
> fq=site_guid%3a(10217)

This constrains the hits to a specific customer, right? Any search will only be in a single customer's data?

> [Toke: Are you warming all the sort- and facet-fields?]
>
> I'm sorry, I don't know. I have the field value cache commented out in my config, so... Whatever is default?

(a bit shaky here) I would say not warming. You could check simply by starting Solr and looking at the caches before you issue any searches.

This fits the description of your searchers gradually eating memory until your JVM OOMs. Each time a new field is faceted or sorted upon, it is added to the cache. As your index is relatively small and the number of values in the single fields is small, the initialization time for a field is so short that it is not a performance problem. Memory-wise it is death by a thousand cuts.

If you did explicit warming of all the possible fields for sorting and faceting, you would allocate it all up front and would be sure that there would be enough memory available. But it would take much longer than your current setup. You might want to try it out (no need to fiddle with Solr setup, just make a script and fire wgets as this has the same effect).

> The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name. To solve it in the same way i solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that it's possible.

That is correct. Three suggestions off the bat:

1) Reduce the number of sort fields by mapping names. Count the maximum number of unique sort fields for any given customer. That will be the total number of sort fields in the index. For each group number for a customer, map that number to one of the index-wide sort fields. This only works if the maximum number of unique fields is low (let's say a single field takes 50MB, so 20 fields should be okay).

2) Create a custom sorter for Solr. Create a field with all the sort values, prefixed by group ID. Create a structure (or reuse the one from Lucene) with a doc-terms map with all the terms in-memory. When sorting, extract the relevant compare-string for a document by iterating all the terms for the document and selecting the one with the right prefix. Memory-wise this scales linearly with the number of terms instead of the number of fields, but it would require quite some coding.

3) Switch to a layout where each customer has a dedicated core. The basic overhead is a lot larger than for a shared index, but it would make your setup largely immune to the adverse effect of many documents coupled with many facet- and sort-fields.

- Toke Eskildsen, State and University Library, Denmark
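Suggestion 1 above (mapping each customer's group numbers onto a small set of index-wide sort fields) might look something like the sketch below. It is an illustration only: the sort_slot_N field names, the input shape and the function name are made up for the example, and nothing here talks to Solr.

```python
def build_sort_slot_mapping(groups_by_customer):
    """Map each customer's sort-group numbers onto shared slot fields.

    groups_by_customer: dict of customer id -> iterable of group numbers.
    Returns (mapping, slot_count): mapping[customer][group] is the shared
    field name, slot_count is the number of index-wide sort fields needed.
    """
    mapping = {}
    slot_count = 0
    for customer, groups in groups_by_customer.items():
        ordered = sorted(set(groups))
        # Slot names are hypothetical; any stable naming scheme would do.
        mapping[customer] = {g: f"sort_slot_{i}" for i, g in enumerate(ordered)}
        # The index needs as many slots as the customer with the most groups.
        slot_count = max(slot_count, len(ordered))
    return mapping, slot_count


if __name__ == "__main__":
    example = {10217: [1522, 1530], 10218: [7], 10219: [3, 9, 1522]}
    mapping, slots = build_sort_slot_mapping(example)
    print("index-wide sort fields needed:", slots)
    print("customer 10217, group 1522 sorts on:", mapping[10217][1522])
```

At index time a document's sort value for a group would be written to the mapped slot field, and at query time the sort parameter would name that slot. The number of cached sort structures is then bounded by the slot count rather than by the total number of groups across all customers.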
Re: Solr using a ridiculous amount of memory
http://172.22.51.111:8000/solr/default1_Danish/search [...] fq=site_guid%3a(10217) This constraints to hits to a specific customer, right? Any search will only be in a single customer's data? Yes, thats right. No search from any given client ever returns anything from another client. [Toke: Are you warming all the sort- and facet-fields?] I'm sorry, I don't know. I have the field value cache commented out in my config, so... Whatever is default? (a bit shaky here) I would say not warming. You could check simply by starting solr and looking at the caches before you issue any searches. The field cache shows 0 entries at startup. On the running server, forcing a commit (and thus opening a new searcher) does not change the number of entries. The problem is that each item can have several sort orders. The sort order to use is defined by a group number which is known ahead of time. The group number is included in the sort order field name. To solve it in the same way i solved the facet problem, I would need to be able to sort on a multi-valued field, and unless I'm wrong, I don't think that it's possible. That is correct. Three suggestions off the bat: 1) Reduce the number of sort fields by mapping names. Count the maximum number of unique sort fields for any given customer. That will be the total number of sort fields in the index. For each group number for a customer, map that number to one of the index-wide sort fields. This only works if the maximum number of unique fields is low (let's say a single field takes 50MB, so 20 fields should be okay). I just checked our DB. Our worst case scenario client has over a thousand groups for sorting. Granted, it may be, probably is, an error with the data. It is an interesting idea though and I will look into this posibility. 3) Switch to a layout where each customer has a dedicated core. 
The basic overhead is a lot larger than for a shared index, but it would make your setup largely immune to the adverse effect of many documents coupled with many facet- and sort-fields. Now this is where my brain melts down. If I understand the fieldCache mechanism correctly (which i can see that I don't), the data used for faceting and sorting is saved in the fieldCache using a key comprised of the fields used for said faceting/sorting. That data only contains the data which is actually used for the operation. This is what the fq queries are for. So if i generate a core for each client, I would have a client specific fieldCache containing the data from that client. Wouldn't I just split up the same data into several cores? I'm afraid I don't understand how this would help. -- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
On Thu, 2013-04-18 at 11:59 +0200, John Nielsen wrote:
> Yes, that's right. No search from any given client ever returns anything from another client.

Great. That makes the 1 core/client solution feasible.

[No sort facet warmup is performed]
[Suggestion 1: Reduce the number of sort fields by mapping]
[Suggestion 3: 1 core/customer]

> If I understand the fieldCache mechanism correctly (which I can see that I don't), the data used for faceting and sorting is saved in the fieldCache using a key comprised of the fields used for said faceting/sorting. That data only contains the data which is actually used for the operation. This is what the fq queries are for.

You are missing an essential part: Both the facet and the sort structures need to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields.

It might help to visualize the structures as arrays of values with docID as index: String[] myValues = new String[1400000] takes up 1.4M * 32 bit (or more for a 64 bit machine) = 5.6MB, even when it is empty.

Note: Neither String-objects, nor Java references are used for the real facet- and sort-structures, but the principle is quite the same.

> So if I generate a core for each client, I would have a client specific fieldCache containing the data from that client. Wouldn't I just split up the same data into several cores?

The same terms, yes, but not the same references.

Let's say your customer has 10K documents in the index and that there are 100 unique values, each 10 bytes long, in each group. As each group holds its own separate structure, we use the old formula to get the memory overhead:

#documents*log2(#unique_terms*average_term_length) + #unique_terms*average_term_length
1.4M*log2(100*(10*8)) + 100*(10*8) bit = 1.2MB + 1KB.

Note how the values themselves are just 1KB, while the nearly empty reference list takes 1.2MB.

Compare this to a dedicated core with just the 10K documents:

10K*log2(100*(10*8)) + 100*(10*8) bit = 8.5KB + 1KB.

The terms take up exactly the same space, but the heap requirement for the references is reduced by 99%.

Now, 25GB for 180 clients means 140MB/client with your current setup. I do not know the memory overhead of running a core, but since Solr can run fine with 32MB for small indexes, it should be smaller than that. You will of course have to experiment and to measure.

- Toke Eskildsen, State and University Library, Denmark
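The shared-index versus dedicated-core comparison above can be reproduced with a few lines of Python. This is a rough sketch that evaluates the formula exactly as it is written in the message; the function names are invented for the example. Evaluated literally, the absolute reference sizes come out somewhat larger than the 1.2MB and 8.5KB quoted (those figures look closer to log2(#unique_terms) bits per reference), but the conclusion does not depend on that: the reference part scales with the document count, so it shrinks by the same ~99%.

```python
import math


def sort_structure_bits(documents, unique_terms, avg_term_bytes):
    """Return (reference_bits, term_bits) for one sort/facet field.

    Formula as written in the message, with term length in bits:
    #documents*log2(#unique_terms*average_term_length)
        + #unique_terms*average_term_length
    """
    term_bits = unique_terms * avg_term_bytes * 8
    reference_bits = documents * math.log2(term_bits)
    return reference_bits, term_bits


def kib(bits):
    """Convert bits to KiB for printing."""
    return bits / 8 / 1024


if __name__ == "__main__":
    shared = sort_structure_bits(1_400_000, 100, 10)  # field in the shared index
    dedicated = sort_structure_bits(10_000, 100, 10)  # same field in a per-customer core
    for label, (refs, terms) in (("shared", shared), ("dedicated", dedicated)):
        print(f"{label:9s} references: {kib(refs):10.1f} KiB  terms: {kib(terms):.2f} KiB")
    print(f"reference reduction: {1 - dedicated[0] / shared[0]:.1%}")
```

The term storage is identical in both cases; only the per-document reference list changes, which is why splitting into cores helps even though the data is the same.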
Re: Solr using a ridiculous amount of memory
> You are missing an essential part: Both the facet and the sort structures need to hold one reference for each document _in_the_full_index_, even when the document does not have any values in the fields.

Wow, thank you for this awesome explanation! This is where the penny dropped for me.

I will definitely move to a multi-core setup. It will take some time and a lot of re-coding. As soon as I know the result, I will let you know!

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
I managed to get this done. The facet queries now facets on a multivalue field as opposed to the dynamic field names. Unfortunately it doesn't seem to have done much difference, if any at all. Some more information that might help: The JVM memory seem to be eaten up slowly. I dont think that there is one single query that causes the problem. My test case (dumping 180 clients on top of solr) takes hours before it causes an OOM. Often a full day. The memory usage wobbles up and down, so the GC is at least partially doing its job. It still works its way up to 100% eventually. When that happens it either OOM's or it stops the world and brings the memory consumption to 10-15 gigs. I did try to facet on all products across all clients (about 1.4 mil docs) and i could not make it OOM on a server with a 4 gig jvm. This was on a dedicated test server with my test being the only traffic. I am beginning to think that this may be related to traffic volume and not just on the type of query that I do. I tried to calculate the memory requirement example you gave me above based on the change that got rid of the dynamic fields. documents = ~1.400.000 references 11.200.000 (we facet on two multivalue fields with each 4 values on average, so 1.400.000 * 2 * 4 = 11.200.000 unique values = 1.132.344 (total number of variant options across all clients. This is what we facet on) 1.400.000 * log2(11.200.000) + 1.400.000 * log2(1132344) = ~14MB per field (we have 4 fields)? I must be calculating this wrong. On Mon, Apr 15, 2013 at 2:10 PM, John Nielsen j...@mcb.dk wrote: I did a search. I have no occurrence of UnInverted in the solr logs. Another explanation for the large amount of memory presents itself if you use a single index: If each of your clients facet on at least one fields specific to the client (client123_persons or something like that), then your memory usage goes through the roof. This is exactly how we facet right now! 
I will definetely rewrite the relevant parts of our product to test this out before moving further down the docValues path. I will let you know as soon as I know one way or the other. On Mon, Apr 15, 2013 at 1:38 PM, Toke Eskildsen t...@statsbiblioteket.dkwrote: On Mon, 2013-04-15 at 10:25 +0200, John Nielsen wrote: The FieldCache is the big culprit. We do a huge amount of faceting so it seems right. Yes, you wrote that earlier. The mystery is that the math does not check out with the description you have given us. Unfortunately I am super swamped at work so I have precious little time to work on this, which is what explains my silence. No problem, we've all been there. [Band aid: More memory] The extra memory helped a lot, but it still OOM with about 180 clients using it. You stated earlier that you has a solr cluster and your total(?) index size was 35GB, with each register being between 15k and 30k. I am using the quotes to signify that it is unclear what you mean. Is your cluster multiple machines (I'm guessing no), multiple Solr's, cores, shards or maybe just a single instance prepared for later distribution? Is a register a core, shard or a simply logical part (one client's data) of the index? If each client has their own core or shard, that would mean that each client uses more than 25GB/180 bytes ~= 142MB of heap to access 35GB/180 ~= 200MB of index. That sounds quite high and you would need a very heavy facet to reach that. If you could grep UnInverted from the Solr log file and paste the entries here, that would help to clarify things. Another explanation for the large amount of memory presents itself if you use a single index: If each of your clients facet on at least one fields specific to the client (client123_persons or something like that), then your memory usage goes through the roof. 
Assuming an index with 10M documents, each with 5 references to a modest 10K unique values in a facet field, the simplified formula #documents*log2(#references) + #references*log2(#unique_values) bit tells us that this takes at least 110MB with field cache based faceting. 180 clients @ 110MB ~= 20GB. As that is a theoretical low, we can at least double that. This fits neatly with your new heap of 64GB. If my guessing is correct, you can solve your memory problems very easily by sharing _all_ the facet fields between your clients. This should bring your memory usage down to a few GB. You are probably already restricting their searches to their own data by filtering, so this should not influence the returned facet values and counts, as compared to separate fields. This is very similar to the thread Facets with 5000 facet fields BTW. Today I finally managed to set up a test core so I can begin to play around with docValues. If you are using a single index with the individual-facet-fields for each client approach, the DocValues will also have scaling issues, as the amount of values (of which the
RE: Solr using a ridiculous amount of memory
John Nielsen [j...@mcb.dk] wrote: I managed to get this done. The facet queries now facets on a multivalue field as opposed to the dynamic field names. Unfortunately it doesn't seem to have done much difference, if any at all. I am sorry to hear that. documents = ~1.400.000 references 11.200.000 (we facet on two multivalue fields with each 4 values on average, so 1.400.000 * 2 * 4 = 11.200.000 unique values = 1.132.344 (total number of variant options across all clients. This is what we facet on) 1.400.000 * log2(11.200.000) + 1.400.000 * log2(1132344) = ~14MB per field (we have 4 fields)? I must be calculating this wrong. No, that sounds about right. In reality you need to multiply with 3 or 4, so let's round to 50MB/field: 1.4M documents with 2 fields with 5M references/field each is not very much and should not take a lot of memory. In comparison, we facet on 12M documents with 166M references and do some other stuff (in Lucene with a different faceting implementation, but at this level it is equivalent to Solr's in terms of memory). Our heap is 3GB. I am surprised about the lack of UnInverted from your logs as it is logged on INFO level. It should also be available from the admin interface under collection/Plugin / Stats/CACHE/fieldValueCache. But I am guessing you got your numbers from that and that the list only contains the few facets you mentioned previously? It might be wise to sanity check by summing the memSizes though; they ought to take up far below 1GB. From your description, your index is small and your faceting requirements modest. A SSD-equipped laptop should be adequate as server. So we are back to math does not check out. You stated that you were unable to make a 4GB JVM OOM when you just performed faceting (I guesstimate that it will also run fine with just ½GB or at least with 1GB, based on the numbers above) and you have observed that the field cache eats the memory. 
This does indicate that the old caches are somehow not freed when the index is updated. That is strange as Solr should take care of that automatically. Guessing wildly: Do you issue a high frequency small updates with frequent commits? If you pause the indexing, does memory use fall back to the single GB level (You probably need to trigger a full GC to check that)? If that is the case, it might be a warmup problem with old warmups still running when new commits are triggered. Regards, Toke Eskildsen, State and University Library, Denmark
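For anyone wanting to redo the facet arithmetic discussed above, here is a small sketch using the simplified formula quoted earlier in the thread, #documents*log2(#references) + #references*log2(#unique_values) bit. It is a theoretical lower bound only (Toke suggests multiplying by 3 or 4 in practice), the function name is made up, and treating John's 11.2M references as a single field is an assumption made for the example.

```python
import math


def facet_structure_mb(documents, references, unique_values):
    """Lower-bound estimate in MB (2**20 bytes) from the simplified formula
    #documents*log2(#references) + #references*log2(#unique_values) bit."""
    bits = documents * math.log2(references) + references * math.log2(unique_values)
    return bits / 8 / 2**20


if __name__ == "__main__":
    # Example from earlier in the thread: 10M docs, 5 references each, 10K unique values.
    print(f"10M-doc example: {facet_structure_mb(10_000_000, 50_000_000, 10_000):.0f} MB")
    # John's figures: 1.4M docs, 11.2M references, 1,132,344 unique values.
    print(f"John's index:    {facet_structure_mb(1_400_000, 11_200_000, 1_132_344):.0f} MB")
```

Either way the result stays in the tens of megabytes to roughly a hundred, which is the point of the message: the faceting structures alone do not explain a heap measured in tens of gigabytes.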
Re: Solr using a ridiculous amount of memory
I am surprised about the lack of UnInverted from your logs as it is logged on INFO level.

Nope, no trace of it. No mention either in Logging - Level from the admin interface.

It should also be available from the admin interface under collection/Plugin / Stats/CACHE/fieldValueCache.

I never seriously looked at my fieldValueCache. It never seemed to get used: http://screencast.com/t/YtKw7UQfU

You stated that you were unable to make a 4GB JVM OOM when you just performed faceting (I guesstimate that it will also run fine with just ½GB or at least with 1GB, based on the numbers above) and you have observed that the field cache eats the memory.

Yep. We still do a lot of sorting on dynamic field names, so the field cache has a lot of entries (9.411 entries as we speak; this is considerably lower than before). You mentioned in an earlier mail that faceting on a field shared between all facet queries would bring down the memory needed. Does the same thing go for sorting? Do those 9411 entries duplicate data between them? If this is where all the memory is going, I have a lot of coding to do.

Guessing wildly: Do you issue a high frequency of small updates with frequent commits? If you pause the indexing, does memory use fall back to the single GB level?

I do commit a bit more often than I should. I get these in my log file from time to time: PERFORMANCE WARNING: Overlapping onDeckSearchers=2

The way I understand this is that two searchers are being warmed at the same time and that one will be discarded when it finishes its auto warming procedure. If the math above is correct, I would need tens of searchers auto warming in parallel to cause my problem. If I misunderstand how this works, do let me know.

My indexer has a cleanup routine that deletes replay logs and other things when it has nothing to do. This includes running a commit on the solr server to make sure nothing is ever in a state where something is not written to disk anywhere. In theory it can commit once every 60 seconds, though I doubt that ever happens. The less work the indexer has, the more often it commits. (Yes, I know, it's on my todo list.) Other than that, my autocommit settings look like this: <autoCommit> <maxTime>6</maxTime> <maxDocs>6000</maxDocs> <openSearcher>false</openSearcher> </autoCommit>

The control panel says that the warm up time of the last searcher is 5574. Is that seconds or milliseconds? http://screencast.com/t/d9oIbGLCFQwl

I would prefer to not turn off the indexer unless the numbers above suggest that I really should try this. Waiting for a full GC would take a long time. Unfortunately I don't know of a way to provoke a full GC on command.

On Wed, Apr 17, 2013 at 11:48 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

John Nielsen [j...@mcb.dk] wrote:

I managed to get this done. The facet queries now facet on a multivalue field as opposed to the dynamic field names. Unfortunately it doesn't seem to have made much difference, if any at all.

I am sorry to hear that.

documents = ~1.400.000
references = 11.200.000 (we facet on two multivalue fields with each 4 values on average, so 1.400.000 * 2 * 4 = 11.200.000)
unique values = 1.132.344 (total number of variant options across all clients. This is what we facet on)
1.400.000 * log2(11.200.000) + 1.400.000 * log2(1132344) = ~14MB per field (we have 4 fields)?
I must be calculating this wrong.

No, that sounds about right. In reality you need to multiply with 3 or 4, so let's round to 50MB/field: 1.4M documents with 2 fields with 5M references/field each is not very much and should not take a lot of memory. In comparison, we facet on 12M documents with 166M references and do some other stuff (in Lucene with a different faceting implementation, but at this level it is equivalent to Solr's in terms of memory). Our heap is 3GB.

I am surprised about the lack of UnInverted from your logs as it is logged on INFO level. It should also be available from the admin interface under collection/Plugin / Stats/CACHE/fieldValueCache. But I am guessing you got your numbers from that and that the list only contains the few facets you mentioned previously? It might be wise to sanity check by summing the memSizes though; they ought to take up far below 1GB.

From your description, your index is small and your faceting requirements modest. An SSD-equipped laptop should be adequate as server. So we are back to the math not checking out.

You stated that you were unable to make a 4GB JVM OOM when you just performed faceting (I guesstimate that it will also run fine with just ½GB or at least with 1GB, based on the numbers above) and you have observed that the field cache eats the memory. This does indicate that the old caches are somehow not freed when the index is updated. That is strange as Solr should take care of that automatically.

Guessing wildly: Do you issue a high frequency of small updates with frequent commits? If you pause the
RE: Solr using a ridiculous amount of memory
John Nielsen [j...@mcb.dk]:

I never seriously looked at my fieldValueCache. It never seemed to get used: http://screencast.com/t/YtKw7UQfU

That was strange. As you are using a multi-valued field with the new setup, they should appear there. Can you find the facet fields in any of the other caches? ...I hope you are not calling the facets with facet.method=enum? Could you paste a typical facet-enabled search request?

Yep. We still do a lot of sorting on dynamic field names, so the field cache has a lot of entries. (9.411 entries as we speak. This is considerably lower than before.). You mentioned in an earlier mail that faceting on a field shared between all facet queries would bring down the memory needed. Does the same thing go for sorting?

More or less. Sorting stores the raw string representations (utf-8) in memory so the number of unique values has more to say than it does for faceting. Just as with faceting, a list of pointers from documents to values (1 value/document as we are sorting) is maintained, so the overhead is something like

#documents*log2(#unique_terms*average_term_length) + #unique_terms*average_term_length

(where average_term_length is in bits).

Caveat: This is with the index-wide sorting structure. I am fairly confident that this is what Solr uses, but I have not looked at it lately so it is possible that some memory-saving segment-based trickery has been implemented.

Do those 9411 entries duplicate data between them?

Sorry, I do not know. SOLR- discusses the problems with the field cache and duplication of data, but I cannot infer if it has been solved or not. I am not familiar with the stat breakdown of the fieldCache, but it _seems_ to me that there are 2 or 3 entries for each segment for each sort field. Guesstimating further, let's say you have 30 segments in your index. Going with the guesswork, that would bring the number of sort fields to 9411/3/30 ~= 100. Looks like you use a custom sort field for each client?

Extrapolating from 1.4M documents and 180 clients, let's say that there are 1.4M/180/5 unique terms for each sort-field and that their average length is 10. We thus have 1.4M*log2(1500*10*8) + 1500*10*8 bit ~= 23MB per sort field or about 4GB for all the 180 fields. With this few unique values, the doc-value structure is by far the biggest, just as with facets. As opposed to the faceting structure, this is fairly close to the actual memory usage. Switching to a single sort field would reduce the memory usage from 4GB to about 55MB.

I do commit a bit more often than I should. I get these in my log file from time to time: PERFORMANCE WARNING: Overlapping onDeckSearchers=2

So 1 active searcher and 2 warming searchers. Ignoring that one of the warming searchers is highly likely to finish well ahead of the other one, that means that your heap must hold 3 times the structures for a single searcher. With the old heap size of 25GB that left only 8GB for a full dataset. Subtract the 4GB for sorting and a similar amount for faceting and you have your OOM.

Tweaking your ingest to avoid 3 overlapping searchers will lower your memory requirements by 1/3. Fixing the facet sorting logic will bring it down to laptop size.

The control panel says that the warm up time of the last searcher is 5574. Is that seconds or milliseconds? http://screencast.com/t/d9oIbGLCFQwl

Milliseconds, I am fairly sure. It is much faster than I anticipated. Are you warming all the sort- and facet-fields?

Waiting for a full GC would take a long time.

Until you have fixed the core memory issue, you might consider doing an explicit GC every night to clean up and hope that it does not occur automatically at daytime (or whenever your clients use it).

Unfortunately I don't know of a way to provoke a full GC on command.

VisualVM, which is delivered with the Oracle JDK (look somewhere in the bin folder), is your friend. Just start it on the server and click on the relevant process.

Regards, Toke Eskildsen
RE: Solr using a ridiculous amount of memory
Whoops. I made some mistakes in the previous post.

Toke Eskildsen [t...@statsbiblioteket.dk]:

Extrapolating from 1.4M documents and 180 clients, let's say that there are 1.4M/180/5 unique terms for each sort-field and that their average length is 10. We thus have 1.4M*log2(1500*10*8) + 1500*10*8 bit ~= 23MB per sort field or about 4GB for all the 180 fields.

That would be 10 bytes and thus 80 bits. The results were correct though.

So 1 active searcher and 2 warming searchers. Ignoring that one of the warming searchers is highly likely to finish well ahead of the other one, that means that your heap must hold 3 times the structures for a single searcher.

This should be taken with a grain of salt as it depends on whether or not there is any re-use of segments. There might be for sorting.

Apologies for any confusion, Toke Eskildsen
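A small Python sketch of the sorting estimate, using the numbers from the post and the 80-bit term length from the correction. The function name `sort_bits` and the shared-field term count of 1.4M/5 are assumptions made for illustration, not something Solr exposes. The expression counts bits, so the sketch prints both megabits and megabytes.

```python
import math


def sort_bits(documents, unique_terms, avg_term_bits):
    """Lower bound in bits for one index-wide sort field.

    documents * log2(unique_terms * avg_term_bits) + unique_terms * avg_term_bits
    """
    term_bits = unique_terms * avg_term_bits
    return documents * math.log2(term_bits) + term_bits


documents = 1_400_000
avg_term_bits = 10 * 8  # 10 bytes per term, i.e. 80 bits

# Per-client sort field: about 1500 unique terms (1.4M / 180 / 5, rounded).
per_client = sort_bits(documents, 1_500, avg_term_bits)

# One sort field shared by all clients.
# Assumption: 1.4M / 5 unique terms in total.
shared = sort_bits(documents, documents // 5, avg_term_bits)

for label, bits in (("per-client field", per_client), ("shared field", shared)):
    print(f"{label}: {bits / 1e6:.1f} Mbit = {bits / 8 / 1e6:.1f} MB")
```

For these inputs the expression comes to about 23.7 Mbit (roughly 3 MB) per client field and about 56.6 Mbit (roughly 7 MB) for a shared field. The 23MB and 55MB figures in the post appear to be these bit counts, so they are best read as generous estimates. Either way the ratio between 180 per-client fields and one shared field is what matters.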
Re: Solr using a ridiculous amount of memory
On Sun, 2013-03-24 at 09:19 +0100, John Nielsen wrote: Our memory requirements are running amok. We have less than a quarter of our customers running now and even though we have allocated 25GB to the JVM already, we are still seeing daily OOM crashes. Out of curiosity: Did you manage to pinpoint the memory eater in your setup? - Toke Eskildsen
Re: Solr using a ridiculous amount of memory
Yes and no. The FieldCache is the big culprit. We do a huge amount of faceting so it seems right. Unfortunately I am super swamped at work so I have precious little time to work on this, which is what explains my silence.

Out of desperation, I added another 32G of memory to each server and increased the JVM size to 64G from 25G. The servers are running with 96G memory right now (this is the max amount supported by the hardware) which leaves solr somewhat starved for memory. I am aware of the performance implications of doing this but I have little choice.

The extra memory helped a lot, but it still OOMs with about 180 clients using it. Unfortunately I need to support at least double that. After upgrading the RAM, I ran for almost two weeks with the same workload that used to OOM a couple of times a day, so it doesn't look like a leak.

Today I finally managed to set up a test core so I can begin to play around with docValues. I actually have a couple of questions regarding docValues:

1) If I facet on multiple fields and only some of those fields are using docValues, will I still get the memory saving benefit of docValues? (One of the facet fields uses null values and will require a lot of work in our product to fix.)

2) If I just use docValues on one small core with very limited traffic at first for testing purposes, how can I test that it is actually using the disk for caching?

I really appreciate all the help I have received on this list so far. I do feel confident that I will be able to solve this issue eventually.

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
On Mon, 2013-04-15 at 10:25 +0200, John Nielsen wrote:

The FieldCache is the big culprit. We do a huge amount of faceting so it seems right.

Yes, you wrote that earlier. The mystery is that the math does not check out with the description you have given us.

Unfortunately I am super swamped at work so I have precious little time to work on this, which is what explains my silence.

No problem, we've all been there.

[Band aid: More memory] The extra memory helped a lot, but it still OOMs with about 180 clients using it.

You stated earlier that you have a "solr cluster" and your total(?) index size was 35GB, with each "register" being between 15k and 30k. I am using the quotes to signify that it is unclear what you mean. Is your cluster multiple machines (I'm guessing no), multiple Solrs, cores, shards or maybe just a single instance prepared for later distribution? Is a register a core, a shard or simply a logical part (one client's data) of the index?

If each client has their own core or shard, that would mean that each client uses more than 25GB/180 ~= 142MB of heap to access 35GB/180 ~= 200MB of index. That sounds quite high and you would need a very heavy facet to reach that. If you could grep UnInverted from the Solr log file and paste the entries here, that would help to clarify things.

Another explanation for the large amount of memory presents itself if you use a single index: If each of your clients facets on at least one field specific to the client (client123_persons or something like that), then your memory usage goes through the roof.

Assuming an index with 10M documents, each with 5 references to a modest 10K unique values in a facet field, the simplified formula

#documents*log2(#references) + #references*log2(#unique_values) bit

tells us that this takes at least 110MB with field cache based faceting. 180 clients @ 110MB ~= 20GB. As that is a theoretical low, we can at least double that. This fits neatly with your new heap of 64GB.

If my guessing is correct, you can solve your memory problems very easily by sharing _all_ the facet fields between your clients. This should bring your memory usage down to a few GB. You are probably already restricting their searches to their own data by filtering, so this should not influence the returned facet values and counts, as compared to separate fields.

This is very similar to the thread "Facets with 5000 facet fields" BTW.

Today I finally managed to set up a test core so I can begin to play around with docValues.

If you are using a single index with the individual-facet-fields for each client approach, the DocValues will also have scaling issues, as the number of values (of which the majority will be null) will be

#clients*#documents*#facet_fields

This means that adding a new client will be progressively more expensive. On the other hand, if you use a lot of small shards, DocValues should work for you.

Regards, Toke Eskildsen
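To make the per-client-field arithmetic concrete, here is a minimal Python sketch that reruns the example from this post. The helper names `facet_bits` and `megabytes` are assumptions for illustration, and so is the guess that a shared field would hold 180 times as many unique values. The formula is the simplified one given above, so the results are lower bounds rather than measurements.

```python
import math


def facet_bits(documents, references, unique_values):
    """Simplified lower bound in bits for one field-cache facet field."""
    return documents * math.log2(references) + references * math.log2(unique_values)


def megabytes(bits):
    """Convert a bit count to decimal megabytes."""
    return bits / 8 / 1e6


documents = 10_000_000       # documents in the full index
references = 5 * documents   # 5 references per document
unique_per_client = 10_000   # unique values in one client's facet field
clients = 180

# One facet field per client: every field pays for every document in the index.
per_field_mb = megabytes(facet_bits(documents, references, unique_per_client))
per_client_fields_gb = per_field_mb * clients / 1000

# One field shared by all clients.
# Assumption: the clients' unique values simply add up.
shared_mb = megabytes(facet_bits(documents, references, unique_per_client * clients))

print(f"one per-client field: {per_field_mb:.0f} MB")
print(f"{clients} per-client fields: {per_client_fields_gb:.1f} GB")
print(f"single shared field: {shared_mb:.0f} MB")
```

With these inputs a single per-client field lands a little above 110MB and 180 of them at roughly 20GB, while one shared field stays well under 200MB before any real-world multiplier is applied.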
Re: Solr using a ridiculous amount of memory
I did a search. I have no occurrence of UnInverted in the solr logs.

Another explanation for the large amount of memory presents itself if you use a single index: If each of your clients facets on at least one field specific to the client (client123_persons or something like that), then your memory usage goes through the roof.

This is exactly how we facet right now! I will definitely rewrite the relevant parts of our product to test this out before moving further down the docValues path. I will let you know as soon as I know one way or the other.

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
Might be obvious, but just in case - remember that you'll need to re-index your content once you've added docValues to your schema, in order to get the on-disk files to be created. Upayavira
Re: Solr using a ridiculous amount of memory
I apologize for the slow reply. Today has been killer. I will reply to everyone as soon as I get the time.

I am having difficulties understanding how docValues work. Should I only add docValues to the fields that I actually use for sorting and faceting, or to all fields? Will the docValues magic apply only to the fields I activate docValues on, or to the entire document when sorting/faceting on a field that has docValues activated?

I'm not even sure which question to ask. I am struggling to understand this on a conceptual level.

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Solr using a ridiculous amount of memory
Hello all,

We are running a solr cluster which is now running solr-4.2. The index is about 35GB on disk with each register between 15k and 30k. (This is simply the size of a full xml reply of one register. I'm not sure how to measure it otherwise.)

Our memory requirements are running amok. We have less than a quarter of our customers running now and even though we have allocated 25GB to the JVM already, we are still seeing daily OOM crashes. We used to just allocate more memory to the JVM, but with the way solr is scaling, we would need well over 100GB of memory on each node to finish the project, and that's just not going to happen. I need to lower the memory requirements somehow.

I can see from the memory dumps we've done that the field cache is by far the biggest sinner. Of special interest to me is the recent introduction of DocValues, which supposedly mitigates this issue by using memory outside the JVM. I just can't, because of lack of documentation, seem to make it work.

We do a lot of faceting. One client facets on about 50.000 docs of approx 30k each on 5 fields. I understand that this is VERY memory intensive.

Schema with DocValues attempt at solving problem: http://pastebin.com/Ne23NnW4
Config: http://pastebin.com/x1qykyXW

The cache is pretty well tuned. Any lower and I get evictions.

Come hell or high water, my JVM memory requirements must come down. Simply moving some memory load outside of the JVM would be awesome! Making it not use the field cache for anything would also (probably) work for me. I thought about killing off my other caches, but from the dumps, they just don't seem to use that much memory.

I am at my wit's end. Any help would be sorely appreciated.

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
Re: Solr using a ridiculous amount of memory
Just to get started, do you hit OOM quickly with a few expensive queries, or is it after a number of hours and lots of queries? Does Java heap usage seem to be growing linearly as queries come in, or are there big spikes? How complex/rich are your queries (e.g., how many terms, wildcards, faceted fields, sorting, etc.)?

As a baseline experiment, start a Solr server, see how much Java heap is used/available. Then do a couple of typical queries, and check the heap size again. Then do a couple more similar but different (to avoid query cache matches), and check the heap again. Maybe do that a few times to get a handle on the baseline memory required and whether there might be a leak of some sort. Do enough queries to hit all of the fields, facets, sorting, etc. that are likely to be encountered in one of your typical days that hits OOM - just not the volume of queries.

The goal is to determine if there is something inherently memory intensive in your index/queries, or something relating to a leak based on total query volume.

-- Jack Krupansky
Re: Solr using a ridiculous amount of memory
On Sun, Mar 24, 2013 at 4:19 AM, John Nielsen j...@mcb.dk wrote:

Schema with DocValues attempt at solving problem: http://pastebin.com/Ne23NnW4 Config: http://pastebin.com/x1qykyXW

This schema isn't using docvalues, due to a typo in your config. It should not be DocValues=true but docValues=true. Are you not getting an error? Solr needs to throw an exception if you provide invalid attributes to the field. Nothing is more frustrating than having a typo or something in your configuration and Solr just ignores this, reports no error, and doesn't work the way you want. I'll look into this (I already intend to add these checks to analysis factories for the same reason).

Separately, if you really want the terms data and so on to remain on disk, it is not enough to just enable docvalues for the field. The default implementation uses the heap. So if you want that, you need to set docValuesFormat=Disk on the fieldtype. This will keep the majority of the data on disk, and only some key data structures in heap memory. This might have a significant performance impact depending upon what you are doing, so you need to test that.
RE: Solr using a ridiculous amount of memory
From: John Nielsen [j...@mcb.dk]:

The index is about 35GB on disk with each register between 15k and 30k. (This is simply the size of a full xml reply of one register. I'm not sure how to measure it otherwise.) Our memory requirements are running amok. We have less than a quarter of our customers running now and even though we have allocated 25GB to the JVM already, we are still seeing daily OOM crashes.

That does sound a bit peculiar. I do not understand what you mean by "register" though. How many documents does your index hold?

I can see from the memory dumps we've done that the field cache is by far the biggest sinner.

Do you sort on a lot of different fields?

We do a lot of faceting. One client facets on about 50.000 docs of approx 30k each on 5 fields. I understand that this is VERY memory intensive.

To get a rough approximation of memory usage, we need the total number of documents, the average number of values for each of the 5 fields for a document and the number of unique values in each of the 5 fields. The rule of thumb I use for a lower bound is

#documents*log2(#references) + #references*log2(#unique_values) bit

If your whole index has 10M documents, which each has 100 values for each field, with each field having 50M unique values, then the memory requirement would be more than 10M*log2(100*10M) + 100*10M*log2(50M) bit ~= 340MB/field ~= 1.6GB for faceting on all fields. Even when we multiply that with 4 to get a more real-world memory requirement, it is far from the 25GB that you are allocating. Either you have an interestingly high number somewhere in the equation or something's off.

Regards, Toke Eskildsen
RE: Solr using a ridiculous amount of memory
Toke Eskildsen [t...@statsbiblioteket.dk]: If your whole index has 10M documents, which each has 100 values for each field, with each field having 50M unique values, then the memory requirement would be more than 10M*log2(100*10M) + 100*10M*log2(50M) bit ~= 340MB/field ~= 1.6GB for faceting on all fields. Whoops. Missed a 0 when calculating. The case above would actually take more than 15GB, probably also more than the 25GB you have allocated. Anyway, I see now in your solrconfig that your main facet fields are cat, manu_exact, content_type and author_s, with the 5th being maybe price, popularity or manufacturedate_dt? cat seems like category (relatively few references, few uniques), content_type probably has a single value/item and again few uniques. No memory problem there, unless you have a lot of documents (100M-range). That leaves manu_exact and author_s. If those are freetext fields with item descriptions or similar, that might explain the OOM. Could you describe the facet fields in more detail and provide us with the total document count? Quick sanity check: If you are using a Linux server, could you please verify that your virtual memory is set to unlimited with 'ulimit -v'? Regards, Toke Eskildsen
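The rule of thumb is easy to mis-key by hand, so here is a minimal Python sketch that evaluates it for the 10M-document example. The function name `facet_lower_bound_bits` and the use of decimal gigabytes are assumptions made for illustration; the formula is the one quoted in the thread, and the result is a lower bound rather than a measurement of what Solr allocates.

```python
import math


def facet_lower_bound_bits(documents, references, unique_values):
    """Rule-of-thumb lower bound, in bits, for one field-cache facet field.

    documents * log2(references) + references * log2(unique_values)
    """
    return documents * math.log2(references) + references * math.log2(unique_values)


# Example from the thread: 10M documents, 100 values per document in each
# field (so 1000M references per field) and 50M unique values per field.
documents = 10_000_000
references = 100 * documents
unique_values = 50_000_000
fields = 5

bits_per_field = facet_lower_bound_bits(documents, references, unique_values)
gb_per_field = bits_per_field / 8 / 1e9  # decimal GB, an illustrative choice
gb_total = gb_per_field * fields

print(f"per field: {gb_per_field:.1f} GB, all {fields} fields: {gb_total:.1f} GB")
```

This gives a bit over 3GB per field and about 16GB for all five fields, which agrees with the corrected figure of more than 15GB.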
Re: Solr using a ridiculous amount of memory
A step I meant to include was that after you warm Solr with a representative collection of queries that references all of the fields, facets, sorting, etc. that your daily load will reference, check the Java heap size at that point, and then set your Java heap limit to a moderate level higher, like 256M, restart, and then see what happens. The theory is that if you have too much available heap, Java will gradually fill it all with garbage (no leaks implied, but maybe some leaks as well), and then a Java GC will be an expensive hit, and sometimes a rapid flow of incoming requests at that point can cause Java to freak out and even hit OOM even though a more graceful garbage collection would eventually free up tons of garbage. So, by only allowing for a moderate amount of garbage, more frequent GCs will be less intensive and less likely to cause weird situations. The other part of the theory is that it is usually better to leave tons of memory to the OS for efficiently caching files, rather than force Java to manage large amounts of memory, which it typically does not do so well. -- Jack Krupansky -Original Message- From: Jack Krupansky Sent: Sunday, March 24, 2013 2:00 PM To: solr-user@lucene.apache.org Subject: Re: Solr using a ridiculous amount of memory Just to get started, do you hit OOM quickly with a few expensive queries, or is it after a number of hours and lots of queries? Does Java heap usage seem to be growing linearly as queries come in, or are there big spikes? How complex/rich are your queries (e.g., how many terms, wildcards, faceted fields, sorting, etc.)? As a baseline experiment, start a Solr server, see how much Java heap is used/available. Then do a couple of typical queries, and check the heap size again. Then do a couple more similar but different (to avoid query cache matches), and check the heap again. Maybe do that a few times to get a handle on the baseline memory required and whether there might be a leak of some sort. 
Do enough queries to hit all of the fields, facets, sorting, etc. that are likely to be encountered in one of your typical days that hits OOM - just not the volume of queries. The goal is to determine if there is something inherently memory intensive in your index/queries, or something relating to a leak based on total query volume.

-- Jack Krupansky

-----Original Message-----
From: John Nielsen
Sent: Sunday, March 24, 2013 4:19 AM
To: solr-user@lucene.apache.org
Subject: Solr using a ridiculous amount of memory

Hello all,

We are running a solr cluster which is now running solr-4.2. The index is about 35GB on disk with each register between 15k and 30k. (This is simply the size of a full xml reply of one register. I'm not sure how to measure it otherwise.)

Our memory requirements are running amok. We have less than a quarter of our customers running now, and even though we have already allocated 25GB to the JVM, we are still seeing daily OOM crashes. We used to just allocate more memory to the JVM, but with the way solr is scaling, we would need well over 100GB of memory on each node to finish the project, and that's just not going to happen. I need to lower the memory requirements somehow.

I can see from the memory dumps we've done that the field cache is by far the biggest sinner. Of special interest to me is the recent introduction of DocValues, which supposedly mitigates this issue by using memory outside the JVM. I just can't, because of lack of documentation, seem to make it work.

We do a lot of faceting. One client facets on about 50,000 docs of approx 30k each on 5 fields. I understand that this is VERY memory intensive.

Schema with DocValues attempt at solving problem: http://pastebin.com/Ne23NnW4
Config: http://pastebin.com/x1qykyXW

The cache is pretty well tuned. Any lower and I get evictions.

Come hell or high water, my JVM memory requirements must come down. Simply moving some memory load outside of the JVM would be awesome!
Making it not use the field cache for anything would also (probably) work for me. I thought about killing off my other caches, but from the dumps, they just don't seem to use that much memory.

I am at my wits' end. Any help would be sorely appreciated.

-- Med venlig hilsen / Best regards *John Nielsen* Programmer *MCB A/S* Enghaven 15 DK-7500 Holstebro Kundeservice: +45 9610 2824 p...@mcb.dk www.mcb.dk
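
The heap-sizing step Jack describes (warm up, read the heap in use, then set the limit a moderate amount higher, like 256M) can be sketched in a few lines of Java. This is only an illustration under assumptions: the class and method names (HeapSizing, recommendedXmxMb, currentHeapUsedBytes) are made up for this sketch, and main reads the heap of the JVM it runs in as a stand-in. For a real Solr node you would take the same "used" figure from the Solr JVM itself after the warm-up queries, by whatever monitoring you already have.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Solr: turns a "heap used after warm-up"
// reading into a suggested -Xmx value with a moderate amount of headroom.
final class HeapSizing {
    static final long MB = 1024L * 1024L;

    // Rounds the used heap up to whole megabytes and adds the headroom.
    static long recommendedXmxMb(long usedBytesAfterWarmup, long headroomMb) {
        if (usedBytesAfterWarmup < 0 || headroomMb < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        long usedMb = (usedBytesAfterWarmup + MB - 1) / MB; // ceiling division
        return usedMb + headroomMb;
    }

    // Heap currently in use in *this* JVM (includes not-yet-collected garbage).
    static long currentHeapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    public static void main(String[] args) {
        long used = currentHeapUsedBytes();
        // 256 MB headroom is the example figure from the thread, not a rule.
        long xmxMb = recommendedXmxMb(used, 256);
        System.out.println("heap used after warm-up (MB): " + (used + MB - 1) / MB);
        System.out.println("suggested JVM flag: -Xmx" + xmxMb + "m");
    }
}
```

Note that the "used" reading includes garbage that has not been collected yet, so taking it shortly after a GC gives a better picture of what the warmed-up caches really need. Repeating the reading between batches of queries, as in the baseline experiment, also shows whether usage levels off or keeps climbing.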