Good point. I'd seen docValues and wondered whether they might be of use in this situation. However, as I understand it, until Solr 4.5 they require a value to be set for every document. Is that true or was I imagining reading that?
On 25 September 2013 11:36, Erick Erickson <erickerick...@gmail.com> wrote:

> Hmmmm, I confess I haven't had a chance to play with this yet,
> but have you considered docValues for some of your fields? See:
> http://wiki.apache.org/solr/DocValues
>
> And just to tantalize you:
>
> > Since Solr4.2 to build a forward index for a field, for purposes of
> > sorting, faceting, grouping, function queries, etc.
>
> > You can specify a different docValuesFormat on the fieldType
> > (docValuesFormat="Disk") to only load minimal data on the heap, keeping
> > other data structures on disk.
>
> Do note, though:
> > Not a huge improvement for a static index
>
> this latter isn't a problem though since you don't have a static index....
>
> Erick
>
> On Tue, Sep 24, 2013 at 4:13 AM, Neil Prosser <neil.pros...@gmail.com>
> wrote:
> > Shawn: unfortunately the current problems are with facet.method=enum!
> >
> > Erick: We already round our date queries so they're the same for at least
> > an hour so thankfully our fq entries will be reusable. However, I'll take a
> > look at reducing the cache and autowarming counts and see what the effect
> > on hit ratios and performance are.
> >
> > For SolrCloud our soft commit (openSearcher=false) interval is 15 seconds
> > and our hard commit is 15 minutes.
> >
> > You're right about those sorted fields having a lot of unique values. They
> > can be any number between 0 and 10,000,000 (it's sparsely populated across
> > the documents) and could appear in several variants across multiple
> > documents. This is probably a good area for seeing what we can bend with
> > regard to our requirements for sorting/boosting. I've just looked at two
> > shards and they've each got upwards of 1000 terms showing in the schema
> > browser for one (potentially out of 60) fields.
> >
> > On 21 September 2013 20:07, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> About caches. The queryResultCache is only useful when you expect there
> >> to be a number of _identical_ queries. Think of this cache as a map where
> >> the key is the query and the value is just a list of N document IDs
> >> (internal) where N is your window size. Paging is often the place where
> >> this is used. Take a look at your admin page for this cache, you can see
> >> the hit rates. But, the take-away is that this is a very small cache
> >> memory-wise, varying it is probably not a great predictor of memory usage.
> >>
> >> The filterCache is more intense memory wise, it's another map where the
> >> key is the fq clause and the value is bounded by maxDoc/8. Take a
> >> close look at this in the admin screen and see what the hit ratio is. It
> >> may be that you can make it much smaller and still get a lot of benefit.
> >> _Especially_ considering it could occupy about 44G of memory.
> >> (43,000,000 / 8) * 8192........ And the autowarm count is excessive in
> >> most cases from what I've seen. Cutting the autowarm down to, say, 16
> >> may not make a noticeable difference in your response time. And if
> >> you're using NOW in your fq clauses, it's almost totally useless, see:
> >> http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/
> >>
> >> Also, read Uwe's excellent blog about MMapDirectory here:
> >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >> for some problems with over-allocating memory to the JVM. Of course
> >> if you're hitting OOMs, well.....
> >>
> >> bq: order them by one of their fields.
> >> This is one place I'd look first. How many unique values are in each field
> >> that you sort on? This is one of the major memory consumers. You can
> >> get a sense of this by looking at admin/schema-browser and selecting
> >> the fields you sort on. There's a text box with the number of terms
> >> returned, then a / ### where ### is the total count of unique terms in
> >> the field. NOTE: in 4.4 this will be -1 for multiValued fields, but you
> >> shouldn't be sorting on those anyway. How many fields are you sorting on
> >> anyway, and of what types?
> >>
> >> For your SolrCloud experiments, what are your soft and hard commit
> >> intervals?
> >> Because something is really screwy here. Your sharding moving the
> >> number of docs down this low per shard should be fast. Back to the point
> >> above, the only good explanation I can come up with from this remove is
> >> that the fields you sort on have a LOT of unique values. It's possible that
> >> the total number of unique values isn't scaling with sharding. That is,
> >> each shard may have, say, 90% of all unique terms (number from thin air).
> >> Worth checking anyway, but a stretch.
> >>
> >> This is definitely unusual...
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Thu, Sep 19, 2013 at 8:20 AM, Neil Prosser <neil.pros...@gmail.com>
> >> wrote:
> >> > Apologies for the giant email. Hopefully it makes sense.
> >> >
> >> > We've been trying out SolrCloud to solve some scalability issues with our
> >> > current setup and have run into problems. I'd like to describe our current
> >> > setup, our queries and the sort of load we see and am hoping someone might
> >> > be able to spot the massive flaw in the way I've been trying to set things
> >> > up.
> >> >
> >> > We currently run Solr 4.0.0 in the old style Master/Slave replication. We
> >> > have five slaves, each running Centos with 96GB of RAM, 24 cores and with
> >> > 48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs) but
> >> > aren't slow either. Our GC parameters aren't particularly exciting, just
> >> > -XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.
> >> >
> >> > Our index size ranges between 144GB and 200GB (when we optimise it back
> >> > down, since we've had bad experiences with large cores). We've got just
> >> > over 37M documents some are smallish but most range between 1000-6000
> >> > bytes. We regularly update documents so large portions of the index will be
> >> > touched leading to a maxDocs value of around 43M.
> >> >
> >> > Query load ranges between 400req/s to 800req/s across the five slaves
> >> > throughout the day, increasing and decreasing gradually over a period of
> >> > hours, rather than bursting.
> >> >
> >> > Most of our documents have upwards of twenty fields. We use different
> >> > fields to store territory variant (we have around 30 territories) values
> >> > and also boost based on the values in some of these fields (integer ones).
> >> >
> >> > So an average query can do a range filter by two of the territory variant
> >> > fields, filter by a non-territory variant field. Facet by a field or two
> >> > (may be territory variant). Bring back the values of 60 fields. Boost query
> >> > on field values of a non-territory variant field. Boost by values of two
> >> > territory-variant fields. Dismax query on up to 20 fields (with boosts) and
> >> > phrase boost on those fields too. They're pretty big queries. We don't do
> >> > any index-time boosting. We try to keep things dynamic so we can alter our
> >> > boosts on-the-fly.
> >> >
> >> > Another common query is to list documents with a given set of IDs and
> >> > select documents with a common reference and order them by one of their
> >> > fields.
> >> >
> >> > Auto-commit every 30 minutes. Replication polls every 30 minutes.
> >> >
> >> > Document cache:
> >> > * initialSize - 32768
> >> > * size - 32768
> >> >
> >> > Filter cache:
> >> > * autowarmCount - 128
> >> > * initialSize - 8192
> >> > * size - 8192
> >> >
> >> > Query result cache:
> >> > * autowarmCount - 128
> >> > * initialSize - 8192
> >> > * size - 8192
> >> >
> >> > After a replicated core has finished downloading (probably while it's
> >> > warming) we see requests which usually take around 100ms taking over 5s. GC
> >> > logs show concurrent mode failure.
> >> >
> >> > I was wondering whether anyone can help with sizing the boxes required to
> >> > split this index down into shards for use with SolrCloud and roughly how
> >> > much memory we should be assigning to the JVM. Everything I've read
> >> > suggests that running with a 48GB heap is way too high but every attempt
> >> > I've made to reduce the cache sizes seems to wind up causing out-of-memory
> >> > problems. Even dropping all cache sizes by 50% and reducing the heap by 50%
> >> > caused problems.
> >> >
> >> > I've already tried using SolrCloud 10 shards (around 3.7M documents per
> >> > shard, each with one replica) and kept the cache sizes low:
> >> >
> >> > Document cache:
> >> > * initialSize - 1024
> >> > * size - 1024
> >> >
> >> > Filter cache:
> >> > * autowarmCount - 128
> >> > * initialSize - 512
> >> > * size - 512
> >> >
> >> > Query result cache:
> >> > * autowarmCount - 32
> >> > * initialSize - 128
> >> > * size - 128
> >> >
> >> > Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
> >> > memory) and four shards on two boxes and three on the rest I still see
> >> > concurrent mode failure. This looks like it's causing ZooKeeper to mark the
> >> > node as down and things begin to struggle.
> >> >
> >> > Is concurrent mode failure just something that will inevitably happen or is
> >> > it avoidable by dropping the CMSInitiatingOccupancyFraction?
> >> >
> >> > If anyone has anything that might shove me in the right direction I'd be
> >> > very grateful. I'm wondering whether our set-up will just never work and
> >> > maybe we're expecting too much.
> >> >
> >> > Many thanks,
> >> >
> >> > Neil
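
For reference, the "about 44G" filterCache figure in the quoted thread can be reproduced with a rough sketch. It assumes the worst case, where every cached filter is a full bitset of maxDoc bits (maxDoc/8 bytes per entry) and the cache is completely full. The function name is hypothetical, and the per-shard line uses the 3.7M document count as a stand-in because the per-shard maxDoc isn't given in the thread.

```python
def filter_cache_upper_bound_bytes(max_doc: int, cache_size: int) -> int:
    """Worst-case filterCache footprint: every entry is a bitset of max_doc bits.

    Hypothetical helper for illustration; not part of Solr.
    """
    bytes_per_entry = max_doc // 8  # one bit per document, eight bits per byte
    return bytes_per_entry * cache_size


# Master/slave setup from the thread: maxDoc ~43M, filterCache size 8192.
single_core = filter_cache_upper_bound_bytes(43_000_000, 8192)

# SolrCloud trial: ~3.7M docs per shard, filterCache size 512.
# Assumption: the document count stands in for the (unknown) per-shard maxDoc.
per_shard = filter_cache_upper_bound_bytes(3_700_000, 512)

print(f"single core: {single_core / 1e9:.1f} GB")
print(f"per shard:   {per_shard / 1e9:.2f} GB")
```

Under these assumptions the bound is roughly 44 GB for the single large core and roughly 0.24 GB per shard with the smaller cache. Real usage is likely lower, since the cache may not be full and filters that match few documents can be stored more compactly than a full bitset.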