bq: This seems like it might even be a good approach for creating
additional cores primarily for the purpose of caching

I think you're making it too complex, especially for such a small
data set ;)

1> All the data is memory mapped anyway, so what's not in the JVM
will be in the OS's memory eventually (assuming you have enough
physical memory). See:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If you don't have enough physical memory for that to happen, adding
another core won't help.

2> You can set your documentCache in solrconfig.xml high enough that
it'll cache all your documents _uncompressed_ (memory permitting),
and it's a two-minute change to your solrconfig.xml file.
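
For instance, a sketch of that entry (sizes are illustrative, not
tuned for your setup; roughly one entry per doc you want held
uncompressed):

<documentCache class="solr.LRUCache"
               size="50000"
               initialSize="50000"
               autowarmCount="0"/>

Leave autowarmCount at 0; the documentCache is keyed by internal
Lucene doc IDs, which change between searchers, so it can't be
usefully autowarmed.
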
3> My challenge is always to measure before you code. My intuition
is that if you quantify the potential gains of going to more complex
caching, they'll be insignificant; not worth the development time.
Can't argue with measurements though.

FWIW,
Erick

On Mon, Nov 21, 2016 at 11:56 PM, Aristedes Maniatis
<a...@maniatis.org> wrote:
> Thanks Erick
>
> Very helpful indeed.
>
> Your guesses on data size are about right. There might only be
> 50,000 items in the whole index. And typically we'd fetch a batch
> of 10. Disk is cheap and this really isn't taking much room anyway.
> For such a tiny data set, it seems like this approach will work
> well.
>
> This seems like it might even be a good approach for creating
> additional cores primarily for the purpose of caching: that is, a
> core full of records that are only ever queried by some unique key.
> I wouldn't want to abuse Solr for a purpose it wasn't designed for,
> but since it is already there it appears to be a useful approach.
> Rather than getting some data from the db, we fetch it from Solr
> pre-assembled.
>
> Thanks
> Ari
>
> On 22/11/16 3:28am, Erick Erickson wrote:
>> Searching isn't really going to be impacted much, if at all.
>> You're essentially talking about setting some field with
>> stored="true" and stuffing the HTML into that, right? It will
>> probably have indexed="false" and docValues="false".
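>>
>> In the schema that's something like this, for instance (the field
>> name and type are just an example):
>>
>> <field name="rendered_html" type="string"
>>        indexed="false" stored="true" docValues="false"/>
>>
>> Then fl=id,rendered_html on your queries (or real-time get,
>> /get?id=...) returns the stored HTML, and the field never
>> participates in searching or scoring.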
>>
>> So... what that means is that very early in the indexing process,
>> the raw data is dumped to the segment's *.fdt and *.fdx files.
>> These are totally irrelevant for querying; they aren't even read
>> from disk to score the docs. So let's say your numFound = 10,000
>> and rows=10. Those 10,000 docs are scored without having to look
>> at the stored data at all. Now, when the 10 docs are assembled for
>> return, the stored data is read off disk, decompressed and
>> returned.
>>
>> So the additional cost will be:
>> 1> your index is larger on disk
>> 2> merging etc. will be a bit more costly. This doesn't seem like
>> a problem if your index doesn't change all that often.
>> 3> there will be some additional load to decompress the data and
>> return it.
>>
>> This is a perfectly reasonable approach. My guess is that any
>> difference in search speed will be lost in the noise of measuring,
>> and that the additional load of decompressing will be more than
>> offset by not having to make a separate service call to actually
>> get the doc, but as always, measuring the performance is the proof
>> you need.
>>
>> You haven't indicated how _many_ docs you have in your corpus, but
>> a rough indication of the additional disk space is about half the
>> raw HTML size; we've usually seen about a 2:1 compression ratio.
>> With a zillion docs that could be sizeable, but disk space is
>> cheap.
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
>> <amania...@apache.org> wrote:
>>> I'm familiar enough, after 7-8 years of Solr usage, with how it
>>> performs as a full-text search index, including spatial
>>> coordinates and much more. But for the most part, we've been
>>> returning database ids from Solr rather than a full record ready
>>> to display. We then grab the data and related records from the
>>> database in the usual way and display it.
>>>
>>> We are now thinking about improving the performance of our app.
>>> One option is Redis, to store HTML pieces for reuse rather than
>>> assembling the HTML from dozens of queries to the database. We've
>>> done what we can with caching at the ORM level, and we can't do
>>> too much with Varnish because of differences in page rendering
>>> per user (e.g. shopping baskets).
>>>
>>> But we are thinking about storing the rendered HTML directly in
>>> Solr. The downsides appear to be:
>>>
>>> * adding 2-10kB of HTML to each record and the performance hit
>>> this might have on searching and retrieving
>>> * additional load of ensuring we rebuild Solr's data every time
>>> some part of that HTML changes (but this is minimal in our use
>>> case)
>>> * additional cores that we'll want to add to cache other data
>>> that isn't yet in Solr
>>>
>>> Is this a reasonable approach to avoid running yet another
>>> cluster of services? Are there downsides to this I haven't
>>> thought of? How does Solr scale with record size?
>>>
>>> Cheers
>>> Ari
>>>
>>> --
>>> -------------------------->
>>> Aristedes Maniatis
>>> GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
>
> --
> -------------------------->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A