bq: This seems like it might even be a good approach for creating
additional cores primarily for the purpose of caching

I think you're making it too complex, especially for such a small data set ;)

1> All the data is memory-mapped anyway, so what's not in the JVM heap
will end up in the OS's page cache eventually (assuming you have enough
physical memory). See:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
If you don't have enough physical memory for that to happen, adding
another core won't help.
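(On a 64-bit JVM, Solr should use a memory-mapped directory by default.
If you wanted to force it explicitly, a minimal sketch in solrconfig.xml
would be something like:

    <directoryFactory name="DirectoryFactory"
                      class="solr.MMapDirectoryFactory"/>

but check your version's defaults before touching this.)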

2> You can set the documentCache in solrconfig.xml high enough that it
caches all your documents _uncompressed_, memory permitting. That's a
two-minute change to your solrconfig.xml file.
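For illustration only, sized for the ~50,000 docs you mention below, the
relevant bit of solrconfig.xml might look something like:

    <documentCache class="solr.LRUCache"
                   size="50000"
                   initialSize="50000"
                   autowarmCount="0"/>

(autowarmCount stays at 0 because the documentCache stores internal
Lucene doc ids and can't be autowarmed across commits anyway.)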

3> My challenge is always: measure before you code. My intuition is that
if you quantify the potential gains of more complex caching, they'll be
insignificant and not worth the development time. Can't argue with
measurements, though.

FWIW,
Erick

On Mon, Nov 21, 2016 at 11:56 PM, Aristedes Maniatis <a...@maniatis.org> wrote:
> Thanks Erick
>
> Very helpful indeed.
>
> Your guesses on data size are about right. There might only be 50,000 items 
> in the whole index. And typically we'd fetch a batch of 10. Disk is cheap and 
> this really isn't taking much room anyway. For such a tiny data set, it seems 
> like this approach will work well.
>
>
> This seems like it might even be a good approach for creating additional
> cores primarily for the purpose of caching: that is, a core full of records
> that are only ever queried by some unique key. I wouldn't want to abuse Solr
> for a purpose it wasn't designed for, but since it is already there it
> appears to be a useful approach. Rather than getting some data from the db,
> we fetch it from Solr pre-assembled.
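>
> (For example, I imagine the lookup could be as trivial as something like
> q=sku:ABC-123&rows=1&fl=rendered_html (field names made up, of course),
> or Solr's real-time get handler keyed on the uniqueKey field.)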
>
> Thanks
> Ari
>
>
>
> On 22/11/16 3:28am, Erick Erickson wrote:
>> Searching isn't really going to be impacted much, if at all. You're
>> essentially talking about setting some field with stored="true" and
>> stuffing the HTML into that, right? It will probably have indexed="false"
>> and docValues="false".
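>>
>> Something like this in schema.xml would do it (field name is just an
>> example, not from your mail):
>>
>>     <field name="rendered_html" type="string" indexed="false"
>>            stored="true" docValues="false"/>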
>>
>> So... what that means is that very early in the indexing process, the
>> raw data is written to the *.fdt and *.fdx files for the segment. These
>> are totally irrelevant for querying; they aren't even read from disk to
>> score the docs. So let's say your numFound = 10,000 and rows=10. Those
>> 10,000 docs are scored without having to look at the stored data at all.
>> Then, when the 10 docs are assembled for return, the stored data is read
>> off disk, decompressed, and returned.
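>>
>> In other words, for a request like (core and field names hypothetical):
>>
>>     /solr/mycore/select?q=title:foo&rows=10&fl=id,rendered_html
>>
>> Solr scores all 10,000 hits but only reads and decompresses the stored
>> data for the 10 docs it returns.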
>>
>> So the additional cost will be
>> 1> your index is larger on disk
>> 2> merging etc. will be a bit more costly. This doesn't
>>      seem like a problem if your index doesn't change all
>>      that often.
>> 3> there will be some additional load to decompress the data
>>      and return it.
>>
>> This is a perfectly reasonable approach. My guess is that any difference
>> in search speed will be lost in the measurement noise, and that the
>> additional load of decompressing will be more than offset by not having
>> to make a separate service call to actually get the doc. But as always,
>> measuring the performance is the proof you need.
>>
>> You haven't indicated how _many_ docs you have in your corpus, but a
>> rough indication of the additional disk space is about half the raw HTML
>> size; we've usually seen about a 2:1 compression ratio. With a zillion
>> docs that could be sizeable, but disk space is cheap.
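>>
>> As a back-of-the-envelope example using the 2-10kB figure from your
>> mail: at ~5kB average, a million docs would add ~5GB of raw HTML, so
>> roughly 2.5GB of extra *.fdt data at a 2:1 ratio.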
>>
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
>> <amania...@apache.org> wrote:
>>> After 7-8 years of using Solr, I'm familiar enough with how it performs
>>> as a full-text search index, including spatial coordinates and much more.
>>> But for the most part, we've been returning database ids from Solr rather
>>> than a full record ready to display. We then grab the data and related
>>> records from the database in the usual way and display it.
>>>
>>> We are now thinking about improving the performance of our app. One
>>> option is Redis, to store HTML fragments for reuse rather than assembling
>>> the HTML from dozens of database queries. We've done what we can with
>>> caching at the ORM level, and we can't do much with Varnish because page
>>> rendering differs per user (eg shopping baskets).
>>>
>>> But we are thinking about storing the rendered html directly in Solr. The 
>>> downsides appear to be:
>>>
>>> * adding 2-10kB of html to each record and the performance hit this might 
>>> have on searching and retrieving
>>> * additional load of ensuring we rebuild Solr's data every time some part 
>>> of that html changes (but this is minimal in our use case)
>>> * additional cores that we'll want to add to cache other data that isn't 
>>> yet in Solr
>>>
>>> Is this a reasonable approach to avoid running yet another cluster of 
>>> services? Are there downsides to this I haven't thought of? How does Solr 
>>> scale with record size?
>>>
>>>
>>>
>>> Cheers
>>> Ari
>>>
>>>
>>>
>>>
>>> --
>>> -------------------------->
>>> Aristedes Maniatis
>>> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
>
>
> --
> -------------------------->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
