> With very large heaps and a GC that can handle them (perhaps the G1 GC),
> another option which might be worth experimenting with is a key-value (KV)
> cache independent of the block cache which could be enabled on a per-table
> basis

Thanks Andy for bringing this up. We had some discussions a while ago about
a row cache (or KV cache):
http://search-hadoop.com/m/XTlxT1xRtYw/hbase+key+value+cache+from%253Aenis&subj=RE+keyvalue+cache
The takeaway was that if you are mostly doing point gets, rather than scans,
this cache might be better.

> 1) [HBASE-7404]: L1/L2 block cache

I knew about the bucket cache, but not that the bucket cache could hold
compressed blocks. Is that the case, or are you suggesting we can add that
to this L2 cache?

> 2) [HBASE-5263] Preserving cached data on compactions through
> cache-on-write

Thanks, this is the same idea. I'll track the ticket.

Enis

On Mon, Mar 25, 2013 at 12:18 PM, Liyin Tang <[email protected]> wrote:

> Hi Enis,
> Good ideas! And the hbase community is driving these 2 items:
> 1) [HBASE-7404]: L1/L2 block cache
> 2) [HBASE-5263] Preserving cached data on compactions through
> cache-on-write
>
> Thanks a lot
> Liyin
> ________________________________________
> From: Enis Söztutar [[email protected]]
> Sent: Monday, March 25, 2013 11:24 AM
> To: hbase-user
> Cc: lars hofhansl
> Subject: Re: Does HBase RegionServer benefit from OS Page Cache
>
> Thanks Liyin for sharing your use cases.
>
> Related to those, I was thinking of two improvements:
> - AFAIK, MySQL keeps the compressed and uncompressed versions of the
> blocks in its block cache, falling back to the compressed one if the
> decompressed one gets evicted. With very large heaps, maybe keeping the
> compressed blocks around in a secondary cache makes sense?
> - A compaction will trash the cache. But maybe we can track the keyvalues
> inside cached blocks for the files in the compaction, and mark the blocks
> of the resulting compacted file which contain previously cached keyvalues
> to be cached after the compaction. I have to research the feasibility of
> this approach.
>
> Enis
>
>
> On Sun, Mar 24, 2013 at 10:15 PM, Liyin Tang <[email protected]> wrote:
>
> > The block cache is for uncompressed data while the OS page cache
> > contains the compressed data. Unless the request pattern is a full-table
> > sequential scan, the block cache is still quite useful.
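[Editor's aside: to make the MySQL-style two-level cache Enis describes
above concrete, here is a minimal sketch. The class and method names are
hypothetical, not HBase code: L1 holds uncompressed blocks with LRU
eviction, and evicted blocks fall back to L2 in compressed form, so a later
hit costs a decompression instead of a disk read.]

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch of a two-level block cache: L1 = uncompressed blocks
// (LRU); blocks evicted from L1 are demoted to L2 in compressed form.
class TwoLevelBlockCache {
    private final Map<String, byte[]> l2 = new LinkedHashMap<>(); // compressed
    private final LinkedHashMap<String, byte[]> l1;               // uncompressed

    TwoLevelBlockCache(final int l1MaxBlocks) {
        // access-order LinkedHashMap gives us LRU eviction for free
        l1 = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                if (size() > l1MaxBlocks) {
                    l2.put(e.getKey(), compress(e.getValue())); // demote to L2
                    return true;
                }
                return false;
            }
        };
    }

    void cacheBlock(String key, byte[] block) {
        l1.put(key, block);
    }

    /** Returns the block, promoting from L2 on an L1 miss; null if absent. */
    byte[] getBlock(String key) {
        byte[] block = l1.get(key);
        if (block == null) {
            byte[] compressed = l2.remove(key);
            if (compressed != null) {
                block = decompress(compressed);
                l1.put(key, block); // promote back to L1
            }
        }
        return block;
    }

    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] data) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
            inflater.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A real implementation would also bound L2 and account for on-heap vs
off-heap placement, which is what the HBASE-7404 bucket cache work is about.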
> > I think the size of the block cache should be the amount of hot data we
> > want to retain within a compaction cycle, which is quite hard to
> > estimate in some use cases.
> >
> > Thanks a lot
> > Liyin
> > ________________________________________
> > From: lars hofhansl [[email protected]]
> > Sent: Saturday, March 23, 2013 10:20 PM
> > To: [email protected]
> > Subject: Re: Does HBase RegionServer benefit from OS Page Cache
> >
> > Interesting.
> >
> > > 2) The blocks in the block cache will be naturally invalidated quickly
> > > after the compactions.
> >
> > Should one keep the block cache small in order to increase the OS page
> > cache?
> >
> > Does your data suggest we should not use the block cache at all?
> >
> > Thanks.
> >
> > -- Lars
> >
> > ________________________________
> > From: Liyin Tang <[email protected]>
> > To: [email protected]
> > Sent: Saturday, March 23, 2013 9:44 PM
> > Subject: Re: Does HBase RegionServer benefit from OS Page Cache
> >
> > We (Facebook) are closely monitoring the OS page cache hit ratio in our
> > production environments. My experience is that if your data access
> > pattern is very random, the OS page cache won't help you much even
> > though the data locality is very high. On the other hand, if the
> > requests are always against recent data points, then the page cache hit
> > ratio can be much higher.
> >
> > Actually, there are lots of optimizations that could be done in HDFS.
> > For example, we are working on fadvise-ing the 2nd/3rd replica data away
> > from the OS page cache, which could potentially improve your OS page
> > cache capacity by 3X. Also, by taking advantage of tier-based compaction
> > plus fadvise in HDFS, the region server could keep more hot data in the
> > OS page cache based on the read access pattern.
> >
> > Another separate point is that we probably should NOT rely on the
> > memstore/block cache to keep hot data.
> > 1) The more data in the memstore, the more data the region server needs
> > to recover after a server failure, so the tradeoff is recovery time.
> > 2) The blocks in the block cache will be naturally invalidated quickly
> > after the compactions. So the region server probably won't benefit from
> > a large JVM heap at all.
> >
> > Thanks a lot
> > Liyin
> >
> > On Sat, Mar 23, 2013 at 6:13 PM, Ted Yu <[email protected]> wrote:
> >
> > > Coming up is the following enhancement which would make MSLAB even
> > > better:
> > >
> > > HBASE-8163 MemStoreChunkPool: An improvement for JAVA GC when using
> > > MSLAB
> > >
> > > FYI
> > >
> > > On Sat, Mar 23, 2013 at 5:31 PM, Pankaj Gupta <[email protected]>
> > > wrote:
> > >
> > > > Thanks a lot for the explanation. It's good to know that MSLAB is
> > > > stable and safe to enable (we don't have it enabled right now; we're
> > > > using 0.92). This would allow us to more freely allocate memory to
> > > > HBase. I really enjoyed the depth of explanation from both Enis and
> > > > J-D. I was indeed mistakenly referring to the HFile as HLog;
> > > > fortunately you were still able to understand my question.
> > > >
> > > > Thanks,
> > > > Pankaj
> > > >
> > > > On Mar 21, 2013, at 1:28 PM, Enis Söztutar <[email protected]>
> > > > wrote:
> > > >
> > > > > I think the page cache is not totally useless, but as long as you
> > > > > can control the GC, you should prefer the block cache. Some of the
> > > > > reasons off the top of my head:
> > > > > - In case of a cache hit in the OS cache, you have to go through
> > > > > the DN layer (an RPC if short-circuit reads (ssr) are disabled),
> > > > > do a kernel jump, and read using the read() libc call, whereas for
> > > > > reading a block from the block cache, only the HBase process is
> > > > > involved. There is no process switch and no kernel jump.
> > > > > - The read access path is optimized per hfile block.
> > > > > FS page boundaries and hfile block boundaries are not aligned at
> > > > > all.
> > > > > - There is very little control over the page cache to cache or not
> > > > > cache based on expected access patterns. For example, we can mark
> > > > > META region blocks, some column families, and hfile index blocks
> > > > > as always cached or cached with high priority. Also, for full
> > > > > table scans, we can explicitly disable block caching so as not to
> > > > > trash the current working set. With the OS page cache, you do not
> > > > > have this control.
> > > > >
> > > > > Enis
> > > > >
> > > > >
> > > > > On Wed, Mar 20, 2013 at 10:30 AM, Jean-Daniel Cryans
> > > > > <[email protected]> wrote:
> > > > >
> > > > >> First, MSLAB has been enabled by default since 0.92.0, as it was
> > > > >> deemed stable enough. So, unless you are on 0.90, you are already
> > > > >> using it.
> > > > >>
> > > > >> Also, I'm not sure why you are referencing the HLog in your first
> > > > >> paragraph in the context of reading from disk, because the HLogs
> > > > >> are rarely read (only on recovery). Maybe you meant HFile?
> > > > >>
> > > > >> In any case, your email covers most arguments except for one:
> > > > >> checksumming. Retrieving a block from HDFS, even when using short
> > > > >> circuit reads to go directly to the OS instead of passing through
> > > > >> the DN, will take quite a bit more time than reading directly
> > > > >> from the block cache. This is why, even if you disable block
> > > > >> caching on a family, the index and root blocks will still be
> > > > >> block cached, as reading those very hot blocks from disk would
> > > > >> take way too long.
> > > > >>
> > > > >> Regarding your main question (how does the OS buffer help?), I
> > > > >> don't have a good answer. It kind of depends on the amount of RAM
> > > > >> you have and what your workload is like.
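[Editor's aside: Enis's point above about priority-based caching can be
sketched in a few lines. This is a toy illustration with hypothetical names,
not HBase's actual LruBlockCache: blocks carry a priority, and eviction
drains low-priority (e.g. scan) blocks before it touches high-priority ones
(e.g. index or META blocks).]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy priority-aware block cache (hypothetical, not HBase code): eviction
// removes LOW-priority blocks first, so index/META blocks stay resident.
class PriorityBlockCache {
    enum Priority { LOW, HIGH }

    private final int maxBlocks;
    private final Map<String, byte[]> blocks = new HashMap<>();
    private final Deque<String> lowFifo = new ArrayDeque<>();
    private final Deque<String> highFifo = new ArrayDeque<>();

    PriorityBlockCache(int maxBlocks) { this.maxBlocks = maxBlocks; }

    void cacheBlock(String key, byte[] block, Priority p) {
        blocks.put(key, block);
        (p == Priority.LOW ? lowFifo : highFifo).addLast(key);
        evictIfNeeded();
    }

    byte[] getBlock(String key) { return blocks.get(key); }

    private void evictIfNeeded() {
        while (blocks.size() > maxBlocks) {
            // Evict low-priority blocks first; only then touch high priority.
            String victim = !lowFifo.isEmpty() ? lowFifo.pollFirst()
                                               : highFifo.pollFirst();
            blocks.remove(victim);
        }
    }
}
```

The real LruBlockCache is more elaborate (it sizes single-access,
multi-access, and in-memory partitions), but the eviction-by-priority idea
is the same.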
> > > > >> As a data point, I've been successfully running with 24GB of
> > > > >> heap (50% dedicated to the block cache) with a workload
> > > > >> consisting mainly of small writes, short scans, and a typical
> > > > >> random read distribution for a website. I can't remember the last
> > > > >> time I saw a full GC, and it's been running for more than a year
> > > > >> like this.
> > > > >>
> > > > >> Hope this somehow helps,
> > > > >>
> > > > >> J-D
> > > > >>
> > > > >> On Wed, Mar 20, 2013 at 12:34 AM, Pankaj Gupta
> > > > >> <[email protected]> wrote:
> > > > >>> Given that HBase has its own cache (block cache and bloom
> > > > >>> filters) and that all the table data is stored in HDFS, I'm
> > > > >>> wondering if HBase benefits from the OS page cache at all. In
> > > > >>> the setup I'm using, HBase region servers run on the same boxes
> > > > >>> as the HDFS data nodes. In such a scenario, if the underlying
> > > > >>> HLog files live on the same machine, then having a healthy
> > > > >>> memory surplus may mean that the data node can serve the
> > > > >>> underlying file from page cache, thus improving HBase
> > > > >>> performance. Is this really the case? (I guess the page cache
> > > > >>> should also help in the case where the HLog file lives on a
> > > > >>> different machine, but in that case network I/O will probably
> > > > >>> drown out the speedup achieved by not hitting the disk.)
> > > > >>>
> > > > >>> I'm asking because if the page cache were useful, then an HBase
> > > > >>> setup not utilizing all the memory on the machine for the region
> > > > >>> server may not be that bad. The reason one would not want to use
> > > > >>> all the memory for the region server is the long garbage
> > > > >>> collection pauses that a large heap size may induce.
> > > > >>> I understand that work has been done to fix the long pauses
> > > > >>> caused by memory fragmentation in the old generation under the
> > > > >>> mostly-concurrent garbage collector, by using a slab cache
> > > > >>> allocator for the memstore, but that feature is marked
> > > > >>> experimental and we're not ready to take risks yet. So if the
> > > > >>> page cache were useful in any way on region servers, we could go
> > > > >>> with less memory for the RegionServer process, with the
> > > > >>> understanding that free memory on the machine is not completely
> > > > >>> going to waste. Thus my curiosity about the utility of the OS
> > > > >>> page cache to the performance of HBase.
> > > > >>>
> > > > >>> Thanks in advance,
> > > > >>> Pankaj
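[Editor's aside: Liyin's memstore-size vs recovery-time tradeoff earlier in
the thread lends itself to a back-of-envelope calculation. The WAL replay
throughput below is an illustrative assumption, not a measured number.]

```java
// Back-of-envelope: the more memstore data a region server carries, the
// more WAL data must be replayed after a crash. The replay rate here
// (~50 MB/s) is an assumption for illustration only.
class RecoveryEstimate {
    /** Seconds to replay memstoreBytes of WAL at replayBytesPerSec. */
    static double recoverySeconds(long memstoreBytes, long replayBytesPerSec) {
        return (double) memstoreBytes / replayBytesPerSec;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long replayRate = 50L * 1024 * 1024; // assumed ~50 MB/s WAL replay
        // 4 GB of memstore takes ~82 s to replay; 16 GB takes ~328 s.
        System.out.printf("4 GB  -> %.0f s%n",
                recoverySeconds(4 * gb, replayRate));
        System.out.printf("16 GB -> %.0f s%n",
                recoverySeconds(16 * gb, replayRate));
    }
}
```

So quadrupling the memstore budget quadruples worst-case recovery time,
which is exactly the tradeoff Liyin describes.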
