[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2014-03-04 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13920599#comment-13920599
 ] 

ramkrishna.s.vasudevan edited comment on HBASE-10191 at 3/5/14 7:53 AM:


bq. Would be sweet if the value at least was not on heap
Yes, this could be a nice one. So I think the usage of Cell should be in place 
before we do this.
{Got added by mistake.}


was (Author: ram_krish):
bq. Would be sweet if the value at least was not on heap
Yes, this could be a nice one.

 Move large arena storage off heap
 ---------------------------------

 Key: HBASE-10191
 URL: https://issues.apache.org/jira/browse/HBASE-10191
 Project: HBase
  Issue Type: Umbrella
Reporter: Andrew Purtell

 Even with the improved G1 GC in Java 7, Java processes that want to address large regions of memory while also providing low high-percentile latencies continue to be challenged. Fundamentally, a Java server process that has high data throughput and also tight latency SLAs will be stymied by the fact that the JVM does not provide a fully concurrent collector. There is simply not enough throughput to copy data during GC under safepoint (all application threads suspended) within the available time bounds. This is increasingly an issue for HBase users operating under dual pressures: 1. tight response SLAs, and 2. the increasing amount of RAM available in commodity server configurations, because GC load is roughly proportional to heap size.

 We can address this using parallel strategies. We should talk with the Java platform developer community about the possibility of a fully concurrent collector appearing in OpenJDK somehow. Setting aside the question of whether this is too little too late, if one becomes available the benefit will be immediate, though subject to qualification for production, and transparent in terms of code changes. However, in the meantime we need an answer for Java versions already in production. This requires that we move the large arena allocations off heap, those being the blockcache and memstore. On other JIRAs there has recently been related discussion about combining the blockcache and memstore (HBASE-9399) and about flushing memstore into blockcache (HBASE-5311). We should build off-heap allocation for memstore and blockcache, perhaps a unified pool for both, and plumb zero-copy direct access to these allocations (via direct buffers) through the read and write I/O paths. This may require the construction of classes that provide object views over data contained within direct buffers. This is something else we could talk with the Java platform developer community about: it could be possible to provide language-level object views over off-heap memory, where on-heap objects could hold references to objects backed by off-heap memory but not vice versa, perhaps facilitated by new intrinsics in Unsafe. Again, we need an answer for today as well. We should investigate what existing libraries may be available in this regard. The key will be avoiding marshalling/unmarshalling costs. At most we should be copying primitives out of the direct buffers to register or stack locations until finally copying data to construct protobuf Messages. A related issue there is HBASE-9794, which proposes scatter-gather access to KeyValues when constructing RPC messages. We should see how far we can get with that, and also with zero-copy construction of protobuf Messages backed by direct buffer allocations. Some amount of native code may be required.
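
 To make the "object views over data contained within direct buffers" idea concrete, here is a minimal hypothetical sketch (the class name and the [keyLen][valueLen][key][value] cell layout are placeholders, not an existing HBase API). Only primitives are copied out to stack locations; the key and value bytes stay off heap until something like a protobuf Message finally needs them.

{code:java}
import java.nio.ByteBuffer;

/**
 * Hypothetical zero-copy view over a cell serialized into a direct buffer.
 * Assumed layout: [keyLength:int][valueLength:int][key bytes][value bytes].
 */
public final class DirectCellView {
  private final ByteBuffer buf;   // direct buffer holding the cellblock
  private final int offset;       // start of this cell within the buffer

  public DirectCellView(ByteBuffer buf, int offset) {
    this.buf = buf;
    this.offset = offset;
  }

  // Only primitives cross onto the stack here; no byte[] is materialized.
  public int keyLength()   { return buf.getInt(offset); }
  public int valueLength() { return buf.getInt(offset + 4); }
  public int keyOffset()   { return offset + 8; }
  public int valueOffset() { return keyOffset() + keyLength(); }

  /** Copy the value out only at the last possible moment, e.g. to build a protobuf Message. */
  public byte[] copyValue() {
    byte[] dst = new byte[valueLength()];
    ByteBuffer dup = buf.duplicate();   // duplicate so shared position/limit are not disturbed
    dup.position(valueOffset());
    dup.get(dst);
    return dst;
  }
}
{code}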



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2014-03-02 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917703#comment-13917703
 ] 

Andrew Purtell edited comment on HBASE-10191 at 3/3/14 2:56 AM:


bq. The problem is that if you have hundreds of 1MB in-memory HFiles, then it 
becomes too expensive to merge them all (via KVHeap) when scanning. A possible 
solution is to subdivide the memstore into stripes (probably smaller than the 
stripe compaction stripes) and periodically compact the in-memory stripes

Anoop, Ram, and I were throwing around ideas of making in-memory HFiles out of 
memstore snapshots, and then doing in-memory compaction over them. If we have 
off-heap backing for memstore we could potentially carry larger snapshots 
(in-memory HFiles resulting from a few merged memstore snapshots), leading to less 
frequent flushes and significantly less write amplification overall. 
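
Purely for illustration, a rough sketch of the in-memory compaction step: merging a few already-sorted snapshot segments into one, so that scans heap-merge one large segment instead of hundreds of small ones. The segment representation (a sorted list of serialized cells) is a hypothetical stand-in, not an HBase class.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical in-memory compaction: k sorted segments merged into one. */
public final class SegmentCompactorSketch {
  public static List<byte[]> compact(List<List<byte[]>> segments, Comparator<byte[]> cmp) {
    // Each heap entry is {segmentIndex, positionWithinSegment}, ordered by its current cell key.
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        (a, b) -> cmp.compare(segments.get(a[0]).get(a[1]), segments.get(b[0]).get(b[1])));
    for (int i = 0; i < segments.size(); i++) {
      if (!segments.get(i).isEmpty()) {
        heap.add(new int[] { i, 0 });
      }
    }
    List<byte[]> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      List<byte[]> seg = segments.get(top[0]);
      merged.add(seg.get(top[1]));            // emit the smallest remaining cell
      if (top[1] + 1 < seg.size()) {
        heap.add(new int[] { top[0], top[1] + 1 });
      }
    }
    return merged;                            // one large sorted segment
  }
}
{code}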


was (Author: apurtell):
bq. The problem is that if you have hundreds of 1MB in-memory HFiles, then it 
becomes too expensive to merge them all (via KVHeap) when scanning. A possible 
solution is to subdivide the memstore into stripes (probably smaller than the 
stripe compaction stripes) and periodically compact the in-memory stripes

Anoop, Ram, and I were throwing around ideas of making in-memory HFiles out of 
memstore snapshots, and then doing in-memory compaction over them. If we have 
off-heap backing for memstore we could potentially carry larger datasets 
leading to less frequent flushes and significantly less write amplification 
overall. 



[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2014-02-20 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906599#comment-13906599
 ] 

Lars Hofhansl edited comment on HBASE-10191 at 2/20/14 4:33 PM:


My office neighbor used to work on a proprietary Java database, and he says 
they used 128GB or even 192GB Java heaps and larger all the time without any 
significant GC impact.

(Non-moving) collection times are not a function of heap size but rather of 
heap complexity, i.e. the number of objects to track. (HBase also produces a lot 
of garbage, but that is short lived and can be quickly collected by a moving 
collector for the young gen.)
With the MemStoreLAB and the block cache, HBase already does a good job on this. 
Even as things stand, if we filled an entire 128 GB heap with 64 KB blocks from 
the blockcache, that would only be about 2 million objects.
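(Back of the envelope: 128 GiB / 64 KiB per block = 2^37 / 2^16 = 2^21, i.e. roughly 2.1 million block objects, assuming one on-heap object per cached block.)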
Now, if we want to get into the < 100ms latency area, we need to rethink things; 
that will generally be very difficult in current Java.

While we move everything out of the Java heap in an all-or-nothing fashion, we 
should also investigate whether we can make the GC's life easier yet.

Edit: Edited for clarity.



was (Author: lhofhansl):
This might not be a very popular viewpoint these days, but anyway. My office 
neighbor used to work on a proprietary Java database, and he says they used 
128GB or even 192GB Java heaps and larger all the time without any significant 
GC impact.

(Non-moving) collection times are not a function of heap size but rather of 
heap complexity, i.e. the number of objects to track. (HBase also produces a lot 
of garbage, but that is short lived and can be quickly collected by a moving 
collector for the young gen.)
With the MemStoreLAB and the block cache, HBase already does a good job on this. 
Even as things stand, if we filled an entire 128 GB heap with 64 KB blocks from 
the blockcache, that would only be about 2 million objects.
Now, if we want to foray into the < 100ms latency area we need to rethink 
things, but then Java might just not be the right choice.

Before we embark on an all-or-nothing adventure and move everything out of the 
Java heap, we should also investigate whether we can make the GC's life easier, 
yet.


[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2014-02-19 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906456#comment-13906456
 ] 

Andrew Purtell edited comment on HBASE-10191 at 2/20/14 1:59 AM:
-----------------------------------------------------------------

I'm looking at Netty 4's netty-buffer module 
(http://netty.io/4.0/api/io/netty/buffer/package-summary.html), which has some 
nice properties, including composite buffers, arena allocation, dynamic buffer 
resizing, and reference counting, never mind dev and testing by another 
community. I also like it because you can plug in your own allocators and 
specialize the abstract ByteBuf base type. More on this later.

When I get closer to seeing what exactly needs to be done I will post a design 
doc. Current thinking follows. Below the term 'buffer' currently means Netty 
ByteBufs or derived classes backed by off-heap allocated direct buffers.

*Write*

When coming in from RPC, cells are laid out by codecs into cellblocks in buffers 
and the cellblocks/buffers are handed to the memstore. Netty's allocation 
arenas replace the MemstoreLAB. The memstore data structure evolves into an 
index over cellblocks.
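
A minimal sketch of that hand-off, assuming a simple [keyLen][valueLen][key][value] cellblock layout and a placeholder initial size; the class and method names are illustrative, not proposed APIs. The point is just the Netty calls involved: a pooled, arena-backed direct allocation standing in for the MemstoreLAB.

{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public final class CellblockWriteSketch {
  // Pooled allocator whose arenas hand out direct (off-heap) buffers.
  private static final PooledByteBufAllocator ALLOC = new PooledByteBufAllocator(true);

  /** Append one cell to a cellblock, allocating the cellblock from the arena if needed. */
  public static ByteBuf appendCell(ByteBuf cellblock, byte[] key, byte[] value) {
    if (cellblock == null) {
      cellblock = ALLOC.directBuffer(64 * 1024);   // placeholder initial cellblock size
    }
    cellblock.writeInt(key.length);                // lengths first, then the bytes
    cellblock.writeInt(value.length);
    cellblock.writeBytes(key);
    cellblock.writeBytes(value);
    return cellblock;   // handed to the memstore, which only indexes into it
  }

  public static void releaseCellblock(ByteBuf cellblock) {
    cellblock.release();   // memory goes back to the arena once refCnt reaches 0
  }
}
{code}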

Per [~mcorgan]'s comment above, we should think about how the memstore index 
can be built with fewer object allocations than the number of cells in the 
memstore, yet be in the ballpark with efficiency of concurrent access. A tall 
order. CSLM wouldn't be the right choice as it allocates at least one list 
entry per key, but we could punt and use it initially and make a replacement 
data structure as a follow-on task.
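
To make the "index over cellblocks" idea concrete, a punt-and-use-CSLM sketch could look like the following; the packed (cellblock id, offset) pointer encoding and the class name are purely illustrative.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical memstore index: maps a serialized cell key to its location inside a cellblock. */
public final class CellblockIndexSketch {
  /** Unsigned lexicographic byte[] comparison, in the spirit of Bytes.BYTES_COMPARATOR. */
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }

  // Still one CSLM node per cell: the "at least one list entry per key" cost noted above.
  private final ConcurrentSkipListMap<byte[], Long> index =
      new ConcurrentSkipListMap<byte[], Long>(CellblockIndexSketch::compare);

  /** Record that the cell with this key lives at (cellblockId, offset). */
  public void put(byte[] key, int cellblockId, int offset) {
    index.put(key, ((long) cellblockId << 32) | (offset & 0xffffffffL));
  }

  /** Packed location of the first cell at or after the given key, or null if none. */
  public Long ceilingLocation(byte[] key) {
    Map.Entry<byte[], Long> e = index.ceilingEntry(key);
    return e == null ? null : e.getValue();
  }
}
{code}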

Cellblocks in memstore should be amenable to flushing to disk as a gathering 
write. This may mean cellblocks have the same internal structure as HFile 
blocks and we reuse all of the block encoder machinery (and simplify them in 
the process).
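
A hedged sketch of the gathering write, using a local FileChannel as a stand-in for the eventual HDFS write path; the ByteBuffer list could be nioBuffer() views of the direct cellblock buffers.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

public final class GatheringFlushSketch {
  /** Flush a list of cellblock buffers to a file with gathering writes (no on-heap staging copy). */
  public static void flush(List<ByteBuffer> cellblocks, String path) throws IOException {
    ByteBuffer[] srcs = cellblocks.toArray(new ByteBuffer[0]);
    long total = 0;
    for (ByteBuffer b : srcs) {
      total += b.remaining();
    }
    try (FileChannel ch = FileChannel.open(Paths.get(path),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      long written = 0;
      while (written < total) {
        written += ch.write(srcs);   // gathering write: the kernel drains each buffer in turn
      }
    }
  }
}
{code}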

*Read*

We feed down buffers to HDFS to fill with file block data. We pick which pool 
to get a buffer from for a read depending on family caching strategy. Pools 
could be backed by arenas that match up with LRU policy strata, with a common 
pool/arena for noncaching reads. (Or for noncaching reads, can we optionally 
use a new API for getting buffers up from HDFS, perhaps backed by the pinned 
shared RAM cache, since we know we will be referring to the contents only 
briefly?) It will be important to get reference counting right as we will be 
servicing scans while attempting to evict. Related, eviction of a block may not 
immediately return a buffer to a pool, if there is more than one block in a 
buffer.
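
A sketch of the reference-counting discipline this implies, using Netty's retain/release; the cache structure here is a stand-in, not a proposed blockcache design.

{code:java}
import io.netty.buffer.ByteBuf;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical refcounted block handles: scanners retain, eviction releases. */
public final class RefCountedCacheSketch {
  private final ConcurrentHashMap<String, ByteBuf> blocks = new ConcurrentHashMap<String, ByteBuf>();

  /** Scanner pins the block so eviction cannot return its buffer to the arena mid-scan. */
  public ByteBuf checkout(String blockKey) {
    ByteBuf buf = blocks.get(blockKey);
    // NOTE: a real implementation must guard against retaining a buffer whose refCnt already hit 0.
    return buf == null ? null : buf.retain();
  }

  /** Scanner is done with the block. */
  public void checkin(ByteBuf buf) {
    buf.release();   // buffer (or its shared parent) returns to the arena only at refCnt == 0
  }

  /** Eviction drops the cache's own reference; outstanding scanners keep the memory alive. */
  public void evict(String blockKey) {
    ByteBuf buf = blocks.remove(blockKey);
    if (buf != null) {
      buf.release();
    }
  }
}
{code}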

We maintain new metrics on numbers of buffers allocated, stats on arenas, stats 
on wastage and internal fragmentation of the buffers, etc, and use these to 
guide optimizations and refinements.


was (Author: apurtell):
I'm looking at Netty 4's netty-buffer module 
(http://netty.io/4.0/api/io/netty/buffer/package-summary.html), which has some 
nice properties, including composite buffers, arena allocation, dynamic buffer 
resizing, and reference counting, never mind dev and testing by another 
community. I also like it because you can plug in your own allocators and 
specialize the abstract ByteBuf base type. More on this later.

When I get closer to seeing what exactly needs to be done I will post a design 
doc. Current thinking follows. Below the term 'buffer' currently means Netty 
ByteBufs or derived classes backed by off-heap allocated direct buffers.

*Write*

When coming in from RPC, cells are laid out by codecs into cellblocks in buffers 
and the cellblocks/buffers are handed to the memstore. Netty's allocation 
arenas replace the MemstoreLAB. The memstore data structure evolves into an 
index over cellblocks.

Per [~mcorgan]'s comment above, we should think about how the memstore index 
can be built with fewer object allocations than the number of cells in the 
memstore, yet be in the ballpark with efficiency of concurrent access. A tall 
order. CSLM wouldn't be the right choice as it allocates at least one list 
entry per key, but we could punt and use it initially and make a replacement 
data structure as a follow-on task.

*Read*

We feed down buffers to HDFS to fill with file block data. We pick which pool 
to get a buffer from for a read depending on family caching strategy. Pools 
could be backed by arenas that match up with LRU policy strata, with a common 
pool/arena for noncaching reads. (Or for noncaching reads, can we optionally 
use a new API for getting buffers up from HDFS, perhaps backed by the pinned 
shared RAM cache, since we know we will be referring to the contents only 
briefly?) It will be important to get reference counting right as we will be 
servicing scans while attempting to evict. Related, eviction of a block may not 
immediately return a buffer to a pool, if there is more than one block in a 
buffer.

We maintain new metrics on numbers of buffers allocated, stats on arenas, stats 
on wastage and internal fragmentation of the buffers, etc, and use these to 
guide optimizations and refinements.

[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2014-02-19 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906456#comment-13906456
 ] 

Andrew Purtell edited comment on HBASE-10191 at 2/20/14 2:07 AM:
-----------------------------------------------------------------

I'm looking at Netty 4's netty-buffer module 
(http://netty.io/4.0/api/io/netty/buffer/package-summary.html), which has some 
nice properties, including composite buffers, arena allocation, dynamic buffer 
resizing, and reference counting, never mind dev and testing by another 
community. I also like it because you can plug in your own allocators and 
specialize the abstract ByteBuf base type. More on this later.

When I get closer to seeing what exactly needs to be done I will post a design 
doc. Current thinking follows. Below the term 'buffer' currently means Netty 
ByteBufs or derived classes backed by off-heap allocated direct buffers.

*Write*

When coming in from RPC, cells are laid out by codecs into cellblocks in buffers 
and the cellblocks/buffers are handed to the memstore. Netty's allocation 
arenas replace the MemstoreLAB. The memstore data structure evolves into an 
index over cellblocks.

Per [~mcorgan]'s comment above, we should think about how the memstore index 
can be built with fewer object allocations than the number of cells in the 
memstore, yet be in the ballpark with efficiency of concurrent access. A tall 
order. CSLM wouldn't be the right choice as it allocates at least one list 
entry per key, but we could punt and use it initially and make a replacement 
data structure as a follow-on task.

Cellblocks in memstore should be amenable to flushing to disk as a gathering 
write. This may mean cellblocks have the same internal structure as HFile 
blocks and we reuse all of the block encoder machinery (and simplify them in 
the process).

*Read*

We feed down buffers to HDFS to fill with file block data. We pick which pool 
to get a buffer from for a read depending on family caching strategy. Pools 
could be backed by arenas that match up with LRU policy strata, with a common 
pool/arena for noncaching reads. (Or for noncaching reads, can we optionally 
use a new API for getting buffers up from HDFS, perhaps backed by the pinned 
shared RAM cache, since we know we will be referring to the contents only 
briefly?) It will be important to get reference counting right as we will be 
servicing scans while attempting to evict. Related, eviction of a block may not 
immediately return a buffer to a pool, if there is more than one block in a 
buffer.

We maintain new metrics on numbers of buffers allocated, stats on arenas, stats 
on wastage and internal fragmentation of the buffers, etc, and use these to 
guide optimizations and refinements.

This should require fewer changes than the write side since we are already set 
up for dealing with cellblocks. Design points to optimize would be minimizing 
the number and size of data copies, minimizing the number of on-heap object 
allocations, and an on-disk encoding that also serves as an efficient in-memory 
representation.


was (Author: apurtell):
I'm looking at Netty 4's netty-buffer module 
(http://netty.io/4.0/api/io/netty/buffer/package-summary.html), which has some 
nice properties, including composite buffers, arena allocation, dynamic buffer 
resizing, and reference counting, never mind dev and testing by another 
community. I also like it because you can plug in your own allocators and 
specialize the abstract ByteBuf base type. More on this later.

When I get closer to seeing what exactly needs to be done I will post a design 
doc. Current thinking follows. Below the term 'buffer' currently means Netty 
ByteBufs or derived classes backed by off-heap allocated direct buffers.

*Write*

When coming in from RPC, cells are laid out by codecs into cellblocks in buffers 
and the cellblocks/buffers are handed to the memstore. Netty's allocation 
arenas replace the MemstoreLAB. The memstore data structure evolves into an 
index over cellblocks.

Per [~mcorgan]'s comment above, we should think about how the memstore index 
can be built with fewer object allocations than the number of cells in the 
memstore, yet be in the ballpark with efficiency of concurrent access. A tall 
order. CSLM wouldn't be the right choice as it allocates at least one list 
entry per key, but we could punt and use it initially and make a replacement 
data structure as a follow-on task.

Cellblocks in memstore should be amenable to flushing to disk as a gathering 
write. This may mean cellblocks have the same internal structure as HFile 
blocks and we reuse all of the block encoder machinery (and simplify them in 
the process).

*Read*

We feed down buffers to HDFS to fill with file block data. We pick which pool 
to get a buffer from for a read depending on family caching strategy. Pools 
could be backed by arenas that match up with LRU policy strata, with a common 
pool/arena for noncaching reads.

[jira] [Comment Edited] (HBASE-10191) Move large arena storage off heap

2013-12-17 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851148#comment-13851148
 ] 

Andrew Purtell edited comment on HBASE-10191 at 12/18/13 12:18 AM:
---

bq. Memstore and BlockCache are commonly cited as the offending components, but 
I've not seen anyone present conclusive profiling results making this clear

It's abundantly clear, once we are using heaps larger than ~8 GB, that collection pauses 
under safepoint blow out latency SLAs at the high percentiles. Why would we 
need heaps larger than this? To take direct advantage of large server RAM. 
Memstore and blockcache are then the largest allocators of heap memory. If we 
move them off heap, they can soak up most of the available RAM, leaving 
remaining heap demand relatively small - this is the idea.

Edit: Phrasing


was (Author: apurtell):
bq. Memstore and BlockCache are commonly cited as the offending components, but 
I've not seen anyone present conclusive profiling results making this clear

It's abundantly clear, once we are using heaps larger than ~8 GB, that collection pauses 
under safepoint blow out latency SLAs at the high percentiles. I've observed 
this directly under mixed read+write load. (Read-only loads work ok with G1 
even with very large heaps, e.g. 192 GB.) Why would we need heaps larger than 
this? To take direct advantage of large server RAM. Memstore and blockcache are 
then the largest allocators of heap memory. If we move them off heap, they can 
soak up most of the available RAM, leaving remaining heap demand relatively 
small - this is the idea.
