This is interesting. Any chance that the cells in the regions hosted on server A are 5MB in size?
The hfile block size is configured to be 64KB by default, but an hfile block will rarely be exactly 64KB. We do not cut hfile block content at exactly 64KB; the block boundary always falls on a keyvalue boundary. If a cell is 5MB, it does not get split across multiple hfile blocks -- it occupies a single hfile block. Could it be that the region hosted on A is not like the others, and it holds lots of these 5MB cells? Let us know.

If the above is not the case, then you have an interesting phenomenon going on and we need to dig in more.

St.Ack

On Thu, Jul 14, 2011 at 5:27 AM, Mingjian Deng <[email protected]> wrote:
> Hi:
>     We found a strange problem in our read test. It is a 5-node cluster.
> Four of our 5 regionservers set hfile.block.cache.size=0.4; one of them is
> set to 0.1 (node A). When we did random reads from a 2TB table, we found
> node A's network traffic reached 100MB while the others' was less than
> 10MB. We know node A needs to read data from disk and put it in the block
> cache. See the following code in LruBlockCache:
> --------------------------------------------------------------------------
> public void cacheBlock(String blockName, ByteBuffer buf, boolean inMemory) {
>   CachedBlock cb = map.get(blockName);
>   if (cb != null) {
>     throw new RuntimeException("Cached an already cached block");
>   }
>   cb = new CachedBlock(blockName, buf, count.incrementAndGet(), inMemory);
>   long newSize = size.addAndGet(cb.heapSize());
>   map.put(blockName, cb);
>   elements.incrementAndGet();
>   if (newSize > acceptableSize() && !evictionInProgress) {
>     runEviction();
>   }
> }
> --------------------------------------------------------------------------
>
> We debugged this code with the following btrace script:
> --------------------------------------------------------------------------
> import static com.sun.btrace.BTraceUtils.*;
> import com.sun.btrace.annotations.*;
>
> import java.nio.ByteBuffer;
> import org.apache.hadoop.hbase.io.hfile.*;
>
> @BTrace public class TestRegion {
>   @OnMethod(
>     clazz="org.apache.hadoop.hbase.io.hfile.LruBlockCache",
>     method="cacheBlock"
>   )
>   public static void traceCacheBlock(@Self LruBlockCache instance,
>       String blockName, ByteBuffer buf, boolean inMemory) {
>     println(strcat("size: ",
>         str(get(field("org.apache.hadoop.hbase.io.hfile.LruBlockCache", "size"), instance))));
>     println(strcat("elements: ",
>         str(get(field("org.apache.hadoop.hbase.io.hfile.LruBlockCache", "elements"), instance))));
>   }
> }
> --------------------------------------------------------------------------
>
> We found that "size" increases by 5MB each time on node A! Why not 64KB
> each time?? But "size" increases by 64KB when we run this btrace script on
> the other nodes at the same time.
>
> The following script also confirms the problem, because "decompressedSize"
> is 5MB each time on node A:
> --------------------------------------------------------------------------
> import static com.sun.btrace.BTraceUtils.*;
> import com.sun.btrace.annotations.*;
>
> import java.nio.ByteBuffer;
> import org.apache.hadoop.hbase.io.hfile.*;
>
> @BTrace public class TestRegion1 {
>   @OnMethod(
>     clazz="org.apache.hadoop.hbase.io.hfile.HFile$Reader",
>     method="decompress"
>   )
>   public static void traceCacheBlock(final long offset, final int compressedSize,
>       final int decompressedSize, final boolean pread) {
>     println(strcat("decompressedSize: ", str(decompressedSize)));
>   }
> }
> --------------------------------------------------------------------------
>
> Why not 64KB?
>
> BTW: When we set hfile.block.cache.size=0.4 on node A, "decompressedSize"
> went down to 64KB, and the tps went up to a high level.
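To make the keyvalue-boundary behavior above concrete, here is a minimal, hypothetical sketch (class and method names are illustrative, not HBase's actual HFile writer code): cells are never split, and a block is only closed once a full cell has pushed it to or past the 64KB target, so a single 5MB cell lands in a single ~5MB block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockBoundarySketch {
    // Default target hfile block size (64KB).
    static final int TARGET_BLOCK_SIZE = 64 * 1024;

    // Simulate packing cells into hfile blocks. A cell is appended whole,
    // and the block is closed at the first keyvalue boundary at or past the
    // target size -- so a cell bigger than the target yields one big block.
    static List<Integer> packCells(int[] cellSizes) {
        List<Integer> blockSizes = new ArrayList<>();
        int current = 0;
        for (int size : cellSizes) {
            current += size;                 // cells are never split
            if (current >= TARGET_BLOCK_SIZE) {
                blockSizes.add(current);     // close block at this KV boundary
                current = 0;
            }
        }
        if (current > 0) blockSizes.add(current); // trailing partial block
        return blockSizes;
    }

    public static void main(String[] args) {
        // Typical 1KB cells: blocks come out at (or slightly over) 64KB.
        int[] smallCells = new int[100];
        Arrays.fill(smallCells, 1024);
        System.out.println("small-cell blocks: " + packCells(smallCells));

        // One 5MB cell: it occupies a single ~5MB block.
        int[] bigCell = { 5 * 1024 * 1024 };
        System.out.println("big-cell block:    " + packCells(bigCell));
    }
}
```

If node A's regions hold cells like the second case, every cache miss reads and decompresses a ~5MB block, which would match the 5MB `decompressedSize` seen in the btrace output.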

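As for why the btrace probe on cacheBlock shows "size" jumping by 5MB: the counter grows by `cb.heapSize()`, which is dominated by the cached buffer's capacity. A rough sketch (the per-entry overhead constant is illustrative, not HBase's actual CachedBlock accounting):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CacheSizeSketch {
    // Illustrative fixed per-entry overhead; the real CachedBlock.heapSize()
    // additionally accounts for object headers, the block name string, etc.
    static final long PER_ENTRY_OVERHEAD = 64;

    final AtomicLong size = new AtomicLong();

    // Mirrors size.addAndGet(cb.heapSize()) in LruBlockCache.cacheBlock:
    // the counter grows by roughly the decompressed block size.
    long cacheBlock(int bufferCapacity) {
        return size.addAndGet(bufferCapacity + PER_ENTRY_OVERHEAD);
    }

    public static void main(String[] args) {
        CacheSizeSketch cache = new CacheSizeSketch();
        // A normal ~64KB block vs. a 5MB block holding one big cell.
        System.out.println("after 64KB block: " + cache.cacheBlock(64 * 1024));
        System.out.println("after 5MB block:  " + cache.cacheBlock(5 * 1024 * 1024));
    }
}
```

So the 5MB increments in "size" and the 5MB `decompressedSize` are two views of the same thing: each block cached on node A is itself ~5MB.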