Our data can be characterized as a list of sets, where one row
corresponds to one element of a set.
Our puts and gets work on one set at a time. Our sets typically range
from 1 to 1000 elements, and a few range from 1k to 20k elements.
I can't guarantee it is a perfect codebase, but we do use HTablePool
for reusing HTable instances.
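For concreteness, the reuse pattern looks roughly like this with the 0.20 client; the table name "sets" and the pool size are made up for illustration, not taken from this thread:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTablePool;

// One shared pool; threads borrow an HTable per operation instead of
// constructing a new one each time (HTable construction is expensive).
public class PooledAccess {
    private static final HTablePool POOL =
            new HTablePool(new HBaseConfiguration(), 10);

    void withTable() throws IOException {
        HTable table = POOL.getTable("sets");   // borrow from the pool
        try {
            // ... puts/gets for one set go here ...
        } finally {
            POOL.putTable(table);               // return for reuse
        }
    }
}
```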
What I wanted out of this discussion was to find out whether I am in
the ballpark of what I can juice out of HBase or I am way off the mark.
~Jacob
On May 28, 2010, at 7:16 PM, Jean-Daniel Cryans <jdcry...@apache.org>
wrote:
Looks like you spend 1/6 of your time doing the gets, good to know.
For autoflush=false, if you fit the 4-5KB in a single Put, then it
won't help, as 1 put = 1 RPC. Batching them together almost always
improves performance. The default buffer size is 2MB btw.
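A minimal sketch of that batching with the 0.20 client API; the table name "sets", the column family "f", and the qualifier "q" are placeholders, not anything from this thread:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedLoad {
    // rows: pairs of (rowKey, value) supplied by the application
    void load(List<byte[][]> rows) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "sets");
        table.setAutoFlush(false);                  // buffer puts client-side
        table.setWriteBufferSize(2L * 1024 * 1024); // 2MB is the default anyway
        for (byte[][] row : rows) {
            Put p = new Put(row[0]);
            p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), row[1]);
            table.put(p);  // queued; an RPC only fires when the buffer fills
        }
        table.flushCommits();                       // push whatever is left
    }
}
```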
LZO should give you another big boost, at least if your data can be
compressed in any way. Also watch out for stuff that takes a lot of
time in your code, like instantiating lots of HTables (reuse the same
one as much as you can inside a single thread), use finals, etc. I've
seen a good number of people shoot themselves in the foot by writing
poorly performing code; it's crazy how running the same slowish thing
800M times ends up taking hours!
J-D
On Fri, May 28, 2010 at 4:11 PM, Jacob Isaac <ja...@ebrary.com> wrote:
Here is the summary of the runs:

puts (~4-5k per row)
regionsize    #rows         total time (ms)
1G            82282053*2    301943742
512M          82287593*2    313119378
256M          82246314*2    433200105

gets (~4-5k per row)
regionsize    #rows         total time (ms)
1G            82427685      90116726
512M          82421943      94878466
256M          82395487      108160178
Note: for the 256M run, hbase.hregion.memstore.flush.size=64m;
for the other two runs, hbase.hregion.memstore.flush.size=96m.
Regarding disabling autoflush - since there is a large number of
writes (~4k per row) happening, we would have hit the
hbase.client.write.buffer size every few seconds.
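As a rough sanity check on how often that buffer would fill (pure arithmetic, assuming the 2MB default write buffer and ~4.5KB rows quoted in this thread):

```java
// How many ~4.5KB puts fit in the default 2MB client write buffer,
// i.e. how many rows share a single flush RPC.
public class WriteBufferMath {
    public static void main(String[] args) {
        long bufferBytes = 2L * 1024 * 1024; // hbase.client.write.buffer default
        long rowBytes = 4500;                // ~4-5k per row
        System.out.println("puts per flush: " + (bufferBytes / rowBytes));
    }
}
```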
~Jacob
On Fri, May 28, 2010 at 1:36 PM, Jacob Isaac <ja...@ebrary.com>
wrote:
Vidhya - This is using HBase API.
J-D - I do have timing info for inserts and gets. Let me process the
data and I will post the results.
~Jacob.
On Fri, May 28, 2010 at 1:16 PM, Vidhyashankar Venkataraman <vidhy...@yahoo-inc.com> wrote:
Jacob,
Just curious: is your observed upload throughput that of bulk
importing, or of using the HBase API?
Thanks
Vidhya
On 5/28/10 1:13 PM, "Jacob Isaac" <ja...@ebrary.com> wrote:
Hi J-D
The run was done on a reformatted HDFS.
Disabling the WAL is not an option for us because this will be our
normal mode of operation, and durability is important to us.
'Upload' was a poor choice of words on my part - it is more like
periodic/continuous writes.
hbase.regionserver.maxlogs was 256, although
hbase.regionserver.hlog.blocksize was the default.
We did not use compression, and autoflush is the default (true).
Each of the 20 nodes is running a custom server program that reads
from and writes to HBase, with a max of 6 write threads per node and
1 thread reading.
Also wanted to point out that in the current tests we are writing to
two tables and reading from only one.
~Jacob
On Fri, May 28, 2010 at 12:42 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
If the table was already created, changing
hbase.hregion.max.filesize and hbase.hregion.memstore.flush.size
won't be considered; those are the default values for new tables
only. You can set them in the shell too - see the "alter" command.
Also, did you restart HBase? Did you push the configs to all nodes?
Did you disable writing to the WAL? If not - because durability is
still important to you, but you want to upload as fast as you can - I
would recommend changing these too:

hbase.regionserver.hlog.blocksize 134217728
hbase.regionserver.maxlogs 128

I forgot you had quite largish values, so that must affect the log
rolling a _lot_.
Finally, did you LZO the table? From experience, it will only do
good: http://wiki.apache.org/hadoop/UsingLzoCompression
And finally (for real this time), how are you uploading to HBase?
How many clients? Are you even using the write buffer?
http://hadoop.apache.org/hbase/docs/r0.20.4/api/org/apache/hadoop/hbase/client/HTable.html#setAutoFlush(boolean)
J-D
On Fri, May 28, 2010 at 12:28 PM, Jacob Isaac <ja...@ebrary.com>
wrote:
Did a run yesterday; the relevant parameters are posted below.
Did not see any difference in throughput or total run time (~9 hrs).
I am consistently getting about 5k rows/sec, each row around ~4-5k,
using a 17-node HBase cluster on top of a 20-node HDFS cluster.
How does it compare? Can I juice it more?
~Jacob
<property>
<name>hbase.regionserver.handler.count</name>
<value>60</value>
</property>
<property>
<name>hbase.hregion.max.filesize</name>
<value>1073741824</value>
</property>
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>100663296</value>
</property>
<property>
<name>hbase.hstore.blockingStoreFiles</name>
<value>15</value>
</property>
<property>
<name>hbase.hstore.compactionThreshold</name>
<value>4</value>
</property>
<property>
<name>hbase.hregion.memstore.block.multiplier</name>
<value>8</value>
</property>
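For what it's worth, the aggregate bandwidth those numbers imply, as plain arithmetic from the figures above (~5000 rows/sec at ~4.5KB/row over 17 region servers):

```java
// Aggregate write bandwidth implied by the reported throughput,
// and the per-region-server share (figures from this thread).
public class ThroughputMath {
    public static void main(String[] args) {
        double rowsPerSec = 5000;
        double bytesPerRow = 4500;
        double clusterMBps = rowsPerSec * bytesPerRow / (1024 * 1024);
        double perServerMBps = clusterMBps / 17;   // 17 region servers
        System.out.printf("cluster: %.1f MB/s, per RS: %.1f MB/s%n",
                clusterMBps, perServerMBps);
    }
}
```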
On Fri, May 28, 2010 at 10:15 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
Like I said in my first email, it helps for random reading when lots
of RAM is available to HBase. But it won't help the write throughput.
J-D
On Fri, May 28, 2010 at 10:12 AM, Vidhyashankar Venkataraman
<vidhy...@yahoo-inc.com> wrote:
I am not sure if I understood this right, but does changing
hfile.block.cache.size also help?
On 5/27/10 3:27 PM, "Jean-Daniel Cryans" <jdcry...@apache.org>
wrote:
Well we do have a couple of other configs for high write
throughput:
<property>
<name>hbase.hstore.blockingStoreFiles</name>
<value>15</value>
</property>
<property>
<name>hbase.hregion.memstore.block.multiplier</name>
<value>8</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>60</value>
</property>
<property>
<name>hbase.regions.percheckin</name>
<value>100</value>
</property>
The last one is for restarts. Uploading very fast, you will more
likely hit all the upper limits (blocking store files and the
memstore multiplier), and this will lower your throughput; those
configs relax that. Also, for speedier uploads we disable writing to
the WAL:
http://hadoop.apache.org/hbase/docs/r0.20.4/api/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
If the job fails or any machine fails, you'll have to restart it or
figure out the whole thing, and you absolutely need to force flushes
when the MR job is done.
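The per-Put switch behind that link looks like this; the family and qualifier names are placeholders, and remember that edits written this way are gone if a region server dies before its memstores flush:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPut {
    static void put(HTable table, byte[] row, byte[] value) throws IOException {
        Put p = new Put(row);
        p.setWriteToWAL(false);  // skip the write-ahead log: faster, but the
                                 // edit is not durable until a memstore flush
        p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), value);
        table.put(p);
    }
}
```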
J-D
On Thu, May 27, 2010 at 2:57 PM, Jacob Isaac <ja...@ebrary.com>
wrote:
Thanks J-D.
Currently we are trying to optimize our load/write times, although in
prod we expect a 25/75 (writes/reads) ratio.
We are using the long table model with only one column; row size is
typically ~4-5k.
As to your suggestion of not using even 50% of disk space - I agree,
and was planning to use only ~30-40% (1.5T of 4T) for HDFS.
And as I reported earlier:
4000 regions @ 256M per region (with 3 replications) on 20 nodes ==
150G per node == 10% utilization.
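That utilization figure checks out as plain arithmetic:

```java
// 4000 regions x 256MB each, 3x HDFS replication, spread over 20 nodes
// (figures from this thread).
public class RegionMath {
    public static void main(String[] args) {
        long regions = 4000;
        long regionMB = 256;
        long replication = 3;
        long nodes = 20;
        long perNodeGB = regions * regionMB * replication / nodes / 1024;
        System.out.println(perNodeGB + " GB per node"); // of ~1.5TB allotted
    }
}
```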
While using 1GB as maxfilesize, did you have to adjust other params
such as hbase.hstore.compactionThreshold and
hbase.hregion.memstore.flush.size?
There is an interesting observation by Jonathan Gray
documented/reported in HBASE-2375; wondering whether that issue gets
compounded when using 1G as the hbase.hregion.max.filesize.
Thx
Jacob