Hi Andrew,
Thanks for the suggestion. In the use case I am considering, when the TTL is set to 10 minutes, all data fits in memory in 3 regions. However, when the TTL is set to a longer time it will not fit in memory; some of our tables' TTLs can be set to 2 weeks, 1 month, or 1 year. Persistence is still needed in case the regionserver shuts down. We want to use HBase for this use case because of the large amount of data that flows through it.

What I am actually testing is what happens when there is a large turnover of the data in the regions. I found that manually running major_compact against the table helps a lot: the older data gets removed. HBase is supposed to run a major compaction every day, but searching the logs showed that is not the case; I found situations where the major compaction didn't happen for several days. For the tables with a TTL, the lack of major compaction greatly impacted read performance. After letting insertion into the table run for 1 day, counting the rows of the table took more than 450 seconds; after a major_compact, the same operation took only 65 seconds. In the end, I resorted to the command echo "major_compact 'table_name'" | hbase shell, put it in a cron job for those tables with high data turnover, and run it hourly. I am still testing to see whether it helps with this situation.
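For reference, the workaround looks roughly like this (table_name is just a placeholder here, and the hbase binary is assumed to be on the PATH):

```shell
# Trigger a major compaction of one table from the command line.
# Note the quoting: the outer double quotes preserve the inner
# single quotes that the hbase shell expects around the table name.
echo "major_compact 'table_name'" | hbase shell

# crontab entry to run it at the top of every hour:
# 0 * * * * echo "major_compact 'table_name'" | hbase shell
```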

Jimmy


--------------------------------------------------
From: "Andrew Purtell" <[email protected]>
Sent: Sunday, September 19, 2010 7:37 AM
To: <[email protected]>
Subject: Re: lack of region merge cause in_memory option trouble

Hi Jimmy,

IN_MEMORY may not mean what you think. It does not turn off disk persistence, flushing, etc. It is a suggestion to the regionserver that all of the data for the region be retained in block cache.

Also, as I said before, your test case is not really what the current TTL implementation targets. If you want it to work better for you given such short TTLs, it may make sense to modify the memstore to simply not flush values with short TTLs, if they will expire within a few minutes or seconds.

The idea is that we are only interested in the last 10
minutes' data; as data gets older it will be purged, and the
amount of memory and disk usage will remain low. [...]

What is the anticipated data volume within that 10 minute window? Will it all fit in RAM on a single server? Or perhaps a small cluster of servers?

The BigTable/HBase design targets large data scale, and the implementation is optimized for that: a distributed, elastic, **persistent** sparse map with multidimensional keys. What you are talking about here is way at the other end of the spectrum, and persistence may not be something you want.

  - Andy

From: Jinsong Hu <[email protected]>
Subject: lack of region merge cause in_memory option trouble
To: [email protected]
Date: Friday, September 17, 2010, 2:53 PM
Hi,
 I was trying to find out whether HBase can be used in a
real-time processing scenario. To do so, I set in_memory
for a table to true, and set the TTL for the table to 10
minutes. The data comes in chronological order. I let the
test run for 1 day. The idea is that we are only interested
in the last 10 minutes' data; as data gets older it will be
purged, and the amount of memory and disk usage will
remain low.
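 For reference, the table setup for this test was along
these lines in the hbase shell (the table and column family
names here are made-up placeholders; TTL is in seconds):

```
create 'events', {NAME => 'd', IN_MEMORY => 'true', TTL => 600}
```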
 What I found is that the region count continued to grow,
and overnight it created 46 regions; HDFS shows it used
8.6G of disk space. This is one order of magnitude higher
than my estimate for the ideal case. The data rate I am
pumping is only 3 regions/hour. I would imagine we would
have fewer than 3 regions in HBase in this situation, and
only about 700M of HDFS usage, regardless of how long I run
the test.
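 As a rough sanity check on that estimate (a back-of-envelope
sketch using the observed 8.6G across 46 regions and the
3 regions/hour rate; all numbers approximate):

```shell
# Back-of-envelope steady-state estimate for a 10-minute TTL:
# live data should be roughly ingest rate x TTL window.
awk 'BEGIN {
  gb_per_region = 8.6 / 46         # observed: ~0.19 GB per region
  live_regions  = 3 * (10.0 / 60)  # 3 regions/hour over a 10-minute window
  printf "%.2f GB/region, %.1f regions live, %.2f GB live\n",
         gb_per_region, live_regions, live_regions * gb_per_region
}'
# prints: 0.19 GB/region, 0.5 regions live, 0.09 GB live
```

Even allowing a couple of extra regions of flush/compaction
slack, that lands in the neighborhood of the 700M figure
above, far below the 8.6G actually observed.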
 I understand that the region merge request is already
filed. Does anybody know when that will be implemented?

Jimmy.





