-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, September 01, 2010 10:45 PM
To: [email protected]
Subject: Re: HBase table lost on upgrade
On Wed, Sep 1, 2010 at 5:49 PM, Sharma, Avani <[email protected]> wrote:
> That email was just informational. Below are the details on my cluster - let
> me know if more is needed.
>
> I have 2 HBase clusters set up
> - for production, a 6-node cluster, 32G RAM, 8 processors
> - for dev, a 3-node cluster, 16G RAM, 4 processors
>
> 1. I installed Hadoop 0.20.2 and HBase 0.20.3 on both these clusters
> successfully.
Why not latest stable version, 0.20.6?
This was a couple of months ago.
> 2. After that I loaded 2G+ files into HDFS and HBASE table.
What's this mean? Was each of the .5M cells 2G in size, or was the total size 2G?
The total file size is 2G. Cells are on the order of hundreds of bytes.
> An example Hbase table looks like this:
> {NAME => 'TABLE', FAMILIES => [{NAME => 'data', VERSIONS => '100',
> COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
That looks fine.
> 3. I started stargate on one server and successfully read from HBase
> via a 3rd-party application.
> It took 600 seconds on the dev cluster and 250 on production to read .5M
> records from HBase via stargate.
That doesn't sound so good.
> 4. Later, to boost read performance, it was suggested that upgrading to
> HBase 0.20.6 would be helpful. I did that on production (w/o running the
> migrate script) and re-started stargate, and everything was running fine,
> though I did not see a bump in performance.
>
> 5. Eventually, I had to move to dev cluster from production because of some
> resource issues at our end. Dev cluster had 0.20.3 at this time. As I started
> loading more files into Hbase (<10 versions of <1G files) and converting my
> app to use hbase more heavily (via more stargate clients), the performance
> started degrading. I decided it was time to upgrade dev cluster as well to
> 0.20.6. (I did not run the migrate script here either; I missed this step
> in the doc.)
>
What kind of performance are you looking for from REST?
Do you have to use REST? Everything is base64'd so it's safe to transport.
I also have the Java API code (for testing purposes) and that gave similar
performance results (520 seconds on dev and 250 on the production cluster). Is
there a way to flush the cache before we run the next experiment? I suspect
the first lookup always takes longer and the later ones perform better.
I need something that can integrate with C++ - libcurl and stargate were the
easiest to start with. I could look at Thrift or anything else the HBase gurus
think might be a better fit performance-wise.
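For reference, the XML that Stargate returns is easy to unpack on the client side once it arrives. A minimal sketch of decoding the base64'd cells in Python (the sample response below is illustrative, not from your cluster; the same logic applies in a C++/libcurl client):

```python
import base64
import xml.etree.ElementTree as ET

def decode_cells(xml_payload):
    """Extract (column, value) pairs from a Stargate CellSet response.
    Stargate base64-encodes row keys, column names, and cell values."""
    root = ET.fromstring(xml_payload)
    cells = []
    for cell in root.iter('Cell'):
        column = base64.b64decode(cell.get('column')).decode()
        value = base64.b64decode(cell.text).decode()
        cells.append((column, value))
    return cells

# Illustrative single-row response (placeholder data, not from a real cluster):
sample = """<CellSet><Row key="cm93MQ==">
  <Cell column="ZGF0YTpmaWVsZA==" timestamp="1283378400000">aGVsbG8=</Cell>
</Row></CellSet>"""
print(decode_cells(sample))  # [('data:field', 'hello')]
```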
> 6. When Hbase 0.20.6 came back up on dev cluster (with increased block cache
> (.6) and region server handler counts (75) ), pointing to the same rootdir, I
> noticed that some tables were missing. I could see a mention of them in the
> logs, but not when I did 'list' in the shell. I recovered those tables using
> the add_table.rb script.
How did you shutdown this cluster? Did you reboot machines? Was your
hdfs homed on /tmp? What is going on on your systems? Are they
swapping? Did you give HBase more than its default memory? You read
the requirements and made sure ulimit and xceivers had been upped on
these machines?
Did not reboot machines. Neither HDFS nor HBase stores data or logs in /tmp.
They are not swapping.
The HBase heap size is 2G. I have upped the xceivers now on your recommendation.
Do I need to restart HDFS after making this change in hdfs-site.xml?
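For reference: in Hadoop 0.20 the datanodes read this setting at startup, so a datanode restart is needed for it to take effect. The property goes in hdfs-site.xml on each datanode; 4096 is the value commonly recommended for HBase, and note that the property name carries Hadoop's historical misspelling:

```xml
<!-- hdfs-site.xml on each datanode; requires a datanode restart -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```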
ulimit -n
2048
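A nofile limit of 2048 is on the low side for the HBase requirements, which suggest raising the open-file limit substantially. One common way is via /etc/security/limits.conf (the user name below is a placeholder for whichever account runs the HBase and Hadoop daemons):

```
# /etc/security/limits.conf -- "hadoop" is a placeholder user name
hadoop  -  nofile  32768
```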
> a. Is there a way to check the health of all Hbase tables in the
> cluster after an upgrade or even periodically, to make sure that everything
> is healthy ?
> b. I would like to be able to force this error again, check the
> health of HBase, and have it report to me that some tables were lost.
> Currently, I only found out because I had very little data and it was easy to
> tell.
>
In trunk there is such a tool. In 0.20.x, run a count against your
table. See the hbase shell. Type help to see how.
What tool are you talking about here? It wasn't clear. Count against which
table? I want HBase to check all tables, and I don't know how many tables I
have since there are too many. Is that possible?
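A rough per-table check in 0.20.x can be scripted against the hbase shell from the command line. A sketch, assuming `hbase` is on the PATH ('TABLE' below is a placeholder; `list` will enumerate the tables you don't remember):

```shell
# Enumerate all tables if you don't know their names:
echo "list" | hbase shell

# Count all rows of one table (slow on big tables -- it scans everything):
echo "count 'TABLE'" | hbase shell
```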
> 7. Here are the issues I face after this upgrade
> a. when I run stop-hbase.sh, it does not stop my regionservers on
> other boxes.
Why not? What's going on on those machines? If you tail the logs on
the hosts that won't go down and/or on the master, what do they say?
Tail the logs. They should give you (us) a clue.
They do go down with some errors in the log, but don't report it on the
terminal.
http://pastebin.com/0hYwaffL regionserver log
> b. It does start them using start-hbase.sh.
> c. Is it that stopping regionservers is not reported, but it does stop
> them (I see that happening on production cluster) ?
>
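For chasing the shutdown problem, the per-host daemon script and the logs are the places to look. A sketch, assuming a default layout under $HBASE_HOME (paths and log-file names may differ on your install):

```shell
# On a host whose regionserver won't stop, watch its log while stopping it:
tail -f $HBASE_HOME/logs/*regionserver*.log

# Stop just that regionserver from the same host:
$HBASE_HOME/bin/hbase-daemon.sh stop regionserver
```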
> 8. I started stargate on the upgraded 0.20.6 dev cluster.
> a. Earlier, when I sent a URL to look for a data row that did not
> exist, the return value was NULL; now I get an XML body with an HTTP
> 404/405 error. Everything works as expected for an existing data row.
The latter sounds RESTy. What would you expect of it? The null?
Yes, it should send NULL like it does on the production server. Is there anyone
else you can point me to who has used REST? This is the main showstopper
for me currently.
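A 404 for a missing row is standard REST behavior, so the cleanest fix is usually on the client side: treat 404/405 as "no such row" rather than as an error. A minimal Python sketch of that mapping (function names and the base-URL shape are illustrative, not part of Stargate; the same check on the response code works in a libcurl client):

```python
import urllib.error
import urllib.request

def interpret_status(status, body):
    """Map Stargate's HTTP status onto the old 'NULL for a missing row' contract."""
    if status in (404, 405):  # not found / method not allowed -> treat as absent
        return None
    return body

def get_row(base_url, table, row):
    """Fetch one row's XML payload; return None when the row does not exist.
    base_url is a placeholder such as 'http://host:8080' -- adjust for your
    Stargate instance."""
    try:
        with urllib.request.urlopen("%s/%s/%s" % (base_url, table, row)) as resp:
            return interpret_status(resp.getcode(), resp.read())
    except urllib.error.HTTPError as e:
        return interpret_status(e.code, None)
```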