directory usage question

2014-09-06 Thread Brian Jeltema
I'm trying to track down a problem I'm having running map/reduce jobs against snapshots. Can someone explain the difference between files stored in /apps/hbase/data/archive/data/default and files stored in /apps/hbase/data/data/default? (Hadoop 2.4, HBase 0.98) Thanks

Re: directory usage question

2014-09-06 Thread Ted Yu
Can you post your hbase-site.xml? /apps/hbase/data/archive/data/default is where HFiles are archived (e.g. when a column family is deleted, HFiles for this column family are stored here). /apps/hbase/data/data/default seems to be your hbase.rootdir bq. a problem I'm having running map/reduce

One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi, I'm currently putting everything into one table (to make cross reference queries easier) and there's one CF which contains rowkeys very different to the rest. Currently it works well, but I'm wondering if it will cause performance issues in the future. So my questions are 1) will there be

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
Is this the same table you mentioned in the thread about RegionTooBusyException ? If you move the column family to another table, you may have to handle atomicity yourself - currently atomic operations are within region boundaries. Cheers On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Ted, Yes, that's the table having RegionTooBusyExceptions :) But the performance I care most about is scan performance. It's mostly for analytics, so I don't care much about atomicity currently. What's your suggestion? Jianshi On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu yuzhih...@gmail.com wrote:

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Well, write performance is also important... I'll probably ingest 1k~10k records/second. Jianshi On Sun, Sep 7, 2014 at 1:11 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi Ted, Yes, that's the table having RegionTooBusyExceptions :) But the performance I care most about is scan

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
If you use monotonically increasing rowkeys, separating out the column family into a new table would give you the same issue you're facing today. With a single table, the essential column family feature would reduce the amount of heap memory used during a scan. With two tables, there is no such facility.

Call Me Maybe HBase

2014-09-06 Thread Robert Yokota
I ran some of Aphyr's Jepsen tests against HBase. The conclusion is that HBase performed well. The test I used is at https://github.com/rayokota/jepsen/blob/old/src/jepsen/hbase.clj. I used CDH 5.1.2, which bundles hbase-0.98.1+cdh5.1.2+70. The first test (hbase-app) is to simply create

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted, I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is to make it easier to work with TableInputFormat, otherwise I would've binned it into 256 splits. (well, I think a good way is to extend TableInputFormat to accept multiple row ranges, if there's an existing

Re: Call Me Maybe HBase

2014-09-06 Thread Stack
Thank you for doing this Robert. That is some nice work on your part. It does not look like you had to do many mods to jepsen going by your commits (though your cji contrib would have had me stumped for one). I was under the impression jepsen was in a state of transition so had stayed away...

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Ted Yu
Please refer to HBASE-5416 "Filter on one CF and if a match, then load and return full row". bq. to extend TableInputFormat to accept multiple row ranges You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified? How many such

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Thanks Ted for the reference. That's right, extend row.start and row.stop to specify multiple ranges, and also extend getSplits. I would probably bin the event sequence CF into 16 to 256 bins. So 16 to 256 ranges. Jianshi On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote: Please

Re: HBase - Performance issue

2014-09-06 Thread kiran
Lars, We are facing a similar situation on the similar cluster configuration... We are having high I/O wait percentages on some machines in our cluster... We have short circuit reads enabled but we are still facing a similar problem.. the CPU wait goes up to 50% in some cases while issuing

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Each range might span multiple regions, depending on the data size I want to scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (when the table/schema is created). Jianshi On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote: bq. 16
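A minimal sketch of how one user-specified time range could be expanded into one scan range per bin, as discussed above. Everything here is an assumption for illustration: the class and method names are hypothetical, the key layout (zero-padded bin prefix, then event type, then a reversed timestamp computed as Long.MAX_VALUE minus the event time) is only one possible encoding, and string keys are used for readability where real HBase row keys are byte arrays. It does not use or extend TableInputFormat itself; each resulting pair would still have to be turned into a Scan or an input split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: expands one logical time range
// into one (startRow, stopRow) pair per bin.
class BinnedScanRanges {

    // A scan range: startRow is inclusive, stopRow is exclusive.
    static final class Range {
        final String startRow;
        final String stopRow;

        Range(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Assumed key layout: <bin, 3 digits>#<eventType>#<reversed timestamp, 19 digits>#<id>.
    // Because timestamps are reversed, the newest event in the window has the
    // smallest key. The ranges cover events with fromMillis < timestamp <= toMillis.
    static List<Range> rangesFor(int numBins, String eventType, long fromMillis, long toMillis) {
        String revStart = String.format("%019d", Long.MAX_VALUE - toMillis);
        String revStop = String.format("%019d", Long.MAX_VALUE - fromMillis);
        List<Range> ranges = new ArrayList<>();
        for (int bin = 0; bin < numBins; bin++) {
            String prefix = String.format("%03d#%s#", bin, eventType);
            ranges.add(new Range(prefix + revStart, prefix + revStop));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One range per bin for a hypothetical "click" event type.
        for (Range r : rangesFor(4, "click", 1409400000000L, 1410000000000L)) {
            System.out.println(r.startRow + " .. " + r.stopRow);
        }
    }
}
```

With a static bin count, the number of ranges per query is fixed (16 to 256 in the discussion above), while the range boundaries change with each user request.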

Re: HBase - Performance issue

2014-09-06 Thread kiran
Also the hbase version is 0.94.1 On Sun, Sep 7, 2014 at 12:00 AM, kiran kiran.sarvabho...@gmail.com wrote: Lars, We are facing a similar situation on the similar cluster configuration... We are having high I/O wait percentages on some machines in our cluster... We have short circuit reads

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like type_of_events#rev_timestamp#id. And with binning, it looks like bin_number#type_of_events#rev_timestamp#id. The bin_number could be id % 256 or timestamp % 256. And the table could be pre-split. So
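The binned layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's actual code: the class and method names are hypothetical, the bin is taken as id % 256, the reversed timestamp is assumed to be Long.MAX_VALUE minus the event time, and the zero padding is added so that string comparison matches numeric order. String keys are used for readability; real HBase row keys are byte arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: builds binned row keys of the form
// bin_number#type_of_events#rev_timestamp#id and the matching pre-split points.
class BinnedRowKey {
    static final int NUM_BINS = 256;

    // Bin derived from the id; floorMod keeps the result in [0, 255]
    // even for negative ids.
    static int binFor(long id) {
        return (int) Math.floorMod(id, (long) NUM_BINS);
    }

    // Assumption: the reversed timestamp is Long.MAX_VALUE - timestamp, zero-padded
    // to 19 digits so that newer events sort first within a bin and event type.
    static String rowKey(long id, String eventType, long timestampMillis) {
        long revTimestamp = Long.MAX_VALUE - timestampMillis;
        return String.format("%03d#%s#%019d#%d", binFor(id), eventType, revTimestamp, id);
    }

    // Split points for pre-splitting into one region per bin:
    // NUM_BINS - 1 boundaries, "001#" through "255#".
    static List<String> splitKeys() {
        List<String> keys = new ArrayList<>();
        for (int bin = 1; bin < NUM_BINS; bin++) {
            keys.add(String.format("%03d#", bin));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rowKey(1000L, "click", 1410000000000L));
        System.out.println(splitKeys().size() + " split points");
    }
}
```

Prefixing the bin spreads consecutive writes across regions, at the cost that a time-range read has to issue one scan per bin.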

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Michael Segel
Again, a silly question. Why are you using column families? Just to play devil’s advocate in terms of design, why are you not treating your row as a record? Think hierarchical, not relational. This really gets into some design theory. Think Column Family as a way to group data that has the

Re: HBase - Performance issue

2014-09-06 Thread Michael Segel
What type of drives, controllers, and network bandwidth do you have? Just curious. On Sep 6, 2014, at 7:37 PM, kiran kiran.sarvabho...@gmail.com wrote: Also the hbase version is 0.94.1 On Sun, Sep 7, 2014 at 12:00 AM, kiran kiran.sarvabho...@gmail.com wrote: Lars, We are facing a

Re: question about incremental backup and cluster replication

2014-09-06 Thread Suraj Varma
The answer to can they solve your problem is yes. For how to do this, start reading about your options here so you can pick what works best for your need (and your version of hbase): http://hbase.apache.org/book/ops.backup.html (and links out of this page)

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-06 Thread Jianshi Huang
Hi Michael, Thanks for the questions. I'm modeling dynamic Graphs in HBase, all elements (vertices, edges) have a timestamp and I can query things like events between A and B for the last 7 days. CFs are used for grouping different types of data for the same account. However, I have lots of

Re: HBase - Performance issue

2014-09-06 Thread lars hofhansl
Thinking about it again, if you ran into HBASE-7336 you'd see high CPU load, but *not* IOWAIT. 0.94 is at 0.94.23, you should upgrade. A lot of fixes, improvements, and performance enhancements went in since 0.94.4. You can do a rolling upgrade straight to 0.94.23. With that out of the way,