I'm trying to track down a problem I'm having running map/reduce jobs against
snapshots.
Can someone explain the difference between files stored in:
/apps/hbase/data/archive/data/default
and files stored in
/apps/hbase/data/data/default
(Hadoop 2.4, HBase 0.98)
Thanks
Can you post your hbase-site.xml?
/apps/hbase/data/archive/data/default is where HFiles are archived (e.g.
when a column family is deleted, its HFiles are moved here; HFiles still
referenced by snapshots also end up here after compaction).
/apps/hbase/data/data/default is the live data directory for the default
namespace under your hbase.rootdir.
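For reference, both paths above follow from hbase.rootdir. A minimal hbase-site.xml fragment consistent with the paths in this thread (the NameNode address is a placeholder, not from the original messages):

```xml
<!-- hbase.rootdir puts everything under /apps/hbase/data on HDFS.
     Live HFiles then sit under data/<namespace>/<table>/...
     and archived HFiles under archive/data/<namespace>/<table>/... -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode:8020/apps/hbase/data</value>
</property>
```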
bq. a problem I'm having running map/reduce
Hi,
I'm currently putting everything into one table (to make cross-reference
queries easier), and there's one CF whose rowkeys are very different from
the rest. It currently works well, but I'm wondering if it will cause
performance issues in the future.
So my questions are
1) will there be
Is this the same table you mentioned in the thread about RegionTooBusyException?
If you move the column family to another table, you may have to handle
atomicity yourself - currently atomic operations are within region
boundaries.
Cheers
On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang
Hi Ted,
Yes, that's the table having RegionTooBusyExceptions :) But the performance
I care most about is scan performance.
It's mostly for analytics, so I don't care much about atomicity currently.
What's your suggestion?
Jianshi
On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu yuzhih...@gmail.com wrote:
Well, write performance is also important... I'll probably ingest 1k~10k
records/second.
Jianshi
On Sun, Sep 7, 2014 at 1:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
If you use monotonically increasing rowkeys, separating the column family
out into a new table would give you the same issue you're facing today.
With a single table, the essential column family feature would reduce the
amount of heap memory used during scans. With two tables, there is no such
facility.
I ran some of Aphyr's Jepsen tests against HBase. The conclusion is that
HBase performed well.
The test I used is at
https://github.com/rayokota/jepsen/blob/old/src/jepsen/hbase.clj.
I used CDH 5.1.2, which bundles hbase-0.98.1+cdh5.1.2+70.
The first test (hbase-app) is to simply create
Thanks Ted,
I'll pre-split the table during ingestion. The reason to keep the rowkey
monotonic is that it's easier to work with TableInputFormat; otherwise I
would've binned it into 256 splits. (Well, I think a good way is to extend
TableInputFormat to accept multiple row ranges, if there's an existing
Thank you for doing this Robert. That is some nice work on your part. It
does not look like you had to do many mods to jepsen going by your commits
(though your cji contrib would have had me stumped for one). I was under
the impression jepsen was in a state of transition so had stayed away...
Please refer to HBASE-5416 Filter on one CF and if a match, then load and
return full row
bq. to extend TableInputFormat to accept multiple row ranges
You mean extending hbase.mapreduce.scan.row.start and
hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
How many such
Thanks Ted for the reference.
That's right: extend row.start and row.stop to specify multiple ranges,
and also getSplits.
I would probably bin the event-sequence CF into 16 to 256 bins, so 16 to
256 ranges.
Jianshi
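The multiple-row-ranges idea above can be sketched in plain Java; the class and method names here are illustrative assumptions, not code from the thread or from HBase. Each emitted [start, stop) key pair could then back one Scan handed to a multi-scan input format:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the multi-range scan discussed above: for a query
// over one event type and a reversed-timestamp window, emit one [start, stop)
// rowkey pair per bin of the binned key scheme.
public class BinRanges {

    // revTsStart must be < revTsStop (reversed timestamps: newer events sort first).
    static List<String[]> ranges(int numBins, String eventType,
                                 long revTsStart, long revTsStop) {
        List<String[]> out = new ArrayList<>();
        for (int bin = 0; bin < numBins; bin++) {
            // Zero-padded bin prefix keeps lexicographic order == numeric order.
            String prefix = String.format("%03d#%s#", bin, eventType);
            out.add(new String[] {
                prefix + String.format("%019d", revTsStart),
                prefix + String.format("%019d", revTsStop)
            });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] r : ranges(16, "payment", 100L, 200L)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

With 16 bins the user's one logical range becomes 16 physical scans, one per pre-split region group.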
On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote:
Please
Lars,
We are facing a similar situation on a similar cluster configuration...
We are seeing high I/O-wait percentages on some machines in our cluster...
We have short-circuit reads enabled, but we still face the same problem...
the CPU wait goes up to 50% in some cases while issuing
Each range might span multiple regions, depending on the data size I want
to scan for MR jobs.
The ranges are dynamic, specified by the user, but the number of bins can
be static (fixed when the table/schema is created).
Jianshi
On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:
bq. 16
Also the hbase version is 0.94.1
On Sun, Sep 7, 2014 at 12:00 AM, kiran kiran.sarvabho...@gmail.com wrote:
BTW, a little explanation of the binning I mentioned.
Currently the rowkey looks like type_of_events#rev_timestamp#id.
With binning, it becomes bin_number#type_of_events#rev_timestamp#id.
The bin_number could be id % 256 or timestamp % 256, and the table could
be pre-split. So
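The binned key scheme described above can be sketched in plain Java; the class, format string, and bin count below are illustrative assumptions, not code from the thread:

```java
// Minimal sketch of the binned rowkey scheme:
// bin_number#type_of_events#rev_timestamp#id, with bin = id % 256,
// so writes spread across pre-split regions instead of one hot region.
public class BinnedRowKey {
    static final int NUM_BINS = 256;

    static String rowKey(String eventType, long timestampMillis, long id) {
        int bin = (int) (id % NUM_BINS);
        // Reversed timestamp makes newer events sort first within a bin.
        long revTs = Long.MAX_VALUE - timestampMillis;
        // Zero-pad bin and revTs so lexicographic order matches numeric order.
        return String.format("%03d#%s#%019d#%d", bin, eventType, revTs, id);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("payment", 1410048000000L, 42L));
    }
}
```

Pre-splitting the table at the 256 bin-prefix boundaries then gives each region a stable share of the write load.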
Again, a silly question.
Why are you using column families?
Just to play devil's advocate in terms of design, why are you not treating your
row as a record? Think hierarchical, not relational.
This really gets into some design theory.
Think of a Column Family as a way to group data that has the
What type of drives, controllers, and network bandwidth do you have?
Just curious.
On Sep 6, 2014, at 7:37 PM, kiran kiran.sarvabho...@gmail.com wrote:
The answer to "can they solve your problem" is yes.
For how to do this, start reading about your options here so you can pick
what works best for your need (and your version of hbase):
http://hbase.apache.org/book/ops.backup.html (and links out of this page)
Hi Michael,
Thanks for the questions.
I'm modeling dynamic graphs in HBase; all elements (vertices, edges) have a
timestamp, and I can query things like "events between A and B in the last 7
days".
CFs are used for grouping different types of data for the same account.
However, I have lots of
Thinking about it again: if you ran into HBASE-7336 you'd see high CPU load,
but *not* IOWAIT.
0.94 is at 0.94.23, you should upgrade. A lot of fixes, improvements, and
performance enhancements went in since 0.94.4.
You can do a rolling upgrade straight to 0.94.23.
With that out of the way,