Re: directory usage question
On Sep 6, 2014, at 9:32 AM, Ted Yu yuzhih...@gmail.com wrote:

> Can you post your hbase-site.xml ?
>
> /apps/hbase/data/archive/data/default is where HFiles are archived (e.g. when a column family is deleted, HFiles for this column family are stored here).
>
> /apps/hbase/data/data/default seems to be your hbase.rootdir

hbase.rootdir is defined to be hdfs://foo:8020/apps/hbase/data. I think that's the default that Ambari creates. So the HFiles in the archive subdirectory have been discarded and can be deleted safely?

> bq. a problem I'm having running map/reduce jobs against snapshots
>
> Can you describe the problem in a bit more detail ?

I don't understand what I'm seeing well enough to ask an intelligent question yet. I appear to be scanning duplicate rows when using initTableSnapshotMapperJob, but I'm trying to get a better understanding of how this works, since it's probably just something I'm doing wrong.

Brian

> Cheers
>
> On Sat, Sep 6, 2014 at 6:09 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote:
>
>> I'm trying to track down a problem I'm having running map/reduce jobs against snapshots. Can someone explain the difference between files stored in:
>>
>>   /apps/hbase/data/archive/data/default
>>
>> and files stored in:
>>
>>   /apps/hbase/data/data/default
>>
>> (Hadoop 2.4, HBase 0.98)
>>
>> Thanks
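For readers following the thread, a minimal, hedged sketch of how a snapshot-backed mapper job is typically wired up against the 0.98 API under discussion. The snapshot name, mapper, and restore path below are hypothetical placeholders, not taken from Brian's job; the final Path argument is the "temp directory" that comes up later in the thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotScanJob {

  // Identity mapper, just to make the sketch self-contained.
  static class SnapshotMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "snapshot-scan");
    job.setJarByClass(SnapshotScanJob.class);

    // "host_snapshot" and /tmp/snapshot-restore are hypothetical names.
    // The Path is the temp directory that the snapshot's region layout is
    // restored into before the mappers scan it; it must be writable by the
    // user submitting the job.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "host_snapshot",              // snapshot name
        new Scan(),                   // scan to apply over the snapshot
        SnapshotMapper.class,
        ImmutableBytesWritable.class, // mapper output key class
        Result.class,                 // mapper output value class
        job,
        true,                         // add HBase jars to the job classpath
        new Path("/tmp/snapshot-restore"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}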
Re: directory usage question
The files under the archive directory are referenced by snapshots. Please don't delete them manually. You can delete unused snapshots.

Cheers

On Sep 7, 2014, at 4:08 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote:

> [...]
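To make Ted's advice concrete, a short sketch of listing and deleting snapshots through the 0.98 Java admin API (where listSnapshots returns the protobuf-generated descriptions); the snapshot name being deleted is hypothetical. Once no snapshot references an archived HFile, HBase's own cleaner chore can reclaim it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.protobuf.generated.HBaseProtos.SnapshotDescription;

public class SnapshotCleanup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // List every snapshot so unused ones can be identified.
      for (SnapshotDescription sd : admin.listSnapshots()) {
        System.out.println(sd.getName() + " on table " + sd.getTable());
      }
      // "obsolete_host_snapshot" is a hypothetical name.
      admin.deleteSnapshot("obsolete_host_snapshot");
    } finally {
      admin.close();
    }
  }
}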
Re: directory usage question
initTableSnapshotMapperJob writes into this directory (indirectly) via RestoreSnapshotHelper.restoreHdfsRegions. Is this expected? I would have expected writes to be limited to the temp directory passed in the init call.

Brian

On Sep 7, 2014, at 8:17 AM, Ted Yu yuzhih...@gmail.com wrote:

> The files under the archive directory are referenced by snapshots. Please don't delete them manually. You can delete unused snapshots.
>
> [...]
Re: directory usage question
Eclipse doesn't show that RestoreSnapshotHelper.restoreHdfsRegions() is called by initTableSnapshotMapperJob (in master branch). Looking at TableMapReduceUtil.java in 0.98, I don't see a direct relation between the two. Do you have a stack trace or something else showing the relationship ?

Cheers

On Sun, Sep 7, 2014 at 5:48 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote:

> initTableSnapshotMapperJob writes into this directory (indirectly) via RestoreSnapshotHelper.restoreHdfsRegions. Is this expected? I would have expected writes to be limited to the temp directory passed in the init call.
>
> [...]
Re: directory usage question
> Eclipse doesn't show that RestoreSnapshotHelper.restoreHdfsRegions() is called by initTableSnapshotMapperJob (in master branch). Looking at TableMapReduceUtil.java in 0.98, I don't see a direct relation between the two. Do you have a stack trace or something else showing the relationship ?

Right. That’s what I meant by ‘indirectly’. This is a stack trace that was caused by an ownership conflict:

java.io.IOException: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException: Permission denied: user=hbase, access=WRITE, inode=/apps/hbase/data/archive/data/default/Host/c41d632d5eee02e1883215460e5c261d/p:hdfs:hdfs:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:265)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:251)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:232)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:176)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5509)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5491)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5465)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3608)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3578)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3552)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:754)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.hbase.util.ModifyRegionUtils.createRegions(ModifyRegionUtils.java:131)
    at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.cloneHdfsRegions(RestoreSnapshotHelper.java:475)
    at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.restoreHdfsRegions(RestoreSnapshotHelper.java:208)
    at org.apache.hadoop.hbase.snapshot.RestoreSnapshotHelper.copySnapshotForScanner(RestoreSnapshotHelper.java:733)
    at org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat.setInput(TableSnapshotInputFormat.java:397)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableSnapshotMapperJob(TableMapReduceUtil.java:301)
    at net.digitalenvoy.hp.job.ParseHostnamesJob.run(ParseHostnamesJob.java:77)
    at net.digitalenvoy.hp.HostProcessor.run(HostProcessor.java:165)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at net.digitalenvoy.hp.HostProcessor.main(HostProcessor.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
need help understanding log output
I have a map/reduce job that is consistently failing with timeouts. The failing mapper log files contain a series of records similar to those below. When I look at the HBase and HDFS logs (on foo.net in this case) I don’t see anything obvious at these timestamps. The mapper task times out at/near attempt=25/35. Can anyone shed light on what these log entries mean?

Thanks - Brian

2014-09-07 09:36:51,421 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=10/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10029 ms, replay 1062 ops
2014-09-07 09:37:01,642 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=11/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10023 ms, replay 1062 ops
2014-09-07 09:37:12,064 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=12/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20182 ms, replay 1062 ops
2014-09-07 09:37:32,708 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=13/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20140 ms, replay 1062 ops
2014-09-07 09:37:52,940 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=14/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
2014-09-07 09:38:13,324 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=15/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
Re: directory usage question
Your cluster is an insecure HBase deployment, right ? Are all files under /apps/hbase/data/archive/data/default owned by user 'hdfs' ?

BTW in tip of 0.98, with HBASE-11742, related code looks a bit different.

Cheers

On Sun, Sep 7, 2014 at 8:27 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote:

> Right. That’s what I meant by ‘indirectly’. This is a stack trace that was caused by an ownership conflict:
>
> [...]
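One way to answer the ownership question is the plain Hadoop FileSystem API (equivalent to running hdfs dfs -ls on the same path). A minimal sketch, assuming it runs with the cluster's HDFS configuration on the classpath; the path matches the one in the stack trace:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveOwnerCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print permissions, owner:group and path for each archived table dir.
    for (FileStatus st : fs.listStatus(new Path("/apps/hbase/data/archive/data/default"))) {
      System.out.println(st.getPermission() + " " + st.getOwner() + ":"
          + st.getGroup() + " " + st.getPath());
    }
    fs.close();
  }
}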
Re: need help understanding log output
When the number of attempts is greater than the value of hbase.client.start.log.errors.counter (default 9), AsyncProcess produces log entries like the ones you cited. The interval following 'retrying after' is the backoff time.

Which release of HBase are you using ?

Cheers

On Sun, Sep 7, 2014 at 8:50 AM, Brian Jeltema brian.jelt...@digitalenvoy.net wrote:

> I have a map/reduce job that is consistently failing with timeouts. The failing mapper log files contain a series of records similar to those below. When I look at the HBase and HDFS logs (on foo.net in this case) I don’t see anything obvious at these timestamps. The mapper task times out at/near attempt=25/35. Can anyone shed light on what these log entries mean?
>
> [...]
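For reference, a hedged sketch of the client-side settings behind those log lines. The 35 in attempt=x/35 corresponds to hbase.client.retries.number (default 35 in 0.98); the values set here are illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientRetrySettings {
  public static Configuration tunedConf() {
    Configuration conf = HBaseConfiguration.create();
    // The 35 in "attempt=x/35" is the client retry budget.
    conf.setInt("hbase.client.retries.number", 35);
    // Base pause (ms) that the escalating backoff ("retrying after N ms")
    // is derived from.
    conf.setLong("hbase.client.pause", 100);
    // AsyncProcess stays quiet until attempts exceed this threshold
    // (default 9), which is why the cited logs start at attempt=10.
    conf.setInt("hbase.client.start.log.errors.counter", 9);
    return conf;
  }
}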
Re: One-table w/ multi-CF or multi-table w/ one-CF?
I would suggest rethinking column families and looking at your potential for a slightly different row key. Going with column families doesn't really make sense. Also, how wide are the rows (worst case)? One idea is to make type part of the RK… A sketch of that layout follows at the end of this thread.

HTH

-Mike

On Sep 7, 2014, at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

> Hi Michael,
>
> Thanks for the questions. I'm modeling dynamic graphs in HBase: all elements (vertices, edges) have a timestamp, and I can query things like "events between A and B for the last 7 days". CFs are used for grouping different types of data for the same account. However, I have lots of skew in the data, so to avoid having too much in the same row, I had to move what was in CQs into RKs. So a CF now acts more like a table. There's one CF containing the sequence of events ordered by timestamp, and this CF is quite different, as its use case is mostly in mapreduce jobs.
>
> Jianshi
>
> On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel michael_se...@hotmail.com wrote:
>
>> Again, a silly question: why are you using column families? Just to play devil's advocate in terms of design, why are you not treating your row as a record? Think hierarchical, not relational.
>>
>> This really gets into some design theory. Think of a column family as a way to group data that has the same row key and references the same thing, yet where the data in each column family is used separately.
>>
>> The example I always turn to when teaching is an order entry system at a retailer. You generate data which is segmented by business process (order entry, pick slips, shipping, invoicing). All reflect a single order, yet the data in each process tends to be accessed separately. (You don't need the order entry when using the pick slip to pull orders from the warehouse.) So here, the data access pattern is that each column family is used separately, except when generating the data (the order entry is used to generate the pick slip(s) and set up things like backorders, and then the pick process generates the shipping slip(s), etc.). And since they are all focused on the same order, they have the same row key.
>>
>> So it's reasonable to ask how you are accessing the data and how you are designing your HBase model. Many times, developers create a model using column families because they are thinking in terms of relationships, not access patterns on the data.
>>
>> Does this make sense?
>>
>> On Sep 6, 2014, at 7:46 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:
>>
>>> BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like type_of_events#rev_timestamp#id. With binning, it looks like bin_number#type_of_events#rev_timestamp#id. The bin_number could be id % 256 or timestamp % 256, and the table could be pre-split. So future ingestions could do parallel insertion to #bin regions, even without a pre-split.
>>>
>>> Jianshi
>>>
>>> On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
>>>
>>>> Each range might span multiple regions, depending on the data size I want to scan for MR jobs. The ranges are dynamic, specified by the user, but the number of bins can be static (when the table/schema is created).
>>>>
>>>> Jianshi
>>>>
>>>> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu yuzhih...@gmail.com wrote:
>>>>
>>>>> bq. 16 to 256 ranges
>>>>>
>>>>> Would each range be within a single region, or may a range span regions? Are the ranges dynamic? Using the command line for multiple ranges would be out of the question; a file with ranges is needed.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
>>>>>
>>>>>> Thanks Ted for the reference. That's right: extend row.start and row.end to specify multiple ranges, and also getSplits. I would probably bin the event sequence CF into 16 to 256 bins, so 16 to 256 ranges.
>>>>>>
>>>>>> Jianshi
>>>>>>
>>>>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu yuzhih...@gmail.com wrote:
>>>>>>
>>>>>>> Please refer to HBASE-5416 (Filter on one CF and if a match, then load and return full row).
>>>>>>>
>>>>>>> bq. to extend TableInputFormat to accept multiple row ranges
>>>>>>>
>>>>>>> You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified? How many such ranges do you normally need?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
>>>>>>>
>>>>>>>> Thanks Ted, I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is to make working with TableInputFormat easier; otherwise I would've binned it into 256 splits. (Well, I think a good way is to extend TableInputFormat to accept multiple row ranges; if there's an existing efficient implementation, please let me know :) Would you elaborate a little more on the heap memory usage during a scan? Is there any reference to that?
>>>>>>>>
>>>>>>>> Jianshi
>>>>>>>>
>>>>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu yuzhih...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> If you use monotonically increasing rowkeys, separating
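To make the binning scheme concrete, a hedged sketch of the bin_number#type_of_events#rev_timestamp#id layout Jianshi describes, plus one Scan per bin so a single MapReduce job covers every bin. The table name, column layout, and mapper are hypothetical, and the multi-Scan setup assumes the List<Scan> overload of TableMapReduceUtil.initTableMapperJob that 0.98 ships for MultiTableInputFormat:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class BinnedRowKeys {
  static final int NUM_BINS = 256;

  // bin_number#type_of_events#rev_timestamp#id, with bin = id % 256 and a
  // reverse timestamp so newest events sort first within a bin.
  static byte[] rowKey(String eventType, long timestamp, long id) {
    long revTs = Long.MAX_VALUE - timestamp;
    int bin = (int) (id % NUM_BINS);
    return Bytes.toBytes(String.format("%02x#%s#%019d#%d", bin, eventType, revTs, id));
  }

  // One Scan per bin; each Scan names its table via the scan attribute that
  // MultiTableInputFormat reads, so the job's splits cover all 256 bins.
  static void configureJob(Job job, String table, String eventType) throws IOException {
    List<Scan> scans = new ArrayList<Scan>();
    for (int bin = 0; bin < NUM_BINS; bin++) {
      Scan scan = new Scan();
      scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(table));
      // '$' is '#' + 1 in ASCII, so [prefix#, prefix$) covers the whole bin.
      scan.setStartRow(Bytes.toBytes(String.format("%02x#%s#", bin, eventType)));
      scan.setStopRow(Bytes.toBytes(String.format("%02x#%s$", bin, eventType)));
      scans.add(scan);
    }
    TableMapReduceUtil.initTableMapperJob(scans, EventMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
  }

  // Identity mapper, just to keep the sketch self-contained.
  static class EventMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, value);
    }
  }
}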