Thanks Jianshi for that helpful information. For use case 1), I think it depends on the data ingestion rate at the point when the regions need to split. A synchronous split operation makes some sense there if you want the regions to contain specific time ranges and/or a specific number of records.
Use case 2) I think is a good match for KeyPrefixRegionSplitPolicy or DelimitedKeyPrefixRegionSplitPolicy, since the regions will then be split based on the <type>: the former if the type length is fixed, the latter if the type is of varying length but delimited with |.

On second thought, it might even be possible to solve 1) with those prefix-based split policies if you use a key prefix that also varies monotonically, or that the client switches once it has reached some threshold, e.g. after writing X billion data points use prefix 001, and for the next Y billion rows use prefix 002, or something like that.

cheers,
esteban.

--
Cloudera, Inc.

On Wed, Sep 17, 2014 at 11:53 AM, Jianshi Huang <[email protected]> wrote:

> Hi Esteban,
>
> Two reasons to split dynamically,
>
> 1) I have a column family that stores timeseries data for mapreduce tasks,
> and the rowkey is monotonically increasing to make scanning easier.
>
> 2) (a better reason), I'm storing multiple types of data in the same table,
> and I have about 500TB of data in total. That's many billions of rows and
> many thousands of regions. I want to make sure ingesting one type of data
> won't touch every region which will cause a lot of fragments and merge
> operations, the rowkey is designed as <type>|<hash>|<id>.
>
> So either way I would want a dynamic split in my design.
>
> Jianshi
>
> On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez <[email protected]>
> wrote:
>
> > Jianshi,
> >
> > The retry is not an expected behavior that the client should be doing. In
> > fact you don't want your clients to issue admin operations to the cluster
> > ;)
> >
> > Shahab's option is the best alternative by polling when the number of
> > regions has changed in the table you want to modify the splits dynamically.
> > The JIRA that Ted suggested requires modification in the core table
> > operations to support sync operations and requires some major work to do it
> > right.
> > Ted's alternative to create the splits at table creation time is the
> > best option if you can pre-split IMHO.
> >
> > If you could elaborate more on the practical reasons you mention to create
> > synchronously those new regions that would be great for us. Maybe it's
> > related to multi-tenancy but I'm just guessing :)
> >
> > esteban.
> >
> > --
> > Cloudera, Inc.
> >
> > On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu <[email protected]> wrote:
> >
> > > Jianshi:
> > > See HBASE-11608 Add synchronous split
> > >
> > > bq. createTable does something special?
> > >
> > > Yes. See this in HBaseAdmin:
> > >
> > >   public void createTable(final HTableDescriptor desc, byte [][] splitKeys)
> > >
> > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang <[email protected]>
> > > wrote:
> > >
> > > > I see Shahab, async makes sense, but I prefer that the HBase client does
> > > > the retry for me, and let me specify a timeout parameter.
> > > >
> > > > One question, does that mean adding multiple splits into one region has to
> > > > be done sequentially? How can I add region splits in parallel? Does
> > > > createTable do something special?
> > > >
> > > > Jianshi
> > > >
> > > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus <[email protected]>
> > > > wrote:
> > > >
> > > > > Split is an async operation. When you call it, and the call returns, it
> > > > > does not mean that the region has been created yet.
> > > > >
> > > > > So either you wait for a while (using Thread.sleep) or check for the number
> > > > > of regions in a loop until they have increased to the value you want
> > > > > and then access the region. The former is not a good idea, though you can
> > > > > try it out just to make sure that this is indeed the issue.
> > > > >
> > > > > What I am suggesting is something like (pseudo code):
> > > > >
> > > > > while (new#regions <= old#regions)
> > > > > {
> > > > >   new#regions = admin.getLatest#regions
> > > > > }
> > > > >
> > > > > Regards,
> > > > > Shahab
> > > > >
> > > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > I constantly get the following errors when I tried to add splits to a
> > > > > > table.
> > > > > >
> > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > > > > org.apache.hadoop.hbase.NotServingRegionException: Region
> > > > > > grapple_vertices,cust|rval#7ffffeb7cffca280|1636500018299676757,1410945568484.e7743495366df3c82a8571b36c2bdac3.
> > > > > > is not online on lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > > > > >   at org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > > > > >   at
> > > > > >
> > > > > > But when I checked the region server (from HBase's web UI), the region is
> > > > > > actually listed there.
> > > > > >
> > > > > > What does the error mean actually? How can I solve it?
> > > > > >
> > > > > > Currently I'm adding splits single-threaded, and I want to make it
> > > > > > parallel, is there anything I need to be careful about?
> > > > > >
> > > > > > Here's the code for adding splits:
> > > > > >
> > > > > >   def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
> > > > > >     val admin = new HBaseAdmin(conn)
> > > > > >
> > > > > >     try {
> > > > > >       val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> > > > > >       val regionStartKeys = regions.map(_.getStartKey)
> > > > > >       val splits = splitKeys.diff(regionStartKeys)
> > > > > >
> > > > > >       splits.foreach { splitPoint =>
> > > > > >         admin.split(tableName.getBytes("UTF8"), splitPoint)
> > > > > >       }
> > > > > >       // NOTE: important!
> > > > > >       admin.balancer()
> > > > > >     }
> > > > > >     finally {
> > > > > >       admin.close()
> > > > > >     }
> > > > > >   }
> > > > > >
> > > > > > Any help is appreciated.
> > > > > >
> > > > > > --
> > > > > > Jianshi Huang
> > > > > >
> > > > > > LinkedIn: jianshi
> > > > > > Twitter: @jshuang
> > > > > > Github & Blog: http://huangjs.github.com/
> > > >
> > > > --
> > > > Jianshi Huang
> > > >
> > > > LinkedIn: jianshi
> > > > Twitter: @jshuang
> > > > Github & Blog: http://huangjs.github.com/
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/

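For reference, here is a minimal sketch of the polling approach discussed in this thread: after an asynchronous split request, keep checking the region count until it reaches the value you expect, and give up after a timeout. The object name SplitPolling, the method awaitRegionCount and all of its parameters are hypothetical names chosen for illustration and are not part of the HBase client API. The region count is passed in as a plain function so the sketch stays self-contained.

```scala
import scala.annotation.tailrec

// Hypothetical helper, not part of HBase: polls a caller-supplied region
// counter until it reaches the expected value or the timeout expires.
object SplitPolling {

  /** Returns true if `regionCount()` reached `expected` within `timeoutMs`. */
  def awaitRegionCount(regionCount: () => Int,
                       expected: Int,
                       timeoutMs: Long,
                       pollIntervalMs: Long = 500L): Boolean = {
    // Monotonic deadline, so wall-clock adjustments do not affect the wait.
    val deadline = System.nanoTime() + timeoutMs * 1000000L

    @tailrec
    def loop(): Boolean =
      if (regionCount() >= expected) true                 // split is visible
      else if (System.nanoTime() - deadline >= 0) false   // timed out
      else {
        Thread.sleep(pollIntervalMs)                      // back off, then retry
        loop()
      }

    loop()
  }
}
```

With the admin object from the snippet quoted above, the counter could be something like `() => admin.getTableRegions(tableName.getBytes("UTF8")).size`, checked after each `admin.split(...)` call and before issuing the next one. That wiring is an assumption and has not been tested against a cluster.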