Thanks Esteban for the suggestion.

For case 2), I don't think KeyPrefixRegionSplitPolicy will be enough: we're constantly adding new types, so the number of types is unknown at the beginning. Whenever a new type of data arrives, the ingestion job adds pre-splits [type|00, type|01, ..., type|FF] to the table. Data is ingested one type after another, so without those splits in place, ingestion will be too slow.
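For illustration, a minimal sketch of how those per-type split keys could be generated, assuming the <type>|<hash>|<id> rowkey layout described later in the thread, and assuming 00..FF denotes the first byte of the hash (the helper name is made up):

    // Generate 256 split keys [type|0x00 .. type|0xFF] for a newly seen type.
    def splitKeysForType(tpe: String): Seq[Array[Byte]] = {
      val prefix = (tpe + "|").getBytes("UTF-8")
      (0x00 to 0xFF).map(b => prefix :+ b.toByte)
    }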
For case 1), I thought about binning, but it makes scans in TableInputFormat more complicated. I think automatic pre-splits can solve it, so currently a sampling process is run to compute the splitKeys for each batch of timeseries data to be ingested.

Jianshi

On Thu, Sep 18, 2014 at 3:19 AM, Esteban Gutierrez <[email protected]> wrote:

> Thanks Jianshi for that helpful information,
>
> I think for use case 1), when the regions need to split depends on the data ingestion rate. The synchronous split operation makes some sense there if you want the regions to contain specific time ranges and/or numbers of records.
>
> Use case 2) I think is a good match for KeyPrefixRegionSplitPolicy or DelimitedKeyPrefixRegionSplitPolicy, since the regions will be split based on the <type>, either when the type length is fixed or when the type is of varying length but delimited with |.
>
> On second thought, it might even be possible to solve 1) with those prefix-based split policies if you use a key prefix that also varies monotonically, or that is passed by the client once it has reached some threshold, e.g. after writing X billion data points use prefix 001, and for the next Y billion data rows use prefix 002, or something like that.
>
> cheers,
> esteban.
>
> --
> Cloudera, Inc.
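As a reference point, a minimal sketch of attaching such a split policy at table creation time. It assumes an HBase 0.98-era client; the column family name is made up, and the "DelimitedKeyPrefixRegionSplitPolicy.delimiter" configuration key is my recollection of the policy's setting, so double-check it against your HBase version:

    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
    import org.apache.hadoop.hbase.client.HBaseAdmin

    val admin = new HBaseAdmin(HBaseConfiguration.create())
    val desc = new HTableDescriptor(TableName.valueOf("grapple_vertices"))
    desc.addFamily(new HColumnDescriptor("d")) // hypothetical family name
    // Split regions on the rowkey prefix up to the first '|' (the <type> part).
    desc.setValue(HTableDescriptor.SPLIT_POLICY,
      "org.apache.hadoop.hbase.regionserver.DelimitedKeyPrefixRegionSplitPolicy")
    desc.setValue("DelimitedKeyPrefixRegionSplitPolicy.delimiter", "|")
    admin.createTable(desc)
    admin.close()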
> On Wed, Sep 17, 2014 at 11:53 AM, Jianshi Huang <[email protected]> wrote:
>
> > Hi Esteban,
> >
> > Two reasons to split dynamically:
> >
> > 1) I have a column family that stores timeseries data for mapreduce tasks, and the rowkey is monotonically increasing to make scanning easier.
> >
> > 2) (a better reason) I'm storing multiple types of data in the same table, and I have about 500TB of data in total. That's many billions of rows and many thousands of regions. I want to make sure that ingesting one type of data won't touch every region, which would cause a lot of fragments and merge operations. The rowkey is designed as <type>|<hash>|<id>.
> >
> > So either way I would want a dynamic split in my design.
> >
> > Jianshi
> >
> > On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez <[email protected]> wrote:
> >
> > > Jianshi,
> > >
> > > The retry is not expected behavior for the client. In fact, you don't want your clients to issue admin operations to the cluster ;)
> > >
> > > Shahab's option is the best alternative: poll until the number of regions in the table you want to modify has changed. The JIRA that Ted suggested requires modifying the core table operations to support sync operations, and it takes some major work to do it right. Ted's alternative of creating the splits at table creation time is the best option if you can pre-split, IMHO.
> > >
> > > If you could elaborate on the practical reasons you mention for creating those new regions synchronously, that would be great for us. Maybe it's related to multi-tenancy, but I'm just guessing :)
> > >
> > > esteban.
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > > On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu <[email protected]> wrote:
> > >
> > > > Jianshi:
> > > > See HBASE-11608 (Add synchronous split).
> > > >
> > > > bq. createTable does something special?
> > > >
> > > > Yes. See this in HBaseAdmin:
> > > >
> > > > public void createTable(final HTableDescriptor desc, byte[][] splitKeys)
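For completeness, a minimal sketch of pre-splitting through that overload (the table name, family, and split keys are placeholders):

    import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
    import org.apache.hadoop.hbase.client.HBaseAdmin
    import org.apache.hadoop.hbase.util.Bytes

    val admin = new HBaseAdmin(HBaseConfiguration.create())
    val desc = new HTableDescriptor(TableName.valueOf("ts_data"))
    desc.addFamily(new HColumnDescriptor("d"))
    // All regions are created up front, so there is no async split to wait on.
    val splitKeys: Array[Array[Byte]] =
      Array(Bytes.toBytes("a|"), Bytes.toBytes("b|"), Bytes.toBytes("c|"))
    admin.createTable(desc, splitKeys)
    admin.close()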
> > > > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang <[email protected]> wrote:
> > > >
> > > > > I see, Shahab. Async makes sense, but I'd prefer that the HBase client did the retry for me and let me specify a timeout parameter.
> > > > >
> > > > > One question: does that mean adding multiple splits to one region has to be done sequentially? How can I add region splits in parallel? Does createTable do something special?
> > > > >
> > > > > Jianshi
> > > > >
> > > > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus <[email protected]> wrote:
> > > > >
> > > > > > Split is an async operation. When you call it and the call returns, it does not mean that the region has been created yet.
> > > > > >
> > > > > > So either you wait for a while (using Thread.sleep), or you check the number of regions in a loop until it has increased to the value you want, and only then access the region. The former is not a good idea, though you can try it out just to make sure that this is indeed the issue.
> > > > > >
> > > > > > What I am suggesting is something like (pseudo code):
> > > > > >
> > > > > > while (new#regions <= old#regions) {
> > > > > >   new#regions = admin.getLatest#regions
> > > > > > }
> > > > > >
> > > > > > Regards,
> > > > > > Shahab
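A runnable version of that polling loop might look like this sketch (the timeout and poll interval are my additions, not anything HBase provides):

    import org.apache.hadoop.hbase.client.HBaseAdmin

    // Issue an async split, then poll until the region count grows or a deadline passes.
    def splitAndWait(admin: HBaseAdmin, table: Array[Byte],
                     splitPoint: Array[Byte], timeoutMs: Long = 60000L): Unit = {
      val oldCount = admin.getTableRegions(table).size
      admin.split(table, splitPoint) // returns before the split actually finishes
      val deadline = System.currentTimeMillis() + timeoutMs
      while (admin.getTableRegions(table).size <= oldCount) {
        if (System.currentTimeMillis() > deadline)
          throw new RuntimeException("split did not complete within " + timeoutMs + " ms")
        Thread.sleep(500) // poll interval
      }
    }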
> > > > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang <[email protected]> wrote:
> > > > > >
> > > > > > > I constantly get the following errors when I try to add splits to a table:
> > > > > > >
> > > > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > > > > > org.apache.hadoop.hbase.NotServingRegionException: Region grapple_vertices,cust|rval#7ffffeb7cffca280|1636500018299676757,1410945568484.e7743495366df3c82a8571b36c2bdac3. is not online on lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > > > > > >     at org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > > > > > >     at ...
> > > > > > >
> > > > > > > But when I check the region server (from HBase's web UI), the region is actually listed there.
> > > > > > >
> > > > > > > What does the error actually mean? How can I solve it?
> > > > > > >
> > > > > > > Currently I'm adding splits single-threaded, and I want to make it parallel. Is there anything I need to be careful about?
> > > > > > >
> > > > > > > Here's the code for adding splits:
> > > > > > >
> > > > > > > import scala.collection.JavaConversions._ // needed for regions.map below
> > > > > > >
> > > > > > > def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
> > > > > > >   val admin = new HBaseAdmin(conn)
> > > > > > >   try {
> > > > > > >     val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> > > > > > >     val regionStartKeys = regions.map(_.getStartKey)
> > > > > > >     // CAUTION: Seq#diff compares Array[Byte] by reference, so this
> > > > > > >     // never removes existing start keys; compare by content instead.
> > > > > > >     val splits = splitKeys.diff(regionStartKeys)
> > > > > > >     splits.foreach { splitPoint =>
> > > > > > >       admin.split(tableName.getBytes("UTF8"), splitPoint)
> > > > > > >     }
> > > > > > >     // NOTE: important!
> > > > > > >     admin.balancer()
> > > > > > >   } finally {
> > > > > > >     admin.close()
> > > > > > >   }
> > > > > > > }
> > > > > > >
> > > > > > > Any help is appreciated.
> > > > > > >
> > > > > > > --
> > > > > > > Jianshi Huang
> > > > > > >
> > > > > > > LinkedIn: jianshi
> > > > > > > Twitter: @jshuang
> > > > > > > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/