Hi Esteban,

I have two reasons to split dynamically:

1) I have a column family that stores time-series data for MapReduce tasks,
and the rowkey is monotonically increasing to make scanning easier.

2) (A better reason) I'm storing multiple types of data in the same table,
about 500TB in total. That's many billions of rows and many thousands of
regions. I want to make sure that ingesting one type of data doesn't touch
every region, which would cause a lot of fragments and merge operations.
The rowkey is designed as <type>|<hash>|<id>.
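To make the rowkey layout concrete, here is a minimal sketch in Java of how
keys of the form <type>|<hash>|<id> and per-type split points could be built.
It is an illustration only: the class and method names are hypothetical, and
it assumes the hash is a non-negative 64-bit value rendered as 16 lowercase
hex digits, so that byte order matches numeric order. The real keys in my
table have a slightly different hash format.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
final class TypeSplitKeys {

    // Builds a rowkey of the form <type>|<hash>|<id>.
    // Assumption: hash is non-negative and rendered as 16 hex digits.
    static byte[] rowKey(String type, long hash, String id) {
        String key = type + "|" + String.format("%016x", hash) + "|" + id;
        return key.getBytes(StandardCharsets.UTF_8);
    }

    // Returns (regions - 1) split points of the form <type>|<hash> that
    // divide the non-negative hash range for one type into equal parts.
    static List<byte[]> splitKeysFor(String type, int regions) {
        if (regions < 1) {
            throw new IllegalArgumentException("regions must be >= 1");
        }
        List<byte[]> keys = new ArrayList<>();
        long step = Long.MAX_VALUE / regions;
        for (int i = 1; i < regions; i++) {
            String key = type + "|" + String.format("%016x", step * i);
            keys.add(key.getBytes(StandardCharsets.UTF_8));
        }
        return keys;
    }
}
```

Because every split point starts with the type prefix, splitting for one type
only affects the regions that hold that type. The resulting keys could be
passed as a byte[][] to createTable at table creation time, or one at a time
to admin.split later on.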

So either way, I need dynamic splits in my design.
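Since split is asynchronous, what I am really after is the polling approach
with a timeout. A minimal sketch in Java of what I have in mind follows. The
class name, method name, and parameters are hypothetical, and the region
count is passed in as a supplier so the sketch does not depend on any
particular HBase client version.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of HBase.
final class RegionCountWaiter {

    // Polls regionCount until it reaches at least `expected`, or until
    // timeoutMillis has elapsed. Returns true if the count was reached,
    // false on timeout or if the waiting thread was interrupted.
    static boolean awaitRegionCount(IntSupplier regionCount, int expected,
                                    long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (regionCount.getAsInt() < expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the splits became visible
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

In practice the supplier would wrap something like the size of the list
returned by admin.getTableRegions(...) for the table, with its IOException
handled by the caller, and the next split would be issued only after this
method returns true.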

Jianshi


On Thu, Sep 18, 2014 at 2:39 AM, Esteban Gutierrez <[email protected]>
wrote:

> Jianshi,
>
> Retrying is not something the client is expected to do. In fact, you don't
> want your clients to issue admin operations to the cluster ;)
>
> Shahab's option is the best alternative: poll until the number of regions
> in the table has changed when you want to modify the splits dynamically.
> The JIRA that Ted suggested requires modifying the core table operations
> to support sync operations, and it takes some major work to do it right.
> Ted's alternative of creating the splits at table creation time is the
> best option if you can pre-split, IMHO.
>
> If you could elaborate more on the practical reasons you mention for
> creating those new regions synchronously, that would be great for us.
> Maybe it's related to multi-tenancy, but I'm just guessing :)
>
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
> On Wed, Sep 17, 2014 at 11:09 AM, Ted Yu <[email protected]> wrote:
>
> > Jianshi:
> > See HBASE-11608 Add synchronous split
> >
> > bq. createTable does something special?
> >
> > Yes. See this in HBaseAdmin:
> >
> >   public void createTable(final HTableDescriptor desc, byte [][] splitKeys)
> >
> > On Wed, Sep 17, 2014 at 10:58 AM, Jianshi Huang <[email protected]>
> > wrote:
> >
> > > I see, Shahab. Async makes sense, but I would prefer that the HBase
> > > client do the retry for me and let me specify a timeout parameter.
> > >
> > > One question: does that mean adding multiple splits to one region has
> > > to be done sequentially? How can I add region splits in parallel? Does
> > > createTable do something special?
> > >
> > >
> > > Jianshi
> > >
> > >
> > > On Wed, Sep 17, 2014 at 8:06 PM, Shahab Yunus <[email protected]>
> > > wrote:
> > >
> > > > Split is an async operation. When you call it and the call returns,
> > > > it does not mean that the region has been created yet.
> > > >
> > > > So either you wait for a while (using Thread.sleep), or you check the
> > > > number of regions in a loop until it has increased to the value you
> > > > want, and then access the region. The former is not a good idea,
> > > > though you can try it out just to make sure that this is indeed the
> > > > issue.
> > > >
> > > > What I am suggesting is something like (pseudo code):
> > > >
> > > > // keep polling until the region count has increased
> > > > while(new#regions <= old#regions)
> > > > {
> > > >    new#regions = admin.getLatest#regions
> > > > }
> > > >
> > > > Regards,
> > > > Shahab
> > > >
> > > > On Wed, Sep 17, 2014 at 5:39 AM, Jianshi Huang <[email protected]>
> > > > wrote:
> > > >
> > > > > I constantly get the following error when I try to add splits to a
> > > > > table.
> > > > >
> > > > > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException):
> > > > > org.apache.hadoop.hbase.NotServingRegionException: Region
> > > > > grapple_vertices,cust|rval#7ffffeb7cffca280|1636500018299676757,1410945568484.e7743495366df3c82a8571b36c2bdac3.
> > > > > is not online on lvshdc5dn0193.lvs.paypal.com,60020,1405014719359
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.splitRegion(HRegionServer.java:3818)
> > > > >         at
> > > > >
> > > > >
> > > > > But when I checked the region server (from HBase's web UI), the
> > > > > region is actually listed there.
> > > > >
> > > > > What does the error actually mean? How can I solve it?
> > > > >
> > > > > Currently I'm adding splits single-threaded, and I want to make it
> > > > > parallel. Is there anything I need to be careful about?
> > > > >
> > > > > Here's the code for adding splits:
> > > > >
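> > > > >   def addSplits(tableName: String, splitKeys: Seq[Array[Byte]]): Unit = {
> > > > >     val admin = new HBaseAdmin(conn)
> > > > >
> > > > >     try {
> > > > >       val regions = admin.getTableRegions(tableName.getBytes("UTF8"))
> > > > >       val regionStartKeys = regions.map(_.getStartKey)
> > > > >       // Array[Byte] equality is by reference, so compare contents
> > > > >       // (rather than using diff) to skip existing split points.
> > > > >       val splits = splitKeys.filterNot { key =>
> > > > >         regionStartKeys.exists(start => java.util.Arrays.equals(start, key))
> > > > >       }
> > > > >
> > > > >       // admin.split only requests the split; it returns before the
> > > > >       // new regions are online.
> > > > >       splits.foreach { splitPoint =>
> > > > >         admin.split(tableName.getBytes("UTF8"), splitPoint)
> > > > >       }
> > > > >       // NOTE: important!
> > > > >       admin.balancer()
> > > > >     }
> > > > >     finally {
> > > > >       admin.close()
> > > > >     }
> > > > >   }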
> > > > >
> > > > >
> > > > > Any help is appreciated.
> > > > >
> > > > > --
> > > > > Jianshi Huang
> > > > >
> > > > > LinkedIn: jianshi
> > > > > Twitter: @jshuang
> > > > > Github & Blog: http://huangjs.github.com/
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jianshi Huang
> > >
> > > LinkedIn: jianshi
> > > Twitter: @jshuang
> > > Github & Blog: http://huangjs.github.com/
> > >
> >
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
