Thanks for the responses. In a nutshell, I'm trying to manage the splits myself and otherwise disable the usual region lifecycle stuff.
We have a number of HBase tables that are bulkloaded from scratch each day. Each of these tables consists of a single column family. The row keys all start with a hash, so it is easy to determine the split keys for an even distribution of the data across the cluster. Aside from the bulkload, no additional updates are made to these tables (i.e. once they are created, they are read-only until they are replaced the next day). Given the lifecycle of these tables, there seems to be no need for HBase to split or compact them. Agree?

I'm trying to configure the MR job so that it creates HFiles of exactly the size I want. The HFileOutputFormat2 RecordWriter uses the value of hbase.hregion.max.filesize, so it seemed like configureIncrementalLoad should be setting that to the value set on the table. I didn't want to change the global setting in hbase-site.xml, hence my desire to configure this in the job.

I've attempted to completely disable splitting by setting HREGION_MAX_FILESIZE to 100GB on the table descriptor. In case we happen to exceed that amount, I've also disabled compaction and set a split policy of ConstantSizeRegionSplitPolicy, with an algorithm of either UniformSplit or HexStringSplit depending on the particular table's row key.

What factors should be considered when deciding how large to make a single HFile, beyond making sure the block index cache fits in RS memory and enough of the block data cache is available to be useful given the expected table access patterns?

Is there any reason to allow more than one HFile per region in this scenario? Assume a RS serves N rows of a particular table, all in one region. Is there possibly an advantage to instead having that RS serve the same N rows, but in 2 regions of N/2 rows each?

Thanks,
-- Tom

On Fri, Jan 23, 2015 at 10:37 AM, Nick Dimiduk <[email protected]> wrote:

> Have a look at the code in
> HFileOutputFormat2#configureIncrementalLoad(Job, HTableDescriptor,
> RegionLocator).
> The HTableDescriptor is only used for details of writing blocks:
> compression, bloom filters, block size, and block encoding. Other table
> properties are left to the online table.
>
> I'm not sure what you're trying to accomplish by setting this value in
> your job. The size of the HFiles produced will be dependent on the data
> distribution characteristics of your MR job, not the online table. When
> you completeBulkLoad, those generated HFiles will be split according to
> the online table region boundaries, and loaded into the Regions. Once
> that happens, usual region lifecycle stuff picks up, meaning at that
> point, the online region will decide if/when to split based on store
> sizes.
>
> Hope that helps,
> -n
>
> On Fri, Jan 23, 2015 at 10:29 AM, Ted Yu <[email protected]> wrote:
>
> > Suppose the value used by bulk loading is different from that used by
> > region server, how would region server deal with two (or more) values
> > w.r.t. splitting?
> >
> > Cheers
> >
> > On Fri, Jan 23, 2015 at 10:15 AM, Tom Hood <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm bulkloading into an empty HBase table and have called
> > > HTableDescriptor.setMaxFileSize to override the global setting of
> > > HConstants.HREGION_MAX_FILESIZE (i.e. hbase.hregion.max.filesize).
> > >
> > > This newly configured table is then passed to
> > > HFileOutputFormat2.configureIncrementalLoad to set up the MR job to
> > > generate the HFiles. This already configures other properties based
> > > on the settings of the table it's given (e.g. compression, bloom,
> > > data encoding, splits, etc.). Is there a reason it does not also
> > > configure HREGION_MAX_FILESIZE based on its setting from
> > > HTableDescriptor.getMaxFileSize?
> > >
> > > Thanks,
> > > -- Tom
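[Editor's note] The point above about a hash prefix making even split keys easy to compute can be sketched with plain arithmetic. The class and method names below are illustrative, not HBase API; it mirrors the idea behind HexStringSplit for a fixed-width hex key prefix:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: evenly spaced split keys over a hexWidth-character
// hex key prefix, the same idea HexStringSplit implements inside HBase.
public class HexSplits {
    static List<String> splitKeys(int numRegions, int hexWidth) {
        // Size of the whole keyspace: 16^hexWidth possible prefixes.
        BigInteger max = BigInteger.valueOf(16).pow(hexWidth);
        BigInteger step = max.divide(BigInteger.valueOf(numRegions));
        // numRegions regions need numRegions - 1 boundary keys.
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            keys.add(String.format("%0" + hexWidth + "x",
                    step.multiply(BigInteger.valueOf(i))));
        }
        return keys;
    }

    public static void main(String[] args) {
        // 4 regions over an 8-hex-char prefix -> [40000000, 80000000, c0000000]
        System.out.println(splitKeys(4, 8));
    }
}
```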
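[Editor's note] For concreteness, the table setup described in this thread might look roughly like the following against the HBase 1.x client API. This is an unverified sketch, not a tested configuration: the table and family names are placeholders, and it assumes the 1.x HTableDescriptor setters discussed in the thread.

```java
// Sketch only (HBase 1.x client API assumed; not run against a cluster).
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("daily_table")); // placeholder name
htd.addFamily(new HColumnDescriptor("cf"));      // single column family, as described above
htd.setMaxFileSize(100L * 1024 * 1024 * 1024);   // 100 GB: effectively "never split"
htd.setCompactionEnabled(false);                 // table is rebuilt from scratch daily
htd.setRegionSplitPolicyClassName(
    "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");

// And the behavior the original question asks about -- carrying the table's
// max file size into the bulkload job -- would be one extra line after
// HFileOutputFormat2.configureIncrementalLoad(...):
job.getConfiguration().setLong(HConstants.HREGION_MAX_FILESIZE, htd.getMaxFileSize());
```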
