bq. RS service the same N rows, but in 2 regions of N/2 rows each

In the above case the client would be able to issue parallel scans on this server.
Cheers

On Fri, Jan 23, 2015 at 11:21 AM, Tom Hood <[email protected]> wrote:

Thanks for the responses. In a nutshell, I'm trying to manage the splits myself and otherwise disable the usual region lifecycle stuff.

We have a number of HBase tables that are bulkloaded from scratch each day. Each of these tables consists of a single column family. The row keys all start with a hash, so it is easy to determine the split keys for an even distribution of the data across the cluster.

Aside from the bulkload, no additional updates are made to these tables (i.e. once they are created, they are read-only until they are replaced the next day). Given the lifecycle of these tables, there seems to be no need for HBase to split or compact them. Agree?

I'm trying to configure the MR job so that it creates HFiles of exactly the size I want. The HFileOutputFormat2.RecordWriter is using the value from hbase.hregion.max.filesize, so it seemed like configureIncrementalLoad should be setting that to the value set on the table. I didn't want to change the global setting in hbase-site.xml, hence my desire to configure this in the job.

I've attempted to completely disable splitting by setting HREGION_MAX_FILESIZE to 100GB on the table descriptor, but in case we happen to exceed that amount, I've also disabled compaction and set a split policy of ConstantSizeRegionSplitPolicy and an algorithm of either UniformSplit or HexStringSplit, depending on the particular table's row key.

What factors should be considered to determine how large to make a single HFile, beyond making sure the block index cache fits in RS memory and that enough of the block data cache is available to be useful given the expected table access patterns?

Is there any reason to allow more than one HFile per region in this scenario?

Assume a RS services N rows of a particular table. All those N rows are in 1 region.
Is there possibly an advantage to instead having that RS service the same N rows, but in 2 regions of N/2 rows each?

Thanks,
-- Tom

On Fri, Jan 23, 2015 at 10:37 AM, Nick Dimiduk <[email protected]> wrote:

Have a look at the code in HFileOutputFormat2#configureIncrementalLoad(Job, HTableDescriptor, RegionLocator). The HTableDescriptor is only used for details of writing blocks: compression, bloom filters, block size, and block encoding. Other table properties are left to the online table.

I'm not sure what you're trying to accomplish by setting this value in your job. The size of the HFiles produced will be dependent on the data distribution characteristics of your MR job, not the online table. When you completeBulkLoad, those generated HFiles will be split according to the online table region boundaries, and loaded into the regions. Once that happens, the usual region lifecycle stuff picks up, meaning at that point the online region will decide if/when to split based on store sizes.

Hope that helps,
-n

On Fri, Jan 23, 2015 at 10:29 AM, Ted Yu <[email protected]> wrote:

Suppose the value used by bulk loading is different from that used by the region server; how would the region server deal with two (or more) values w.r.t. splitting?

Cheers

On Fri, Jan 23, 2015 at 10:15 AM, Tom Hood <[email protected]> wrote:

Hi,

I'm bulkloading into an empty HBase table and have called HTableDescriptor.setMaxFileSize to override the global setting of HConstants.HREGION_MAX_FILESIZE (i.e. hbase.hregion.max.filesize).

This newly configured table is then passed to HFileOutputFormat2.configureIncrementalLoad to set up the MR job to generate the HFiles. This already configures other properties based on the settings of the table it's given (e.g.
compression, bloom, data encoding, splits, etc.). Is there a reason it does not also configure HREGION_MAX_FILESIZE based on its setting from HTableDescriptor.getMaxFileSize?

Thanks,
-- Tom
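[Editor's note: for concreteness, the setup being discussed in this thread might be sketched as below. This is a hedged illustration against the HBase 1.x client API, not code from the thread; the table name `daily_table`, family `cf`, and variable names are invented, while the 100GB max file size, disabled compaction, and ConstantSizeRegionSplitPolicy come from Tom's description.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;
import org.apache.hadoop.mapreduce.Job;

public class DailyBulkLoadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("daily_table");

        // Single column family, read-only after the daily bulkload.
        HTableDescriptor htd = new HTableDescriptor(name);
        htd.addFamily(new HColumnDescriptor("cf"));

        // Try to keep the region lifecycle out of the picture:
        // a max file size we never expect to hit, compactions off,
        // and a split policy driven purely by store size.
        htd.setMaxFileSize(100L * 1024 * 1024 * 1024); // 100 GB
        htd.setCompactionEnabled(false);
        htd.setRegionSplitPolicyClassName(ConstantSizeRegionSplitPolicy.class.getName());

        Job job = Job.getInstance(conf, "daily-bulkload");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(name)) {
            // Copies block-level settings (compression, bloom, block size,
            // encoding) from the descriptor and split points from the
            // locator -- but, per the thread, not hbase.hregion.max.filesize.
            HFileOutputFormat2.configureIncrementalLoad(job, htd, locator);
        }
    }
}
```

This is a configuration sketch only: resolving the RegionLocator needs a running cluster, and the point is merely to show where the descriptor settings enter the job configuration.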

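[Editor's note: Tom's remark that hash-prefixed row keys make it "easy to determine the split keys" can be made concrete with a small sketch in the spirit of HexStringSplit. Everything here is assumed for illustration: the 8-hex-character prefix width, the class name, and the method name.]

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    /**
     * Return numRegions - 1 split keys dividing the 8-hex-char keyspace
     * ["00000000", "ffffffff"] into roughly equal ranges, similar in
     * spirit to what HBase's HexStringSplit algorithm produces.
     */
    public static List<String> hexSplitKeys(int numRegions) {
        BigInteger range = BigInteger.ONE.shiftLeft(32); // 16^8 possible prefixes
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // i-th boundary at (i / numRegions) of the keyspace
            BigInteger boundary = range.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numRegions));
            keys.add(String.format("%08x", boundary));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Split keys for a table pre-split into 4 regions
        System.out.println(hexSplitKeys(4)); // [40000000, 80000000, c0000000]
    }
}
```

Because the leading hash spreads rows uniformly over the hex keyspace, boundaries computed this way give each region roughly N/numRegions rows, which is what makes the daily pre-split predictable.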