Thanks for the responses. In a nutshell, I'm trying to manage the splits myself and otherwise disable the usual region lifecycle stuff.
We have a number of HBase tables that are bulkloaded from scratch each day. Each of these tables consists of a single column family. The row keys all start with a hash, so it is easy to determine the split keys for an even distribution of the data across the cluster. Aside from the bulkload, no additional updates are made to these tables (i.e. once they are created, they are read-only until they are replaced the next day). Given the lifecycle of these tables, there seems to be no need for HBase to split or compact them. Agree?

I'm trying to configure the MR job so that it creates HFiles of exactly the size I want. The HFileOutputFormat2 RecordWriter uses the value of hbase.hregion.max.filesize, so it seemed like configureIncrementalLoad should be setting that to the value set on the table. I didn't want to change the global setting in hbase-site.xml, hence my desire to configure this in the job.

I've attempted to completely disable splitting by setting HREGION_MAX_FILESIZE to 100GB on the table descriptor. In case we happen to exceed that amount, I've also disabled compaction and set a split policy of ConstantSizeRegionSplitPolicy, with an algorithm of either UniformSplit or HexStringSplit depending on the particular table's row key.

What factors should be considered when deciding how large to make a single HFile, beyond making sure the block index cache fits in RS memory and enough of the block data cache is available to be useful given the expected table access patterns?

Is there any reason to allow more than one HFile per region in this scenario? Assume a RS serves N rows of a particular table, all in one region. Is there possibly an advantage to instead having that RS serve the same N rows, but in 2 regions of N/2 rows each?

Thanks,
-- Tom

On Fri, Jan 23, 2015 at 10:37 AM, Nick Dimiduk <[email protected]> wrote:

> Have a look at the code in
> HFileOutputFormat2#configureIncrementalLoad(Job, HTableDescriptor,
> RegionLocator).
> The HTableDescriptor is only used for details of writing blocks:
> compression, bloom filters, block size, and block encoding. Other table
> properties are left to the online table.
>
> I'm not sure what you're trying to accomplish by setting this value in
> your job. The size of the HFiles produced will be dependent on the data
> distribution characteristics of your MR job, not the online table. When
> you completeBulkLoad, those generated HFiles will be split according to
> the online table region boundaries, and loaded into the Regions. Once
> that happens, usual region lifecycle stuff picks up, meaning at that
> point, the online region will decide if/when to split based on store
> sizes.
>
> Hope that helps,
> -n
>
> On Fri, Jan 23, 2015 at 10:29 AM, Ted Yu <[email protected]> wrote:
>
> > Suppose the value used by bulk loading is different from that used by
> > region server, how would region server deal with two (or more) values
> > w.r.t. splitting?
> >
> > Cheers
> >
> > On Fri, Jan 23, 2015 at 10:15 AM, Tom Hood <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm bulkloading into an empty HBase table and have called
> > > HTableDescriptor.setMaxFileSize to override the global setting of
> > > HConstants.HREGION_MAX_FILESIZE (i.e. hbase.hregion.max.filesize).
> > >
> > > This newly configured table is then passed to
> > > HFileOutputFormat2.configureIncrementalLoad to set up the MR job to
> > > generate the HFiles. This already configures other properties based
> > > on the settings of the table it's given (e.g. compression, bloom,
> > > data encoding, splits, etc.). Is there a reason it does not also
> > > configure HREGION_MAX_FILESIZE based on its setting from
> > > HTableDescriptor.getMaxFileSize?
> > >
> > > Thanks,
> > > -- Tom
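[Editor's note] The point above about a hash prefix making even split keys easy to compute can be sketched with plain arithmetic. The class and method names below are illustrative, not HBase API; it mirrors the idea behind HexStringSplit for a fixed-width hex key prefix:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: evenly spaced split keys over a hexWidth-character
// hex key prefix, the same idea HexStringSplit implements inside HBase.
public class HexSplits {
    static List<String> splitKeys(int numRegions, int hexWidth) {
        // Size of the whole keyspace: 16^hexWidth possible prefixes.
        BigInteger max = BigInteger.valueOf(16).pow(hexWidth);
        BigInteger step = max.divide(BigInteger.valueOf(numRegions));
        // numRegions regions need numRegions - 1 boundary keys.
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            keys.add(String.format("%0" + hexWidth + "x",
                    step.multiply(BigInteger.valueOf(i))));
        }
        return keys;
    }

    public static void main(String[] args) {
        // 4 regions over an 8-hex-char prefix -> [40000000, 80000000, c0000000]
        System.out.println(splitKeys(4, 8));
    }
}
```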
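[Editor's note] For concreteness, the table setup described in this thread might look roughly like the following against the HBase 1.x client API. This is an unverified sketch, not a tested configuration: the table and family names are placeholders, and it assumes the 1.x HTableDescriptor setters discussed in the thread.

```java
// Sketch only (HBase 1.x client API assumed; not run against a cluster).
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("daily_table")); // placeholder name
htd.addFamily(new HColumnDescriptor("cf"));      // single column family, as described above
htd.setMaxFileSize(100L * 1024 * 1024 * 1024);   // 100 GB: effectively "never split"
htd.setCompactionEnabled(false);                 // table is rebuilt from scratch daily
htd.setRegionSplitPolicyClassName(
    "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");

// And the behavior the original question asks about -- carrying the table's
// max file size into the bulkload job -- would be one extra line after
// HFileOutputFormat2.configureIncrementalLoad(...):
job.getConfiguration().setLong(HConstants.HREGION_MAX_FILESIZE, htd.getMaxFileSize());
```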
