bq. RS service the same N rows, but in 2 regions of N/2 rows each

In the above case the client would be able to issue parallel scans on this server.
Cheers

On Fri, Jan 23, 2015 at 11:21 AM, Tom Hood <[email protected]> wrote:

Thanks for the responses. In a nutshell, I'm trying to manage the splits myself and otherwise disable the usual region lifecycle stuff.

We have a number of HBase tables that are bulkloaded from scratch each day. Each of these tables consists of a single column family. The row keys all start with a hash, so it is easy to determine the split keys for an even distribution of the data across the cluster.

Aside from the bulkload, no additional updates are made to these tables (i.e. once they are created, they are read-only until they are replaced the next day). Given the lifecycle of these tables, there seems to be no need for HBase to split or compact them. Agree?

I'm trying to configure the MR job so that it creates HFiles of exactly the size I want. The HFileOutputFormat2.RecordWriter is using the value from hbase.hregion.max.filesize, so it seemed like configureIncrementalLoad should be setting that to the value set on the table. I didn't want to change the global setting in hbase-site.xml, hence my desire to configure this in the job.

I've attempted to completely disable splitting by setting HREGION_MAX_FILESIZE to 100GB on the table descriptor, but in case we happen to exceed that amount, I've also disabled compaction and set a split policy of ConstantSizeRegionSplitPolicy and an algorithm of either UniformSplit or HexStringSplit, depending on the particular table's row key.

What factors should be considered to determine how large to make a single HFile, beyond making sure the block index cache fits in RS memory and that enough of the block data cache is available to be useful given the expected table access patterns?

Is there any reason to allow more than one HFile per region in this scenario?

Assume a RS services N rows of a particular table. All those N rows are in 1 region.
Is there possibly an advantage to instead having that RS service the same N rows, but in 2 regions of N/2 rows each?

Thanks,
-- Tom

On Fri, Jan 23, 2015 at 10:37 AM, Nick Dimiduk <[email protected]> wrote:

Have a look at the code in HFileOutputFormat2#configureIncrementalLoad(Job, HTableDescriptor, RegionLocator). The HTableDescriptor is only used for details of writing blocks: compression, bloom filters, block size, and block encoding. Other table properties are left to the online table.

I'm not sure what you're trying to accomplish by setting this value in your job. The size of the HFiles produced will be dependent on the data distribution characteristics of your MR job, not the online table. When you completeBulkLoad, those generated HFiles will be split according to the online table region boundaries, and loaded into the regions. Once that happens, the usual region lifecycle stuff picks up, meaning at that point the online region will decide if/when to split based on store sizes.

Hope that helps,
-n

On Fri, Jan 23, 2015 at 10:29 AM, Ted Yu <[email protected]> wrote:

Suppose the value used by bulk loading is different from that used by the region server; how would the region server deal with two (or more) values w.r.t. splitting?

Cheers

On Fri, Jan 23, 2015 at 10:15 AM, Tom Hood <[email protected]> wrote:

Hi,

I'm bulkloading into an empty HBase table and have called HTableDescriptor.setMaxFileSize to override the global setting of HConstants.HREGION_MAX_FILESIZE (i.e. hbase.hregion.max.filesize).

This newly configured table is then passed to HFileOutputFormat2.configureIncrementalLoad to set up the MR job to generate the HFiles. This already configures other properties based on the settings of the table it's given (e.g.
compression, bloom, data encoding, splits, etc.). Is there a reason it does not also configure HREGION_MAX_FILESIZE based on its setting from HTableDescriptor.getMaxFileSize?

Thanks,
-- Tom
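[Editor's note: for concreteness, the setup being discussed in this thread might be sketched as below. This is a hedged illustration against the HBase 1.x client API, not code from the thread; the table name `daily_table`, family `cf`, and variable names are invented, while the 100GB max file size, disabled compaction, and ConstantSizeRegionSplitPolicy come from Tom's description.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;
import org.apache.hadoop.mapreduce.Job;

public class DailyBulkLoadJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("daily_table");

        // Single column family, read-only after the daily bulkload.
        HTableDescriptor htd = new HTableDescriptor(name);
        htd.addFamily(new HColumnDescriptor("cf"));

        // Try to keep the region lifecycle out of the picture:
        // a max file size we never expect to hit, compactions off,
        // and a split policy driven purely by store size.
        htd.setMaxFileSize(100L * 1024 * 1024 * 1024); // 100 GB
        htd.setCompactionEnabled(false);
        htd.setRegionSplitPolicyClassName(ConstantSizeRegionSplitPolicy.class.getName());

        Job job = Job.getInstance(conf, "daily-bulkload");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(name)) {
            // Copies block-level settings (compression, bloom, block size,
            // encoding) from the descriptor and split points from the
            // locator -- but, per the thread, not hbase.hregion.max.filesize.
            HFileOutputFormat2.configureIncrementalLoad(job, htd, locator);
        }
    }
}
```

This is a configuration sketch only: resolving the RegionLocator needs a running cluster, and the point is merely to show where the descriptor settings enter the job configuration.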

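[Editor's note: Tom's remark that hash-prefixed row keys make it "easy to determine the split keys" can be made concrete with a small sketch in the spirit of HexStringSplit. Everything here is assumed for illustration: the 8-hex-character prefix width, the class name, and the method name.]

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    /**
     * Return numRegions - 1 split keys dividing the 8-hex-char keyspace
     * ["00000000", "ffffffff"] into roughly equal ranges, similar in
     * spirit to what HBase's HexStringSplit algorithm produces.
     */
    public static List<String> hexSplitKeys(int numRegions) {
        BigInteger range = BigInteger.ONE.shiftLeft(32); // 16^8 possible prefixes
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // i-th boundary at (i / numRegions) of the keyspace
            BigInteger boundary = range.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(numRegions));
            keys.add(String.format("%08x", boundary));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Split keys for a table pre-split into 4 regions
        System.out.println(hexSplitKeys(4)); // [40000000, 80000000, c0000000]
    }
}
```

Because the leading hash spreads rows uniformly over the hex keyspace, boundaries computed this way give each region roughly N/numRegions rows, which is what makes the daily pre-split predictable.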