Hi Zack,

The 32-HFile limit comes from the MAX_FILES_PER_REGION_PER_FAMILY configuration property, which defaults to 32 in LoadIncrementalHFiles. You can try updating your configuration with a larger value and see if that works.
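For example, a minimal sketch in Java (the string below is the key that the MAX_FILES_PER_REGION_PER_FAMILY constant maps to in that class; 128 is just an illustrative value, not a recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class RaiseBulkLoadLimit {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Key read by LoadIncrementalHFiles (MAX_FILES_PER_REGION_PER_FAMILY); default is 32.
            // 128 is only an example value; size it to the number of HFiles your job produces.
            conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128);
            // ... hand `conf` to whatever drives your bulk load ...
        }
    }

The same key should also work from hbase-site.xml or as a -D override on the bulk-load command line.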
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java#L116

Thanks,
Ravi

On Mon, Jan 18, 2016 at 9:57 AM, Riesland, Zack <[email protected]> wrote:

> In the past, my struggles with HBase/Phoenix have been related to data
> ingest.
>
> Each night, we ingest lots of data via CsvBulkUpload.
>
> After lots of trial and error trying to get our largest table to
> cooperate, I found a primary key that distributes well if I specify the
> split criteria on table creation.
>
> That table now has ~15 billion rows representing about 300 GB of data
> across 513 regions (on 9 region servers).
>
> Life was good for a while.
>
> Now I have a new use case where I need another, very similar table, but
> rather than serving UI-based reports, this table will be queried
> programmatically and VERY heavily (millions of queries per day).
>
> I have asked about this in the past, but got derailed to other things, so
> I'm trying to zoom out a bit and make sure I approach this problem
> correctly.
>
> My simplified use case is basically: de-dup input files against Phoenix
> before passing them on to the rest of our ingest process. This will result
> in tens of thousands of queries to Phoenix per input file.
>
> I noted in the past that after 5-10K rapid-fire queries, the response time
> drops dramatically. I think we established that this is because one thread
> is spawned per 20 MB chunk of data in each region (?)
>
> More generally, it seems that the more regions there are in my table, the
> more resource-intensive Phoenix queries become. Is that correct?
>
> I estimate that my table will contain about 500 GB of data by the end of
> 2016.
>
> The rows are pretty small (6 or 8 small columns). I have 9 region
> servers, soon to be 12.
>
> The distribution is usually 2,000-5,000 rows per primary key, which is
> about 0.5-3 MB of data.
>
> Given that information, is there a good rule of thumb for how many regions
> I should try to target with my schema/primary key design?
>
> I experimented with salt buckets (presumably letting Phoenix choose how
> to split everything), but I keep getting errors when I try to bulk load
> data into salted tables ("Import job on table blah failed due to exception:
> java.io.IOException: Trying to load more than 32 hfiles to one family of
> one region").
>
> Are there HBase configuration tweaks I should focus on? My current
> memstore size is set to 256 MB.
>
> Thanks for any guidance or tips here.
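(For reference, a rough sketch of the two table layouts discussed in the quoted question, issued over the standard Phoenix JDBC driver. The table and column names, split points, bucket count, and ZooKeeper quorum are all hypothetical placeholders, not anything from the thread.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SplitVsSalt {
        public static void main(String[] args) throws Exception {
            // Placeholder JDBC URL; point it at your ZooKeeper quorum.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement stmt = conn.createStatement()) {
                // Variant 1: explicit split points chosen from the known key distribution.
                stmt.execute("CREATE TABLE IF NOT EXISTS READINGS_SPLIT ("
                    + " SENSOR_ID VARCHAR NOT NULL, READ_TS DATE NOT NULL, VAL DOUBLE,"
                    + " CONSTRAINT PK PRIMARY KEY (SENSOR_ID, READ_TS))"
                    + " SPLIT ON ('C', 'H', 'M', 'R', 'W')");
                // Variant 2: salted table; Phoenix prepends a bucket byte to the row key
                // and pre-splits the table into SALT_BUCKETS regions.
                stmt.execute("CREATE TABLE IF NOT EXISTS READINGS_SALTED ("
                    + " SENSOR_ID VARCHAR NOT NULL, READ_TS DATE NOT NULL, VAL DOUBLE,"
                    + " CONSTRAINT PK PRIMARY KEY (SENSOR_ID, READ_TS))"
                    + " SALT_BUCKETS = 16");
            }
        }
    }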
