In the past, my struggles with HBase/Phoenix have been related to data ingest.
Each night we ingest a large volume of data via CsvBulkUpload.
After lots of trial and error trying to get our largest table to cooperate, I
found a primary key that distributes well if I specify the split criteria on
table creation.
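For context, the pre-split DDL looked roughly like this (table, column, and
split-point names here are placeholders, not my real schema):

```sql
-- Sketch of a Phoenix table pre-split at creation time so bulk-loaded
-- writes are spread across region servers from the start.
CREATE TABLE IF NOT EXISTS MY_TABLE (
    KEY_PART1  VARCHAR NOT NULL,
    KEY_PART2  DATE    NOT NULL,
    KEY_PART3  VARCHAR NOT NULL,
    VAL1       VARCHAR,
    VAL2       INTEGER,
    CONSTRAINT PK PRIMARY KEY (KEY_PART1, KEY_PART2, KEY_PART3)
)
SPLIT ON ('a', 'b', 'c');  -- one split point per desired region boundary
```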
That table now has ~15 billion rows representing about 300GB of data across 513
regions (on 9 region servers).
Life was good for a while.
Now I have a new use case that needs a very similar table, but rather than
serving UI-based reports, this one will be queried programmatically and VERY
heavily (millions of queries per day).
I have asked about this in the past, but got derailed to other things, so I'm
trying to zoom out a bit and make sure I approach this problem correctly.
My simplified use case is basically: de-dup input files against Phoenix before
passing them on to the rest of our ingest process. This will result in tens of
thousands of queries to Phoenix per input file.
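Concretely, each input record triggers a point-lookup existence check along
these lines (names hypothetical, matching the placeholder schema above):

```sql
-- De-dup probe: skip the input record if its key already exists.
-- One such query is issued per input record, so an input file of
-- tens of thousands of records means tens of thousands of lookups.
SELECT 1
FROM MY_TABLE
WHERE KEY_PART1 = ? AND KEY_PART2 = ? AND KEY_PART3 = ?
LIMIT 1;
```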
I noted in the past that after 5-10K rapid-fire queries, response times
degrade dramatically. I think we established that this is because one thread
is spawned per 20 MB chunk of data in each region (?)
More generally, it seems that the more regions my table has, the more
resource-intensive Phoenix queries become. Is that correct?
I estimate that my table will contain about 500GB of data by the end of 2016.
The rows are pretty small (like 6 or 8 small columns). I have 9 region servers
- soon to be 12.
The distribution is usually 2,000-5,000 rows per primary key, which is about
0.5 - 3 MB of data.
Given that information, is there a good rule of thumb for how many regions I
should try to target with my schema/primary key design?
I experimented with salt buckets (presumably letting Phoenix choose the split
points), but I keep getting errors when I try to bulk load data into salted
tables ("Import job on table blah failed due to exception:
java.io.IOException: Trying to load more than 32 hfiles to one family of one
region").
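In case it helps: my reading of the HBase docs is that this error comes from
a bulk-load cap that defaults to 32 hfiles per region per family, and that
the cap can be raised in hbase-site.xml. The property name below is my
understanding, so please correct me if I have it wrong:

```xml
<!-- hbase-site.xml: raise the per-region, per-family hfile limit
     enforced during bulk load (default is 32). -->
<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>128</value>
</property>
```

That said, I'd rather understand why so many hfiles land on one region of a
salted table than just raise the limit.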
Are there HBase configuration tweaks I should focus on? My current memstore
flush size is set to 256 MB.
Thanks for any guidance or tips here.