In the past, my struggles with HBase/Phoenix have been related to data ingest.

Each night, we ingest lots of data via CsvBulkUpload.

After lots of trial and error trying to get our largest table to cooperate, I 
found a primary key that distributes well if I specify the split criteria on 
table creation.
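For anyone following along, pre-splitting at creation time looks roughly like this in Phoenix DDL (a minimal sketch; the table name, columns, and split points below are invented placeholders, not my actual schema):

```sql
-- Hypothetical example of specifying split points at table creation
CREATE TABLE IF NOT EXISTS EVENT_LOG (
    DEVICE_ID CHAR(8) NOT NULL,
    EVENT_TS  DATE NOT NULL,
    PAYLOAD   VARCHAR
    CONSTRAINT pk PRIMARY KEY (DEVICE_ID, EVENT_TS)
)
SPLIT ON ('B0000000', 'D0000000', 'F0000000');
```

Picking split points that match the real key distribution is what made the bulk loads spread evenly across region servers.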

That table now has ~15 billion rows representing about 300GB of data across 513 
regions (on 9 region servers).

Life was good for a while.

Now I have a new use case that needs another, very similar table, but rather 
than serving UI-based reports, this table will be queried programmatically and 
VERY heavily (millions of queries per day).

I have asked about this in the past, but got derailed to other things, so I'm 
trying to zoom out a bit and make sure I approach this problem correctly.

My simplified use case is basically: de-dup input files against Phoenix before 
passing them on to the rest of our ingest process. This will result in tens of 
thousands of queries to Phoenix per input file.
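One way I'm considering to keep the query count manageable is to batch the incoming keys into IN-list probes instead of issuing one SELECT per row. A minimal sketch of the batching logic (table and column names are placeholders; the actual execution would go through the Phoenix JDBC/ODBC driver, which I've left out here):

```python
def chunked_in_queries(table, key_col, keys, chunk_size=500):
    """Yield (sql, params) pairs that probe for already-ingested keys in batches.

    Turns N keys into ceil(N / chunk_size) parameterized IN-list queries,
    so tens of thousands of per-row lookups become a few dozen round trips.
    """
    for i in range(0, len(keys), chunk_size):
        batch = keys[i:i + chunk_size]
        placeholders = ", ".join("?" * len(batch))
        sql = "SELECT {} FROM {} WHERE {} IN ({})".format(
            key_col, table, key_col, placeholders)
        yield sql, batch
```

Keys returned by each probe already exist in Phoenix and get dropped from the input file before it moves on to the rest of the ingest pipeline.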

I noted in the past that after 5-10K rapid-fire queries, response times 
degrade dramatically. And I think we established that this is because Phoenix 
spawns one thread per ~20 MB chunk of data in each region (?)

More generally, it seems that the more regions there are in my table, the more 
resource-intensive Phoenix queries become?

Is that correct?
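If that per-chunk parallelism is driven by statistics guideposts, I assume the chunk size is tunable via `phoenix.stats.guidepost.width` (server-side, in hbase-site.xml; value in bytes). A sketch, assuming a Phoenix version with the stats collector enabled:

```xml
<!-- hbase-site.xml (region servers): bytes of data per guidepost,
     i.e. per parallel scan chunk; larger = fewer threads per query -->
<property>
  <name>phoenix.stats.guidepost.width</name>
  <value>104857600</value> <!-- e.g. 100 MB -->
</property>
```

I'd welcome confirmation on whether this is the right knob for the thread-per-chunk behavior.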

I estimate that my table will contain about 500GB of data by the end of 2016.

The rows are pretty small (like 6 or 8 small columns). I have 9 region servers 
- soon to be 12.

The distribution is usually 2,000-5,000 rows per primary key, which is about 
0.5-3 MB of data.

Given that information, is there a good rule of thumb for how many regions I 
should try to target with my schema/primary key design?

I experimented using salt buckets (presumably letting Phoenix choose how to 
split everything) but I keep getting errors when I try to bulk load data into 
salted tables ("Import job on table blah failed due to exception: 
java.io.IOException: Trying to load more than 32 hfiles to one family of one 
region").
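For the record, the "more than 32 hfiles" cap appears to come from HBase's bulk-load limit, which I believe can be raised in hbase-site.xml (an assumption on my part; the default for this property is 32, which matches the error):

```xml
<!-- hbase-site.xml: raise the per-region, per-family HFile cap
     that the bulk load job is hitting on salted tables -->
<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>128</value>
</property>
```

Though I suspect hitting that cap at all may be a sign the salt-bucket count and my input sizes don't play well together.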

Are there HBase configuration tweaks I should focus on? My current memstore 
size is set to 256 MB.
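For reference, by memstore size I mean the region flush threshold, which I have set like this (assuming `hbase.hregion.memstore.flush.size` is the relevant property):

```xml
<!-- hbase-site.xml: memstore flush threshold per region, in bytes -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
```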

Thanks for any guidance or tips here.
