MR Bulkload tool sounds promising.

Is there a link that provides some instructions?

Does it take an HDFS folder as input? Or a Hive table?

Thanks!

From: Puneet Kumar Ojha [mailto:[email protected]]
Sent: Monday, June 15, 2015 7:10 AM
To: [email protected]
Subject: RE: Guidance on table splitting

Can you provide the queries that you would be running on your table?

Also, use the MR Bulkload tool instead of the CSV load tool.



From: Riesland, Zack [mailto:[email protected]]
Sent: Monday, June 15, 2015 4:03 PM
To: [email protected]
Subject: Guidance on table splitting

I'm new to HBase and to Phoenix.

I needed to build a GUI off of a huge data set from HDFS, so I decided to 
create a couple of Phoenix tables, dump the data using the CSV bulk load tool, 
and serve the GUI from there.

This all 'works', but as the data set grows, I would like to improve my table 
design.

Currently, I have a table (6 region servers) with about 8 billion rows and 
about a dozen columns.

The primary key is a combination of customer code (4 letters) and serial number 
(8-12 digits).

So I split the table with the idea of creating 2-3 regions per starting letter:

SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ', 
'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ', 'LF', 
'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ', 'SK', 'SZ', 
'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');
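
To illustrate how these split points carve up the key space, here is a minimal Python sketch (not Phoenix code). It assumes that row keys begin with the customer code and sort as plain strings, and that each split point is the inclusive start key of the next region. The function name `region_index` and the sample keys are made up for illustration.

```python
from bisect import bisect_right

# Split points copied from the SPLIT ON clause above (already in sorted order).
SPLITS = [
    'AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ',
    'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ', 'LF',
    'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ', 'SK', 'SZ',
    'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ',
]


def region_index(row_key: str) -> int:
    """Return the 0-based index of the region that would hold row_key.

    Assumption: each split point is the inclusive start key of the next
    region, so a key equal to a split point lands in the region it starts.
    """
    return bisect_right(SPLITS, row_key)


# N split points produce N + 1 regions.
print("regions:", len(SPLITS) + 1)

# Hypothetical keys: a 4-letter customer code followed by a serial number.
for key in ('ABCD12345678', 'AMZN12345678', 'XYZA12345678'):
    print(key, '-> region', region_index(key))
```

Under these assumptions, every key starting with X, Y, or Z lands in the last region, and there is no split point for Q, so how evenly the rows spread depends on how the customer codes are actually distributed.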

This performs somewhat better than if I just create the table and give no 
guidance to Phoenix.

But I'm wondering if I could do better. Key-based queries are very fast, but 
data ingest is surprisingly slow. Ingesting 1 billion rows takes on the order 
of hours.

When I look at the stats, this table has a fairly skewed distribution of data 
across the 6 region servers: something like 15, 13, 13, 3, 3, and 2 regions 
per server.
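
As a quick sanity check on that skew, the arithmetic can be sketched in Python. The counts are the approximate per-server figures quoted above; the variable names are illustrative, and region counts are only a proxy for load, since regions can differ in size.

```python
# Approximate number of regions hosted by each of the 6 region servers.
regions_per_server = [15, 13, 13, 3, 3, 2]

total = sum(regions_per_server)
mean = total / len(regions_per_server)

# Ratio of the busiest server to the average; 1.0 would be perfectly even.
max_to_mean = max(regions_per_server) / mean

# Share of all regions held by the three busiest servers.
top3_share = sum(sorted(regions_per_server, reverse=True)[:3]) / total

print(f"total={total} mean={mean:.2f} "
      f"max/mean={max_to_mean:.2f} top3={top3_share:.0%}")
```

With these figures, three of the six servers host roughly 84% of the regions.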

Can anyone give me some guidance on how to improve this design?

Really, any suggestions at this point would be much appreciated, as I'm just 
getting started.

