The MR bulk load tool sounds promising. Is there a link that provides some instructions?
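For reference, the MapReduce bulk loader being suggested is Phoenix's CsvBulkLoadTool. A hypothetical invocation might look like the following (the jar version, table name, and input path are placeholders); its --input argument is an HDFS file or directory, not a Hive table:

```
# Sketch only: jar name/version, table, and paths are placeholders.
# --input points at an HDFS CSV file or directory.
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /hdfs/path/to/csv-dir
```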
Does it take an HDFS folder as input? Or a Hive table? Thanks!

From: Puneet Kumar Ojha [mailto:[email protected]]
Sent: Monday, June 15, 2015 7:10 AM
To: [email protected]
Subject: RE: Guidance on table splitting

Can you provide the queries you would be running on your table? Also, use the MR bulk load instead of the CSV load tool.

From: Riesland, Zack [mailto:[email protected]]
Sent: Monday, June 15, 2015 4:03 PM
To: [email protected]
Subject: Guidance on table splitting

I'm new to HBase and to Phoenix. I needed to build a GUI off of a huge data set from HDFS, so I decided to create a couple of Phoenix tables, dump the data in using the CSV bulk load tool, and serve the GUI from there.

This all 'works', but as the data set grows, I would like to improve my table design.

Currently, I have a table (6 region servers) with about 8 billion rows and about a dozen columns. The primary key is a combination of customer code (4 letters) and serial number (8-12 digits). So I split the table with the idea of creating 2-3 regions per starting letter:

SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ', 'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ', 'LF', 'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ', 'SK', 'SZ', 'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');

This performs somewhat better than if I just create the table and give no guidance to Phoenix, but I'm wondering whether I could do better. Key-based queries are very fast, but data ingest is surprisingly slow: ingesting 1 billion rows takes on the order of hours.

When I look at the stats, this table has a fairly skewed distribution of regions across the 6 region servers: something like 15, 13, 13, 3, 3, and 2.

Can anyone give me some guidance on how to improve this design? Really, any suggestions at this point would be much appreciated, as I'm just getting started.
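The "2-3 regions per starting letter" scheme above can be sketched programmatically. This is a hypothetical helper (the function name and the evenly spaced second characters are assumptions; the original list was clearly hand-tuned, e.g. 'LF' and 'WW'):

```python
# Hypothetical sketch: generate a fixed number of split points per
# starting letter, for a rowkey beginning with a 4-letter customer code.
import string

def split_points(per_letter=2):
    """Return evenly spaced two-character split points, per_letter per letter."""
    points = []
    step = 26 // per_letter
    for letter in string.ascii_uppercase:
        # Divide the second-character range A..Z into per_letter buckets.
        for i in range(1, per_letter + 1):
            second = chr(ord('A') + min(i * step, 25))
            points.append(letter + second)
    return points

print(split_points()[:6])  # ['AN', 'AZ', 'BN', 'BZ', 'CN', 'CZ']
```

A list like this could then be pasted into the CREATE TABLE ... SPLIT ON (...) clause; whether evenly spaced points help depends entirely on how the customer codes are actually distributed, which is why the hand-tuned list above skews some boundaries.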
