Re: CsvBulkLoadTool with ~75GB file

2016-08-19 Thread James Taylor
Maybe this will help? http://phoenix.apache.org/bulk_dataload.html#Permissions_issues_when_uploading_HFiles
> I struggle to understand how to use split points in the create statement.
You can't always use split points - it depends on your schema and the knowledge you have about the data being
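
One way to get that kind of knowledge is to sample the leading key column of the input file and take roughly evenly spaced values as candidate split points. A rough sketch, assuming a pipe-delimited TPC-H style file whose first column is the leading primary key column (file name, delimiter, and column position are all assumptions):

    # sample ~100k rows, pull the first column, sort, and print ~10 evenly spaced
    # values to use as SPLIT ON points (reads the whole file once, so it takes a while)
    shuf -n 100000 lineitem.csv | cut -d'|' -f1 | sort -n | awk 'NR % 10000 == 0'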

Re: CsvBulkLoadTool with ~75GB file

2016-08-19 Thread John Leach
Aaron, Looks like a permission issue? org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint threw java.lang.IllegalStateException: Failed to get FileSystem instance java.lang.IllegalStateException: Failed to get FileSystem instance at
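
If this is a plain HDFS permissions problem (the HFiles are written as the user who submitted the job, and HBase then has to read and move them into place), a quick look at the intermediate output directory usually confirms it. A sketch only; the path assumes the tool was run with --output /tmp/lineitem-hfiles, which is an assumption (by default the tool picks a temporary directory):

    # check which user owns the generated HFiles and whether the hbase user can read them
    hdfs dfs -ls -R /tmp/lineitem-hfiles
    # crude and very permissive workaround if ownership turns out to be the problem
    hdfs dfs -chmod -R 777 /tmp/lineitem-hfiles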

Re: CsvBulkLoadTool with ~75GB file

2016-08-19 Thread John Leach
Gabriel, Thanks for the response; I appreciate it. I struggle to understand how to use split points in the create statement. (1) Creating a table with Split Points: CREATE TABLE stats.prod_metrics ( host char(50) not null, created_date date not null, txn_count bigint CONSTRAINT pk
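
The statement John is quoting appears to follow the pre-split example on the Phoenix site; written out in full, the general shape is sketched below. The SPLIT ON values are placeholders (they should be real prefixes of your leading key column), so treat this as a pattern rather than something to copy verbatim:

    -- pre-split at creation time: SPLIT ON takes literal values of the leading
    -- primary key column; 'CS', 'EU', 'NA' are placeholder host prefixes
    CREATE TABLE stats.prod_metrics (
        host         CHAR(50) NOT NULL,
        created_date DATE NOT NULL,
        txn_count    BIGINT
        CONSTRAINT pk PRIMARY KEY (host, created_date)
    ) SPLIT ON ('CS', 'EU', 'NA');

The split values have to match the type and sort order of the leading key column, which is why the choice depends so heavily on knowing your data.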

Re: CsvBulkLoadTool with ~75GB file

2016-08-19 Thread Gabriel Reid
Hi John, You can actually pre-split a table when creating it, either by specifying split points in the CREATE TABLE statement[1] or by using salt buckets[2]. In my current use cases I always use salting, but take a look at the salting documentation[2] for the pros and cons of this. Your
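
With the salting alternative the pre-splitting is implicit: SALT_BUCKETS=N prepends a hashed byte to each row key and creates N regions up front. A minimal sketch, assuming a trimmed-down LINEITEM key and a bucket count sized roughly to the cluster (both assumptions):

    -- salted table: Phoenix adds a leading hash byte and pre-splits into 48 regions;
    -- SALT_BUCKETS must be between 1 and 256
    CREATE TABLE LINEITEM (
        L_ORDERKEY   BIGINT NOT NULL,
        L_LINENUMBER INTEGER NOT NULL,
        L_QUANTITY   DECIMAL(15,2)
        CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
    ) SALT_BUCKETS = 48;

Salting spreads sequential writes across all buckets, but a range scan then has to run once per bucket, which is the kind of trade-off the salting docs walk through.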

Re: CsvBulkLoadTool with ~75GB file

2016-08-19 Thread Gabriel Reid
Hi Aaron, How many regions are there in the LINEITEM table? The fact that you needed to bump the hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily setting up to 48 suggests that the amount of data going into a single region of that table is probably pretty large. Along the same line, I
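
One quick way to answer the region-count question is to count LINEITEM's entries in hbase:meta. A rough sketch; it assumes the default namespace and that the Phoenix table maps to an HBase table of the same name:

    # each region of LINEITEM has a row in hbase:meta whose key starts with 'LINEITEM,'
    echo "scan 'hbase:meta', {COLUMNS => 'info:regioninfo', FILTER => \"PrefixFilter('LINEITEM,')\"}" \
        | hbase shell | grep -c '^ LINEITEM,'

The table detail page in the HBase Master web UI shows the same information without the shell gymnastics.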

Re: CsvBulkLoadTool with ~75GB file

2016-08-18 Thread Aaron Molitor
Gabriel, Thanks for the help, it's good to know that those params can be passed from the command line and that the order is important. I am trying to load the 100GB TPC-H data set and ultimately run the TPC-H queries. All of the tables loaded relatively easily except LINEITEM (the

Re: CsvBulkLoadTool with ~75GB file

2016-08-18 Thread Gabriel Reid
Hi Aaron, I'll answer your questions directly first, but please see the bottom part of this mail for important additional details. You can specify the "hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily" parameter (referenced from your StackOverflow link) on the command line of you
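
Concretely, the generic -D options generally need to come straight after the tool class name, before CsvBulkLoadTool's own flags, which is the ordering Aaron mentions above. A sketch of the invocation; the jar name, ZooKeeper quorum, and paths are placeholders for whatever the cluster actually uses:

    # -D properties first, then the tool-specific arguments
    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=48 \
        --table LINEITEM \
        --input /data/tpch/lineitem.csv \
        --zookeeper zk1,zk2,zk3:2181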

CsvBulkLoadTool with ~75GB file

2016-08-17 Thread Aaron Molitor
Hi all, I'm running the CsvBulkLoadTool trying to pull in some data. The MapReduce job appears to complete, and gives some promising information: Phoenix MapReduce Import Upserts