Hi all,

I'm running the CsvBulkLoadTool to pull in some data. The MapReduce job appears to complete and gives some promising output:
################################################################################
        Phoenix MapReduce Import
                Upserts Done=600037902
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=79657289180
        File Output Format Counters
                Bytes Written=176007436620
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xa24982f connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 watcher=hconnection-0xa24982f0x0, quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt to authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 0x15696476bf90484, negotiated timeout = 40000
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for TPCH.LINEITEM from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed connection cannot be used for bulkload. Creating unmanaged connection.
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x456a0752 connecting to ZooKeeper ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 watcher=hconnection-0x456a07520x0, quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt to authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 0x15696476bf90485, negotiated timeout = 40000
16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
################################################################################

and eventually errors out with this exception.
################################################################################
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728 first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0 first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9 first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34 first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load more than 32 hfiles to family 0 of region with start key
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15696476bf90485
16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 closed
16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
################################################################################

A count of the table shows 0 rows:

0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
+-----------+
| COUNT(1)  |
+-----------+
| 0         |
+-----------+

Some quick googling turns up an HBase parameter that could be tweaked (http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).

Main Questions:
- Will the CsvBulkLoadTool pick up these params, or will I need to put them in hbase-site.xml?
- Is there anything else I can tune to make this run quicker? It took 5 hours for it to fail with the error above.

This is a 9 node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 4.8.0-HBase-1.1.

Ambari default settings except for:
- HBase RS heap size is set to 24GB
- hbase.rpc.timeout set to 20 min
- phoenix.query.timeoutMs set to 60 min

All nodes are Dell R420 with 2xE5-2430 v2 CPUs (24vCPU), 64GB RAM.
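
For reference, here is roughly what I understand the hbase-site.xml change from that Stack Overflow thread would look like. This is only a sketch: I'm assuming the parameter in question is hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (its default of 32 would match the limit in the error), and the value 1024 is an arbitrary placeholder, not a tested recommendation.

```xml
<!-- Sketch only: raises the per-region, per-family HFile limit that
     LoadIncrementalHFiles enforces during the bulk load step.
     The value 1024 is a placeholder chosen for illustration. -->
<property>
  <name>hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily</name>
  <value>1024</value>
</property>
```

I have not verified whether the setting needs to be visible to the client running the tool, to the RegionServers, or to both.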