CsvBulkLoadTool with ~75GB file

Aaron Molitor Wed, 17 Aug 2016 18:48:06 -0700

Hi all, I'm running the CsvBulkLoadTool to pull in some data. The 
MapReduce job appears to complete and reports some promising counters:



################################################################################
        Phoenix MapReduce Import
                Upserts Done=600037902
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=79657289180
        File Output Format Counters 
                Bytes Written=176007436620
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles from 
/tmp/66f905f4-3d62-45bf-85fe-c247f518355c
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
identifier=hconnection-0xa24982f connecting to ZooKeeper 
ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 
watcher=hconnection-0xa24982f0x0, 
quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt to 
authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to 
stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on 
server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 
0x15696476bf90484, negotiated timeout = 40000
16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for 
TPCH.LINEITEM from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed connection 
cannot be used for bulkload. Creating unmanaged connection.
16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
identifier=hconnection-0x456a0752 connecting to ZooKeeper 
ensemble=stl-colo-srv073.splicemachine.colo:2181
16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 
watcher=hconnection-0x456a07520x0, 
quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt to 
authenticate using SASL (unknown error)
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to 
stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete on 
server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 
0x15696476bf90485, negotiated timeout = 40000
16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
################################################################################

The HFile load step then eventually errors out with this exception:

################################################################################
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda
 first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 
last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728
 first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 
last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0
 first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 
last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f
 first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 
last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9
 first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 
last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34
 first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 
last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load more 
than 32 hfiles to family 0 of region with start key 
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: 
Closing master protocol: MasterService
16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: 
Closing zookeeper sessionid=0x15696476bf90485
16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 closed
16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
        at org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
        at org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
################################################################################
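
For anyone reading the first=/last= values in the log above: they are row keys printed in HBase's escaped binary form (printable bytes as-is, everything else as \xHH). Below is a minimal Python sketch for decoding them. It assumes, and this is not stated anywhere in this post, that the TPCH.LINEITEM primary key is (L_ORDERKEY BIGINT, L_LINENUMBER INTEGER), which would match the 12-byte key length. The function names are made up for illustration.

```python
import re
import struct

# Matches either a \xHH escape or a single literal character.
_TOKEN = re.compile(r"\\x([0-9A-Fa-f]{2})|(.)", re.DOTALL)


def parse_escaped(text: str) -> bytes:
    """Turn a string such as r'\x80\x00g' into the raw bytes it represents."""
    out = bytearray()
    for match in _TOKEN.finditer(text):
        if match.group(1) is not None:
            out.append(int(match.group(1), 16))
        else:
            out.append(ord(match.group(2)))
    return bytes(out)


def decode_lineitem_key(text: str):
    """Decode a 12-byte key as (BIGINT, INTEGER).

    ASSUMPTION: the key is an 8-byte BIGINT followed by a 4-byte INTEGER.
    Phoenix stores these big-endian with the sign bit flipped so that the
    byte order matches the numeric order; subtracting 2**63 (or 2**31)
    from the unsigned value undoes that flip.
    """
    raw = parse_escaped(text)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte key, got %d bytes" % len(raw))
    big, small = struct.unpack(">QI", raw)
    return big - 2**63, small - 2**31


if __name__ == "__main__":
    # first= and last= of the first HFile listed in the log above.
    print(decode_lineitem_key(r"\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04"))
    print(decode_lineitem_key(r"\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03"))
```

If that key layout is right, each HFile covers a contiguous, non-overlapping range of order keys, and all of them are being offered to the single region whose start key is empty.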

A count of the table shows 0 rows:
0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
+-----------+
| COUNT(1)  |
+-----------+
| 0         |
+-----------+

Some quick googling turns up an HBase parameter that could be tweaked 
(http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).
 

Main Questions:
- Will the CsvBulkLoadTool pick up these params, or will I need to put them in 
hbase-site.xml? 
- Is there anything else I can tune to make this run quicker? It took 5 hours 
for it to fail with the error above.
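
To put the "5 hours" in perspective, here is a rough back-of-the-envelope calculation in Python using the counters from the job output above. The elapsed time is only my approximate figure, so treat the rates as ballpark numbers; the function name is just for illustration.

```python
# Counters copied from the MapReduce job output above.
UPSERTS_DONE = 600_037_902
BYTES_READ = 79_657_289_180
BYTES_WRITTEN = 176_007_436_620

# ASSUMPTION: "5 hours" is approximate wall-clock time for the whole run.
ELAPSED_SECONDS = 5 * 3600

GIB = 2**30
MIB = 2**20


def summarize(rows, bytes_read, bytes_written, elapsed_s):
    """Derive rough size and throughput figures from the job counters."""
    return {
        "input_gib": bytes_read / GIB,
        "output_gib": bytes_written / GIB,
        # How much larger the generated HFiles are than the CSV input.
        "output_to_input_ratio": bytes_written / bytes_read,
        "avg_input_bytes_per_row": bytes_read / rows,
        "rows_per_second": rows / elapsed_s,
        "input_mib_per_second": bytes_read / MIB / elapsed_s,
    }


if __name__ == "__main__":
    stats = summarize(UPSERTS_DONE, BYTES_READ, BYTES_WRITTEN, ELAPSED_SECONDS)
    for name, value in stats.items():
        print(f"{name}: {value:,.2f}")
```

The output-to-input ratio is the number I'd watch while tuning, since the HFiles written are noticeably larger than the CSV that was read.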

This is a 9-node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 
4.8.0-HBase-1.1, with Ambari default settings except for:
- HBase RS heap size is set to 24GB
- hbase.rpc.timeout set to 20 min
- phoenix.query.timeoutMs set to 60 min

All nodes are Dell R420s with 2x E5-2430 v2 CPUs (24 vCPUs) and 64GB RAM.
