Hi Aaron,

I'll answer your questions directly first, but please see the bottom
part of this mail for important additional details.

You can specify the
"hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily" parameter
(referenced from your StackOverflow link) on the command line of your
CsvBulkLoadTool command -- my understanding is that this is a purely
client-side parameter. You would provide it via -D as follows:

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 \
        <other command-line parameters>

The important point in the above example is that config-based
parameters specified with -D are given after the class name to be run
and before the application-level parameters.

From my reading of the HBase code, you can also specify the
"hbase.hregion.max.filesize" parameter in the same way, as it is also
treated as a client-side parameter in this context.
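For example, a combined invocation might look like the following (the
10 GB value for hbase.hregion.max.filesize is purely illustrative --
tune it for your own cluster):

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 \
        -Dhbase.hregion.max.filesize=10737418240 \
        <other command-line parameters>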

As far as speeding things up, the main points to consider are:
- ensure that compression is enabled for MapReduce jobs on your
cluster -- particularly map-output (intermediate) compression -- see
https://datameer.zendesk.com/hc/en-us/articles/204258750-How-to-Use-Intermediate-and-Final-Output-Compression-MR1-YARN-
for a good overview
- check the ratio of map output records to spilled records in the
counters of the import job. If the spilled record count is much higher
than the map output record count (e.g. two or three times as high),
then you will probably benefit from raising the
mapreduce.task.io.sort.mb setting (see
https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml).
Both of these can also be supplied with -D, as sketched below.
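As a rough sketch (these are the standard Hadoop 2.x property names;
the Snappy codec and the 512 MB sort buffer are illustrative values,
not recommendations for your specific cluster):

    hadoop jar phoenix-<version>-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dmapreduce.map.output.compress=true \
        -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
        -Dmapreduce.task.io.sort.mb=512 \
        <other command-line parameters>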

Now, those are the answers to your questions, but I'm curious about
why you're getting more than 32 HFiles in a single column family of a
single region. I assume this means that you're loading a large amount
of data into a small number of regions. That is probably not a good
thing -- it may impact the performance of HBase in general (because
each region holds such a large amount of data), and it will also have
a very negative impact on the running time of your import job (because
part of the parallelism of the import job is determined by the number
of regions being written to). I don't think you mentioned how many
regions the table you're importing into has, but increasing the number
of regions (e.g. by pre-splitting the table, as sketched below) will
likely resolve several problems for you. Another reason to do this is
that HBase will likely start splitting your regions after this import
anyway, due to their size.
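For illustration only (the column definitions below follow the
standard TPC-H spec and may not match your actual DDL, and
SALT_BUCKETS = 64 is just an example value), a Phoenix table can be
created pre-split across regions with the SALT_BUCKETS option, or
with explicit split points via SPLIT ON:

    CREATE TABLE TPCH.LINEITEM (
        L_ORDERKEY BIGINT NOT NULL,
        L_PARTKEY BIGINT,
        L_SUPPKEY BIGINT,
        L_LINENUMBER INTEGER NOT NULL,
        L_QUANTITY DECIMAL(15,2),
        L_EXTENDEDPRICE DECIMAL(15,2),
        L_DISCOUNT DECIMAL(15,2),
        L_TAX DECIMAL(15,2),
        L_RETURNFLAG CHAR(1),
        L_LINESTATUS CHAR(1),
        L_SHIPDATE DATE,
        L_COMMITDATE DATE,
        L_RECEIPTDATE DATE,
        L_SHIPINSTRUCT CHAR(25),
        L_SHIPMODE CHAR(10),
        L_COMMENT VARCHAR(44),
        CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
    ) SALT_BUCKETS = 64;  -- spreads the row key space over 64 pre-split regions

Salting prefixes each row key with a bucket byte, so it also evens out
the write load across region servers during the bulk load itself.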

- Gabriel


On Thu, Aug 18, 2016 at 3:47 AM, Aaron Molitor
<amoli...@splicemachine.com> wrote:
> Hi all, I'm running the CsvBulkLoadTool to pull in some data. The
> MapReduce job appears to complete, and gives some promising information:
>
>
> ################################################################################
>         Phoenix MapReduce Import
>                 Upserts Done=600037902
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=79657289180
>         File Output Format Counters
>                 Bytes Written=176007436620
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles from 
> /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
> identifier=hconnection-0xa24982f connecting to ZooKeeper 
> ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
> connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 
> watcher=hconnection-0xa24982f0x0, 
> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt 
> to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to 
> stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete 
> on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 
> 0x15696476bf90484, negotiated timeout = 40000
> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for 
> TPCH.LINEITEM from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
> 16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed connection 
> cannot be used for bulkload. Creating unmanaged connection.
> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
> identifier=hconnection-0x456a0752 connecting to ZooKeeper 
> ensemble=stl-colo-srv073.splicemachine.colo:2181
> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
> connectString=stl-colo-srv073.splicemachine.colo:2181 sessionTimeout=1200000 
> watcher=hconnection-0x456a07520x0, 
> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not attempt 
> to authenticate using SASL (unknown error)
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection established to 
> stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, initiating session
> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment complete 
> on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, sessionid = 
> 0x15696476bf90485, negotiated timeout = 40000
> 16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
> ################################################################################
>
> and eventually errors out with this exception.
>
> ################################################################################
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda
>  first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 
> last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728
>  first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 
> last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0
>  first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 
> last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f
>  first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 
> last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9
>  first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 
> last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34
>  first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 
> last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
> 16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load more 
> than 32 hfiles to family 0 of region with start key
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: 
> Closing master protocol: MasterService
> 16/08/17 20:37:07 INFO client.ConnectionManager$HConnectionImplementation: 
> Closing zookeeper sessionid=0x15696476bf90485
> 16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 closed
> 16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
> Exception in thread "main" java.io.IOException: Trying to load more than 32 
> hfiles to one family of one region
>         at 
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
>         at 
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
>         at 
> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
>         at 
> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
>         at 
> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
>         at 
> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at 
> org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> ################################################################################
>
> a count of the table shows 0 rows:
> 0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 0         |
> +-----------+
>
> Some quick googling turned up an HBase param that could be tweaked 
> (http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).
>
> Main Questions:
> - Will the CsvBulkLoadTool pick up these params, or will I need to put them 
> in hbase-site.xml?
> - Is there anything else I can tune to make this run quicker? It took 5 hours 
> for it to fail with the error above.
>
> This is a 9-node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 
> 4.8.0-HBase-1.1, with Ambari default settings except for:
> - HBase RS heap size is set to 24GB
> - hbase.rpc.timeout set to 20 min
> - phoenix.query.timeoutMs set to 60 min
>
> All nodes are Dell R420s with 2x E5-2430 v2 CPUs (24 vCPUs) and 64 GB RAM.