Hi John,

You can actually pre-split a table when creating it, either by
specifying split points in the CREATE TABLE statement[1] or by using
salt buckets[2]. In my current use cases I always use salting, but
take a look at the salting documentation[2] for the pros and cons of
this.
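
For example, a table can be pre-split at creation time along either of
these lines (a minimal sketch only -- the table, columns, split points
and bucket count are invented for illustration, not taken from your
schema):

    -- explicit split points in the CREATE TABLE statement
    CREATE TABLE IF NOT EXISTS EXAMPLE_IMPORT (
        ID   BIGINT NOT NULL PRIMARY KEY,
        VAL  VARCHAR
    ) SPLIT ON (1000000, 2000000, 3000000);

    -- or let Phoenix pre-split the table by salting the row key
    CREATE TABLE IF NOT EXISTS EXAMPLE_IMPORT (
        ID   BIGINT NOT NULL PRIMARY KEY,
        VAL  VARCHAR
    ) SALT_BUCKETS = 16;

With salting, a common rule of thumb is to choose a bucket count in the
neighborhood of the number of region servers in the cluster.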

Your approach as outlined is basically in line with how I would
load a large amount of data into Phoenix, taking into account that
the tables would be split into multiple regions at creation time using
salt buckets.

About doing ETL to create an aggregate table: personally, I would
probably do the aggregation outside of Phoenix (using MapReduce or
Pig or something
similar). This is of course assuming that the aggregate table is the
only thing that is needed (i.e. that the un-aggregated table only
exists to feed the aggregate table). This isn't to say that Phoenix
can't be used to create the aggregate table, but I'm pretty sure that
using HBase as a backing store for large scale ETL/aggregation like
this just won't be as fast as MapReduce.

However, if your use case does require using Phoenix for the
ETL/aggregation step, then your approach sounds correct.
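
In that case the aggregation step itself is just an UPSERT SELECT, along
the lines of this sketch (table and column names are invented for
illustration, and it assumes autocommit is enabled so rows are written
as the query runs):

    -- aggregate the raw import table into a pre-created aggregate table
    UPSERT INTO EXAMPLE_AGG (PART_ID, TOTAL_QTY)
    SELECT PART_ID, SUM(QUANTITY)
    FROM EXAMPLE_IMPORT
    GROUP BY PART_ID;

The aggregate table would be created (and pre-split or salted) in the
same way as the import table above.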

- Gabriel

1. http://phoenix.apache.org/language/index.html#create_table
2. http://phoenix.apache.org/salted.html

On Fri, Aug 19, 2016 at 3:32 PM, John Leach <jlea...@gmail.com> wrote:
> Gabriel,
>
> Do you guys provide pre-split mechanisms (sampling of import/query data, 
> splitting policies, etc.) or does the admin have to determine the split 
> points?
>
> I guess that begs the question of how you would do a basic ETL operation in 
> Phoenix?
>
> How would you do the following on a 100 gigs of data in the import table to 
> 50 gigs in the aggregate table?
>
> (1) Create import table.
> (2) Import data into that table.
> (3) create an aggregate table.
> (4) Insert data into aggregate table based on an aggregate of the imported 
> table.
>
> Here is what I am gathering from the conversation...
>
> (1) Create Import Table.
> (2) Perform a pre-split based on some information about the data or via some 
> split mechanism (If not, import via MapReduce does not scale).
> (3) Run MapReduce job for importing data (Other mechanism does not scale?)
> (4) Compaction/Statistics Operation on Import Table (If not, query will not 
> scale?)
> (5) Create Aggregate Table
> (6) Perform a pre-split based on some information about the Aggregate data or 
> via some split mechanism. (If not, insert will not scale?).
> (7) Run insert query?  If you had to run import on map reduce to scale, I 
> suspect the insert does MapReduce as well or is there some other mechanism 
> that would scale?
> (8) Compaction/Statistics Operation on Aggregate Table
>
> I really appreciate all the support.  We are trying to run a Phoenix TPCH 
> benchmark and are struggling a bit to understand the process.
>
> Regards,
> John Leach
>
>> On Aug 19, 2016, at 2:09 AM, Gabriel Reid <gabriel.r...@gmail.com> wrote:
>>
>> Hi Aaron,
>>
>> How many regions are there in the LINEITEM table? The fact that you
>> needed to bump the
>> hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily setting up to
>> 48 suggests that the amount of data going into a single region of that
>> table is probably pretty large.
>>
>> Along the same line, I believe regions are the initial unit of
>> parallelism (in the absence of statistics[1]) when running
>> aggregate queries in Phoenix. This means that if you have oversized
>> regions, you will have poor parallelism when running aggregate
>> queries, which could lead to RPC timeouts.
>>
>> From what I see in the log info that you provided, your count query
>> started at 14:14:06, and errored out at 14:34:15, which appears to be
>> in line with the 20 minute HBase RPC timeout. This appears to indicate
>> that a scan over a single region is taking more than 20 minutes, which
>> again looks to me to be an indicator of an oversized region. If
>> possible, I would look into splitting up your LINEITEM table into more
>> (smaller) regions, which should improve both import and query
>> performance.
>>
>> - Gabriel
>>
>> 1. http://phoenix.apache.org/update_statistics.html
>>
>> On Thu, Aug 18, 2016 at 5:22 PM, Aaron Molitor
>> <amoli...@splicemachine.com> wrote:
>>> Gabriel,
>>>
>>> Thanks for the help, it's good to know that those params can be passed from 
>>> the command line and that the order is important.
>>>
>>> I am trying to load the 100GB TPC-H data set and ultimately run the TPC-H 
>>> queries.  All of the tables loaded relatively easily except LINEITEM (the 
>>> largest), which required me to increase 
>>> hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily to 48.  After that 
>>> the file loaded.
>>>
>>> This brings me to my next question though.  What settings do I need to 
>>> change in order to count the [LINEITEM] table? At this point I have changed:
>>> - hbase.rpc.timeout set to 20 min
>>> - phoenix.query.timeoutMs set to 60 min
>>>
>>> I am still getting an error; it appears to be an RPC timeout, and as I have 
>>> mentioned, I have already moved to an uncomfortably high setting.  Are there 
>>> other settings I should be changing rather than the rpc.timeout?
>>>
>>> For reference, here's the full sqlline interaction, including the error:
>>> ################################################################################
>>> Latest phoenix error:
>>> splice@stl-colo-srv073 ~]$ /opt/phoenix/default/bin/sqlline.py 
>>> $(hostname):2181:/hbase-unsecure
>>> Setting property: [incremental, false]
>>> Setting property: [isolation, TRANSACTION_READ_COMMITTED]
>>> issuing: !connect 
>>> jdbc:phoenix:stl-colo-srv073.splicemachine.colo:2181:/hbase-unsecure none 
>>> none org.apache.phoenix.jdbc.PhoenixDriver
>>> Connecting to 
>>> jdbc:phoenix:stl-colo-srv073.splicemachine.colo:2181:/hbase-unsecure
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in 
>>> [jar:file:/opt/phoenix/apache-phoenix-4.8.0-HBase-1.1-bin/phoenix-4.8.0-HBase-1.1-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in 
>>> [jar:file:/usr/hdp/2.4.2.0-258/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
>>> explanation.
>>> 16/08/18 14:14:06 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> 16/08/18 14:14:08 WARN shortcircuit.DomainSocketFactory: The short-circuit 
>>> local reads feature cannot be used because libhadoop cannot be loaded.
>>> Connected to: Phoenix (version 4.8)
>>> Driver: PhoenixEmbeddedDriver (version 4.8)
>>> Autocommit status: true
>>> Transaction isolation: TRANSACTION_READ_COMMITTED
>>> Building list of tables and columns for tab-completion (set fastconnect to 
>>> true to skip)...
>>> 147/147 (100%) Done
>>> Done
>>> sqlline version 1.1.9
>>> 0: jdbc:phoenix:stl-colo-srv073.splicemachine> select count(*) from 
>>> TPCH.LINEITEM;
>>> Error: org.apache.phoenix.exception.PhoenixIOException: Failed after 
>>> attempts=36, exceptions:
>>> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException: 
>>> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17 
>>> (state=08000,code=101)
>>> org.apache.phoenix.exception.PhoenixIOException: 
>>> org.apache.phoenix.exception.PhoenixIOException: Failed after attempts=36, 
>>> exceptions:
>>> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException: 
>>> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>>>
>>>        at 
>>> org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:111)
>>>        at 
>>> org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:774)
>>>        at 
>>> org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:720)
>>>        at 
>>> org.apache.phoenix.iterate.ConcatResultIterator.getIterators(ConcatResultIterator.java:50)
>>>        at 
>>> org.apache.phoenix.iterate.ConcatResultIterator.currentIterator(ConcatResultIterator.java:97)
>>>        at 
>>> org.apache.phoenix.iterate.ConcatResultIterator.next(ConcatResultIterator.java:117)
>>>        at 
>>> org.apache.phoenix.iterate.BaseGroupedAggregatingResultIterator.next(BaseGroupedAggregatingResultIterator.java:64)
>>>        at 
>>> org.apache.phoenix.iterate.UngroupedAggregatingResultIterator.next(UngroupedAggregatingResultIterator.java:39)
>>>        at 
>>> org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:778)
>>>        at sqlline.BufferedRows.<init>(BufferedRows.java:37)
>>>        at sqlline.SqlLine.print(SqlLine.java:1649)
>>>        at sqlline.Commands.execute(Commands.java:833)
>>>        at sqlline.Commands.sql(Commands.java:732)
>>>        at sqlline.SqlLine.dispatch(SqlLine.java:807)
>>>        at sqlline.SqlLine.begin(SqlLine.java:681)
>>>        at sqlline.SqlLine.start(SqlLine.java:398)
>>>        at sqlline.SqlLine.main(SqlLine.java:292)
>>> Caused by: java.util.concurrent.ExecutionException: 
>>> org.apache.phoenix.exception.PhoenixIOException: Failed after attempts=36, 
>>> exceptions:
>>> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException: 
>>> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>>>
>>>        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>        at java.util.concurrent.FutureTask.get(FutureTask.java:202)
>>>        at 
>>> org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:769)
>>>        ... 15 more
>>> Caused by: org.apache.phoenix.exception.PhoenixIOException: Failed after 
>>> attempts=36, exceptions:
>>> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException: 
>>> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>>>
>>>        at 
>>> org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:111)
>>>        at 
>>> org.apache.phoenix.iterate.TableResultIterator.initScanner(TableResultIterator.java:174)
>>>        at 
>>> org.apache.phoenix.iterate.TableResultIterator.next(TableResultIterator.java:124)
>>>        at 
>>> org.apache.phoenix.iterate.SpoolingResultIterator.<init>(SpoolingResultIterator.java:139)
>>>        at 
>>> org.apache.phoenix.iterate.SpoolingResultIterator.<init>(SpoolingResultIterator.java:97)
>>>        at 
>>> org.apache.phoenix.iterate.SpoolingResultIterator.<init>(SpoolingResultIterator.java:69)
>>>        at 
>>> org.apache.phoenix.iterate.SpoolingResultIterator$SpoolingResultIteratorFactory.newIterator(SpoolingResultIterator.java:92)
>>>        at 
>>> org.apache.phoenix.iterate.ParallelIterators$1.call(ParallelIterators.java:114)
>>>        at 
>>> org.apache.phoenix.iterate.ParallelIterators$1.call(ParallelIterators.java:106)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>        at 
>>> org.apache.phoenix.job.JobManager$InstrumentedJobFutureTask.run(JobManager.java:183)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>        at java.lang.Thread.run(Thread.java:745)
>>> Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed 
>>> after attempts=36, exceptions:
>>> Thu Aug 18 14:34:15 UTC 2016, null, java.net.SocketTimeoutException: 
>>> callTimeout=60000, callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>>>
>>>        at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:199)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
>>>        at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>>>        at 
>>> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
>>>        at 
>>> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
>>>        at 
>>> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
>>>        at 
>>> org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
>>>        at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
>>>        at 
>>> org.apache.phoenix.iterate.TableResultIterator.initScanner(TableResultIterator.java:170)
>>>        ... 12 more
>>> Caused by: java.net.SocketTimeoutException: callTimeout=60000, 
>>> callDuration=1200310: row '' on table 'TPCH.LINEITEM' at 
>>> region=TPCH.LINEITEM,,1471407572920.656deb38db6555b3eaea71944fdfdbc9., 
>>> hostname=stl-colo-srv076.splicemachine.colo,16020,1471495858713, seqNum=17
>>>        at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
>>>        at 
>>> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
>>>        ... 3 more
>>> Caused by: java.io.IOException: Call to 
>>> stl-colo-srv076.splicemachine.colo/10.1.1.176:16020 failed on local 
>>> exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27, 
>>> waitTime=1200001, operationTimeout=1200000 expired.
>>>        at 
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.wrapException(AbstractRpcClient.java:278)
>>>        at 
>>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1239)
>>>        at 
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:217)
>>>        at 
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:318)
>>>        at 
>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32831)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:373)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:200)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
>>>        at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
>>>        at 
>>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
>>>        at 
>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
>>>        ... 4 more
>>> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27, 
>>> waitTime=1200001, operationTimeout=1200000 expired.
>>>        at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
>>>        at 
>>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1213)
>>>        ... 14 more
>>> 0: jdbc:phoenix:stl-colo-srv073.splicemachine>
>>> ################################################################################
>>>
>>>> On Aug 18, 2016, at 02:15, Gabriel Reid <gabriel.r...@gmail.com> wrote:
>>>>
>>>> Hi Aaron,
>>>>
>>>> I'll answer your questions directly first, but please see the bottom
>>>> part of this mail for important additional details.
>>>>
>>>> You can specify the
>>>> "hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily" parameter
>>>> (referenced from your StackOverflow link) on the command line of your
>>>> CsvBulkLoadTool command -- my understanding is that this is a purely
>>>> client-side parameter. You would provide it via -D as follows:
>>>>
>>>>   hadoop jar phoenix-<version>-client.jar
>>>> org.apache.phoenix.mapreduce.CsvBulkLoadTool
>>>> -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=64 <other
>>>> command-line parameters>
>>>>
>>>> The important point in the above example is that config-based
>>>> parameters specified with -D are given before the application-level
>>>> parameters, and after the class name to be run.
>>>>
>>>> From my read of the HBase code, you can also specify the
>>>> "hbase.hregion.max.filesize" parameter in the same way (in this
>>>> context it's a client-side parameter).
>>>>
>>>> As far as speeding things up, the main points to consider are:
>>>> - ensure that compression is enabled for map-reduce jobs on your
>>>> cluster -- particularly map-output (intermediate) compression - see
>>>> https://datameer.zendesk.com/hc/en-us/articles/204258750-How-to-Use-Intermediate-and-Final-Output-Compression-MR1-YARN-
>>>> for a good overview
>>>> - check the ratio of map output records vs spilled records in the
>>>> counters on the import job. If the spilled records are higher than map
>>>> output records (e.g. twice as high or three times as high), then you
>>>> will probably benefit from raising the mapreduce.task.io.sort.mb
>>>> setting (see 
>>>> https://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml)
>>>>
>>>> Now those are the answers to your questions, but I'm curious about why
>>>> you're getting more than 32 HFiles in a single column family of a
>>>> single region. I assume that this means that you're loading large
>>>> amounts of data into a small number of regions. This is probably not a
>>>> good thing -- it may impact performance of HBase in general (because
>>>> each region has such a large amount of data), and will also have a
>>>> very negative impact on the running time of your import job (because
>>>> part of the parallelism of the import job is determined by the number
>>>> of regions being written to). I don't think you mentioned how many
>>>> regions you have on your table that you're importing to, but
>>>> increasing the number of regions will likely resolve several problems
>>>> for you. Another reason to do this is the fact that HBase will likely
>>>> start splitting your regions after this import due to their size.
>>>>
>>>> - Gabriel
>>>>
>>>>
>>>> On Thu, Aug 18, 2016 at 3:47 AM, Aaron Molitor
>>>> <amoli...@splicemachine.com> wrote:
>>>>> Hi all, I'm running the CsvBulkLoadTool trying to pull in some data.  The 
>>>>> MapReduce Job appears to complete, and gives some promising information:
>>>>>
>>>>>
>>>>> ################################################################################
>>>>>       Phoenix MapReduce Import
>>>>>               Upserts Done=600037902
>>>>>       Shuffle Errors
>>>>>               BAD_ID=0
>>>>>               CONNECTION=0
>>>>>               IO_ERROR=0
>>>>>               WRONG_LENGTH=0
>>>>>               WRONG_MAP=0
>>>>>               WRONG_REDUCE=0
>>>>>       File Input Format Counters
>>>>>               Bytes Read=79657289180
>>>>>       File Output Format Counters
>>>>>               Bytes Written=176007436620
>>>>> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles 
>>>>> from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c
>>>>> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
>>>>> identifier=hconnection-0xa24982f connecting to ZooKeeper 
>>>>> ensemble=stl-colo-srv073.splicemachine.colo:2181
>>>>> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
>>>>> connectString=stl-colo-srv073.splicemachine.colo:2181 
>>>>> sessionTimeout=1200000 watcher=hconnection-0xa24982f0x0, 
>>>>> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
>>>>> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not 
>>>>> attempt to authenticate using SASL (unknown error)
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection 
>>>>> established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, 
>>>>> initiating session
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment 
>>>>> complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, 
>>>>> sessionid = 0x15696476bf90484, negotiated timeout = 40000
>>>>> 16/08/17 20:37:04 INFO mapreduce.AbstractBulkLoadTool: Loading HFiles for 
>>>>> TPCH.LINEITEM from /tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM
>>>>> 16/08/17 20:37:04 WARN mapreduce.LoadIncrementalHFiles: managed 
>>>>> connection cannot be used for bulkload. Creating unmanaged connection.
>>>>> 16/08/17 20:37:04 INFO zookeeper.RecoverableZooKeeper: Process 
>>>>> identifier=hconnection-0x456a0752 connecting to ZooKeeper 
>>>>> ensemble=stl-colo-srv073.splicemachine.colo:2181
>>>>> 16/08/17 20:37:04 INFO zookeeper.ZooKeeper: Initiating client connection, 
>>>>> connectString=stl-colo-srv073.splicemachine.colo:2181 
>>>>> sessionTimeout=1200000 watcher=hconnection-0x456a07520x0, 
>>>>> quorum=stl-colo-srv073.splicemachine.colo:2181, baseZNode=/hbase-unsecure
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Opening socket connection to 
>>>>> server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181. Will not 
>>>>> attempt to authenticate using SASL (unknown error)
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Socket connection 
>>>>> established to stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, 
>>>>> initiating session
>>>>> 16/08/17 20:37:04 INFO zookeeper.ClientCnxn: Session establishment 
>>>>> complete on server stl-colo-srv073.splicemachine.colo/10.1.1.173:2181, 
>>>>> sessionid = 0x15696476bf90485, negotiated timeout = 40000
>>>>> 16/08/17 20:37:06 INFO hfile.CacheConfig: CacheConfig:disabled
>>>>> ################################################################################
>>>>>
>>>>> and eventually errors out with this exception.
>>>>>
>>>>> ################################################################################
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/88b40cbbc4c841f99eae906af3b93cda
>>>>>  first=\x80\x00\x00\x00\x08\xB3\xE7\x84\x80\x00\x00\x04 
>>>>> last=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x03
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/de309e5c7b3841a6b4fd299ac8fa8728
>>>>>  first=\x80\x00\x00\x00\x15\xC1\x8Ee\x80\x00\x00\x01 
>>>>> last=\x80\x00\x00\x00\x16\xA0G\xA4\x80\x00\x00\x02
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/e7ed8bc150c9494b8c064a022b3609e0
>>>>>  first=\x80\x00\x00\x00\x09\x92\xAEg\x80\x00\x00\x04 
>>>>> last=\x80\x00\x00\x00\x0Aq\x85D\x80\x00\x00\x02
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/c35e01b66d85450c97da9bb21bfc650f
>>>>>  first=\x80\x00\x00\x00\x0F\xA9\xFED\x80\x00\x00\x04 
>>>>> last=\x80\x00\x00\x00\x10\x88\xD0$\x80\x00\x00\x03
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/b5904451d27d42f0bcb4c98a5b14f3e9
>>>>>  first=\x80\x00\x00\x00\x13%/\x83\x80\x00\x00\x01 
>>>>> last=\x80\x00\x00\x00\x14\x04\x08$\x80\x00\x00\x01
>>>>> 16/08/17 20:37:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> hfile=hdfs://stl-colo-srv073.splicemachine.colo:8020/tmp/66f905f4-3d62-45bf-85fe-c247f518355c/TPCH.LINEITEM/0/9d26e9a00e5149cabcb415c6bb429a34
>>>>>  first=\x80\x00\x00\x00\x06\xF6_\xE3\x80\x00\x00\x04 
>>>>> last=\x80\x00\x00\x00\x07\xD5 f\x80\x00\x00\x05
>>>>> 16/08/17 20:37:07 ERROR mapreduce.LoadIncrementalHFiles: Trying to load 
>>>>> more than 32 hfiles to family 0 of region with start key
>>>>> 16/08/17 20:37:07 INFO 
>>>>> client.ConnectionManager$HConnectionImplementation: Closing master 
>>>>> protocol: MasterService
>>>>> 16/08/17 20:37:07 INFO 
>>>>> client.ConnectionManager$HConnectionImplementation: Closing zookeeper 
>>>>> sessionid=0x15696476bf90485
>>>>> 16/08/17 20:37:07 INFO zookeeper.ZooKeeper: Session: 0x15696476bf90485 
>>>>> closed
>>>>> 16/08/17 20:37:07 INFO zookeeper.ClientCnxn: EventThread shut down
>>>>> Exception in thread "main" java.io.IOException: Trying to load more than 
>>>>> 32 hfiles to one family of one region
>>>>>       at 
>>>>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:420)
>>>>>       at 
>>>>> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:314)
>>>>>       at 
>>>>> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.completebulkload(AbstractBulkLoadTool.java:355)
>>>>>       at 
>>>>> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.submitJob(AbstractBulkLoadTool.java:332)
>>>>>       at 
>>>>> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.loadData(AbstractBulkLoadTool.java:270)
>>>>>       at 
>>>>> org.apache.phoenix.mapreduce.AbstractBulkLoadTool.run(AbstractBulkLoadTool.java:183)
>>>>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>>>>>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>>>>>       at 
>>>>> org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:101)
>>>>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>       at 
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>       at 
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>       at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>       at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>>       at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>> ################################################################################
>>>>>
>>>>> A count of the table shows 0 rows:
>>>>> 0: jdbc:phoenix:srv073> select count(*) from TPCH.LINEITEM;
>>>>> +-----------+
>>>>> | COUNT(1)  |
>>>>> +-----------+
>>>>> | 0         |
>>>>> +-----------+
>>>>>
>>>>> Some quick googling gives an hbase param that could be tweaked 
>>>>> (http://stackoverflow.com/questions/24950393/trying-to-load-more-than-32-hfiles-to-one-family-of-one-region).
>>>>>
>>>>> Main Questions:
>>>>> - Will the CsvBulkLoadTool pick up these params, or will I need to put 
>>>>> them in hbase-site.xml?
>>>>> - Is there anything else I can tune to make this run quicker? It took 5 
>>>>> hours for it to fail with the error above.
>>>>>
>>>>> This is a 9 node (8 RegionServer) cluster running HDP 2.4.2 and Phoenix 
>>>>> 4.8.0-HBase-1.1
>>>>> Ambari default settings except for:
>>>>> - HBase RS heap size is set to 24GB
>>>>> - hbase.rpc.timeout set to 20 min
>>>>> - phoenix.query.timeoutMs set to 60 min
>>>>>
>>>>> all nodes are Dell R420 with 2xE5-2430 v2 CPUs (24vCPU), 64GB RAM
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>
