I found the reason why reading HBase is so slow.  Although each
regionserver serves multiple regions of the table I'm reading, the number
of Spark workers allocated by Yarn is too low. I can see that the table has
dozens of regions spread over about 20 regionservers, but only two Spark
workers are allocated by Yarn. What is worse, the two workers run one after
the other, so the Spark job loses its parallelism.
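
For context, the job reads the table through the standard TableInputFormat
route. A minimal sketch of that read path is below (run from spark-shell in
yarn-client mode so sc already exists; the table name is a placeholder, not
my real one). As far as I understand, newAPIHadoopRDD creates one Spark
partition per HBase region, so the RDD itself should already have dozens of
partitions:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

  // newAPIHadoopRDD builds one Spark partition per input split, i.e. one
  // per HBase region, so this should print a few dozen for my table.
  val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  println("partitions: " + rdd.partitions.length)

So the bottleneck is not the number of partitions but the number of workers
available to run them.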

*So now the question is: why are only 2 workers allocated?*

The following is the log from the ApplicationMaster log UI, and we can see
that only 2 workers are allocated, on two nodes (*a04.jsepc.com* and
*b06.jsepc.com*):

(Showing 4096 bytes of the log; the first line below is truncated.)
erLauncher: ApplicationAttemptId: appattempt_1412731028648_0157_000001
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Registering the
ApplicationMaster
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Waiting for Spark driver to be
reachable.
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Driver now available:
a04.jsepc.com:56888
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Listen to driver: akka.tcp://
sp...@a04.jsepc.com:56888/user/CoarseGrainedScheduler
14/10/08 09:55:16 INFO yarn.WorkerLauncher: *Allocating 2 workers*.
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: *Will Allocate 2 worker
containers, each with 1408 memory*
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: Container request (host:
Any, priority: 1, capability: <memory:1408, vCores:1>
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: Container request (host:
Any, priority: 1, capability: <memory:1408, vCores:1>
14/10/08 09:55:20 INFO util.RackResolver: *Resolved a04.jsepc.com to /rack1*
14/10/08 09:55:20 INFO util.RackResolver: *Resolved b06.jsepc.com to /rack2*
14/10/08 09:55:20 INFO yarn.YarnAllocationHandler: Launching container
container_1412731028648_0157_01_000002 for on host a04.jsepc.com
14/10/08 09:55:20 INFO yarn.YarnAllocationHandler: Launching
WorkerRunnable. driverUrl: akka.tcp://
sp...@a04.jsepc.com:56888/user/CoarseGrainedScheduler,  workerHostname:
a04.jsepc.com
14/10/08 09:55:21 INFO yarn.YarnAllocationHandler: Launching container
container_1412731028648_0157_01_000003 for on host b06.jsepc.com
14/10/08 09:55:21 INFO yarn.YarnAllocationHandler: Launching
WorkerRunnable. driverUrl: akka.tcp://
sp...@a04.jsepc.com:56888/user/CoarseGrainedScheduler,  workerHostname:
b06.jsepc.com
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Starting Worker Container
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Starting Worker Container
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy:
yarn.client.max-nodemanagers-proxies : 500
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy:
yarn.client.max-nodemanagers-proxies : 500
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up
ContainerLaunchContext
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up
ContainerLaunchContext
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Preparing Local resources
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Preparing Local resources
14/10/08 09:55:21 INFO yarn.WorkerLauncher: All workers have launched.
14/10/08 09:55:21 INFO yarn.WorkerLauncher: Started progress reporter
thread - sleep time : 5000
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Prepared Local resources
Map(spark.jar -> resource { scheme: "hdfs" host: "jsepc-ns" port: -1 file:
"/user/root/.sparkStaging/application_1412731028648_0157/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar"
} size: 75288668 timestamp: 1412733307395 type: FILE visibility: PRIVATE)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Prepared Local resources
Map(spark.jar -> resource { scheme: "hdfs" host: "jsepc-ns" port: -1 file:
"/user/root/.sparkStaging/application_1412731028648_0157/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar"
} size: 75288668 timestamp: 1412733307395 type: FILE visibility: PRIVATE)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up worker with
commands: List($JAVA_HOME/bin/java -server  -XX:OnOutOfMemoryError='kill
%p' -Xms1024m -Xmx1024m  -Djava.io.tmpdir=$PWD/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
sp...@a04.jsepc.com:56888/user/CoarseGrainedScheduler 2 b06.jsepc.com 1 1>
<LOG_DIR>/stdout 2> <LOG_DIR>/stderr)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up worker with
commands: List($JAVA_HOME/bin/java -server  -XX:OnOutOfMemoryError='kill
%p' -Xms1024m -Xmx1024m  -Djava.io.tmpdir=$PWD/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
sp...@a04.jsepc.com:56888/user/CoarseGrainedScheduler 1 a04.jsepc.com 1 1>
<LOG_DIR>/stdout 2> <LOG_DIR>/stderr)
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy: *Opening
proxy : a04.jsepc.com:8041*
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy: *Opening
proxy : b06.jsepc.com:8041*


Here <http://pastebin.com/VhfmHPQe> is the log printed on the console while
the Spark job is running.
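
Two things in the AM log stand out to me. First, the memory figure adds up
if I assume the defaults: 1408 MB = 1024 MB of worker heap (matching the
-Xms1024m -Xmx1024m in the launch command above) + 384 MB of YARN memory
overhead, which I believe is the built-in overhead in this Spark version.
Second, if I'm reading the Spark 0.9 "Running on YARN" docs correctly, in
yarn-client mode the worker count is taken from SPARK_WORKER_INSTANCES (or
--num-workers when going through org.apache.spark.deploy.yarn.Client), and
the default is 2 -- exactly what I'm getting, so my guess is that I simply
never asked for more. As a quick sanity check from the driver, I count the
registered executors like this (assuming getExecutorMemoryStatus is
available in 0.9; the returned map also includes the driver itself in
yarn-client mode):

  // Executors that have registered their block managers with the driver;
  // subtract 1 because the map also contains the driver's own entry.
  val executors = sc.getExecutorMemoryStatus.size - 1
  println("registered executors: " + executors)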



2014-10-02 0:58 GMT+08:00 Vladimir Rodionov <vrodio...@splicemachine.com>:

> Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth
> upgrading to the latest version (which is 0.98 based).
>
> -Vladimir Rodionov
>
> On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> As far as I know, that feature is not in CDH 5.0.0
>>
>> FYI
>>
>> On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov <
>> vrodio...@splicemachine.com> wrote:
>>
>>> Using TableInputFormat is not the fastest way of reading data from
>>> HBase. Do not expect 100s of Mb per sec. You probably should take a look at
>>> M/R over HBase snapshots.
>>>
>>> https://issues.apache.org/jira/browse/HBASE-8369
>>>
>>> -Vladimir Rodionov
>>>
>>> On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao <xiaotao.cs....@gmail.com>
>>> wrote:
>>>
>>>> I can submit a MapReduce job reading that table. Its
>>>> processing rate is also a little slower than I expected, but not as slow
>>>> as Spark.
>>>>
>>>>
>>>>
>>
>
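
Regarding the snapshot-based reading suggested above: once I'm on an HBase
version that has HBASE-8369, my understanding is that the same
newAPIHadoopRDD call can be pointed at TableSnapshotInputFormat, which reads
the snapshot files directly from HDFS and bypasses the regionservers. A
rough, untested sketch of what I think that would look like (the snapshot
name and restore directory are placeholders):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
  import org.apache.hadoop.mapreduce.Job

  // Point the input format at a previously taken snapshot; the restore
  // directory is a scratch path on HDFS where reference files are materialized.
  val job = Job.getInstance(HBaseConfiguration.create())
  TableSnapshotInputFormat.setInput(job, "my_table_snapshot", new Path("/tmp/snapshot_restore"))

  val snapshotRdd = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])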
