Re: Execution stalls in LogisticRegressionWithSGD

Xiangrui Meng Wed, 09 Jul 2014 10:52:48 -0700

We have maven-enforcer-plugin defined in the pom. I don't know why it
didn't work for you. Could you try rebuild with maven2 and confirm
that there is no error message? If that is the case, please create a
JIRA for it. Thanks! -Xiangrui


On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar <reachb...@gmail.com> wrote:
> Xiangrui,
>
> Thanks for all the help in resolving this issue. The  cause turned out to
> bethe build environment rather than runtime configuration. The build process
> had picked up maven2 while building spark. Using binaries that were rebuilt
> using m3, the entire processing went through fine. While I'm aware that the
> build instruction page specifies m3 as the min requirement, declaratively
> preventing accidental m2 usage (e.g. through something like the maven
> enforcer plugin?) might help other developers avoid such issues.
>
> -Bharath
>
>
>
> On Mon, Jul 7, 2014 at 9:43 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> It seems to me a setup issue. I just tested news20.binary (1355191
>> features) on a 2-node EC2 cluster and it worked well. I added one line
>> to conf/spark-env.sh:
>>
>> export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 "
>>
>> and launched spark-shell with "--driver-memory 20g". Could you re-try
>> with an EC2 setup? If it still doesn't work, please attach all your
>> code and logs.
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Jul 6, 2014 at 1:35 AM, Bharath Ravi Kumar <reachb...@gmail.com>
>> wrote:
>> > Hi Xiangrui,
>> >
>> > 1) Yes, I used the same build (compiled locally from source) to the host
>> > that has (master, slave1) and the second host with slave2.
>> >
>> > 2) The execution was successful when run in local mode with reduced
>> > number
>> > of partitions. Does this imply issues communicating/coordinating across
>> > processes (i.e. driver, master and workers)?
>> >
>> > Thanks,
>> > Bharath
>> >
>> >
>> >
>> > On Sun, Jul 6, 2014 at 11:37 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> Hi Bharath,
>> >>
>> >> 1) Did you sync the spark jar and conf to the worker nodes after build?
>> >> 2) Since the dataset is not large, could you try local mode first
>> >> using `spark-summit --driver-memory 12g --master local[*]`?
>> >> 3) Try to use less number of partitions, say 5.
>> >>
>> >> If the problem is still there, please attach the full master/worker log
>> >> files.
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Fri, Jul 4, 2014 at 12:16 AM, Bharath Ravi Kumar
>> >> <reachb...@gmail.com>
>> >> wrote:
>> >> > Xiangrui,
>> >> >
>> >> > Leaving the frameSize unspecified led to an error message (and
>> >> > failure)
>> >> > stating that the task size (~11M) was larger. I hence set it to an
>> >> > arbitrarily large value ( I realize 500 was unrealistic & unnecessary
>> >> > in
>> >> > this case). I've now set the size to 20M and repeated the runs. The
>> >> > earlier
>> >> > runs were on an uncached RDD. Caching the RDD (and setting
>> >> > spark.storage.memoryFraction=0.5) resulted in marginal speed up of
>> >> > execution, but the end result remained the same. The cached RDD size
>> >> > is
>> >> > as
>> >> > follows:
>> >> >
>> >> > RDD Name    Storage Level                        Cached Partitions
>> >> > Fraction Cached    Size in Memory    Size in Tachyon        Size on
>> >> > Disk
>> >> > 1084         Memory Deserialized 1x Replicated     80
>> >> > 100%         165.9 MB             0.0 B                 0.0 B
>> >> >
>> >> >
>> >> >
>> >> > The corresponding master logs were:
>> >> >
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor
>> >> > app-20140704062238-0033/1
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor
>> >> > app-20140704062238-0033/2
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:34 INFO Master: Removing executor
>> >> > app-20140704062238-0033/0
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:34 INFO Master: Launching executor
>> >> > app-20140704062238-0033/3
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor
>> >> > app-20140704062238-0033/2
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor
>> >> > app-20140704062238-0033/4
>> >> > on worker worker-20140630124441-slave1-40182
>> >> > 14/07/04 06:29:37 INFO Master: Removing executor
>> >> > app-20140704062238-0033/3
>> >> > because it is EXITED
>> >> > 14/07/04 06:29:37 INFO Master: Launching executor
>> >> > app-20140704062238-0033/5
>> >> > on worker worker-20140630102913-slave2-44735
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 INFO Master: Removing app app-20140704062238-0033
>> >> > 14/07/04 06:29:39 INFO LocalActorRef: Message
>> >> > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>> >> > from
>> >> > Actor[akka://sparkMaster/deadLetters] to
>> >> >
>> >> >
>> >> > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.3.1.135%3A33061-123#1986674260]
>> >> > was not delivered. [39] dead letters encountered. This logging can be
>> >> > turned
>> >> > off or adjusted with configuration settings 'akka.log-dead-letters'
>> >> > and
>> >> > 'akka.log-dead-letters-during-shutdown'.
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 ERROR EndpointWriter: AssociationError
>> >> > [akka.tcp://sparkMaster@master:7077] ->
>> >> > [akka.tcp://spark@slave2:45172]:
>> >> > Error [Association failed with [akka.tcp://spark@slave2:45172]] [
>> >> > akka.remote.EndpointAssociationException: Association failed with
>> >> > [akka.tcp://spark@slave2:45172]
>> >> > Caused by:
>> >> >
>> >> > akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> >> > Connection refused: slave2/10.3.1.135:45172
>> >> > ]
>> >> > 14/07/04 06:29:39 INFO Master: akka.tcp://spark@slave2:45172 got
>> >> > disassociated, removing it.
>> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> >> > app-20140704062238-0033/5
>> >> > 14/07/04 06:29:40 WARN Master: Got status update for unknown executor
>> >> > app-20140704062238-0033/4
>> >> >
>> >> >
>> >> > Coincidentally, after the initial executor failed, each following
>> >> > executor
>> >> > that was re-spawned failed with the following logs:
>> >> > (e.g the following was from
>> >> > slave1:~/spark-1.0.1-rc1/work/app-20140704062238-0033/2/stderr)
>> >> >
>> >> > log4j:WARN No appenders could be found for logger
>> >> > (org.apache.hadoop.conf.Configuration).
>> >> > log4j:WARN Please initialize the log4j system properly.
>> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>> >> > for
>> >> > more info.
>> >> > 14/07/04 06:29:35 INFO SparkHadoopUtil: Using Spark's default log4j
>> >> > profile:
>> >> > org/apache/spark/log4j-defaults.properties
>> >> > 14/07/04 06:29:35 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:29:35 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:29:35 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:29:35 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://sparkExecutor@slave1:54782]
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting now listens on addresses:
>> >> > [akka.tcp://sparkExecutor@slave1:54782]
>> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Connecting to
>> >> > driver:
>> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Connecting to worker
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:29:36 INFO WorkerWatcher: Successfully connected to
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:29:36 INFO CoarseGrainedExecutorBackend: Successfully
>> >> > registered
>> >> > with driver
>> >> > 14/07/04 06:29:36 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:29:36 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:29:36 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:29:36 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:29:36 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://spark@slave1:39753]
>> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to MapOutputTracker:
>> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> >> > 14/07/04 06:29:36 INFO SparkEnv: Connecting to BlockManagerMaster:
>> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> >> > 14/07/04 06:29:36 INFO DiskBlockManager: Created local directory at
>> >> > /tmp/spark-local-20140704062936-6123
>> >> > 14/07/04 06:29:36 INFO MemoryStore: MemoryStore started with capacity
>> >> > 6.7
>> >> > GB.
>> >> > 14/07/04 06:29:36 INFO ConnectionManager: Bound socket to port 50960
>> >> > with id
>> >> > = ConnectionManagerId(slave1,50960)
>> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Trying to register
>> >> > BlockManager
>> >> > 14/07/04 06:29:36 INFO BlockManagerMaster: Registered BlockManager
>> >> > 14/07/04 06:29:36 INFO HttpFileServer: HTTP File server directory is
>> >> > /tmp/spark-42c2782f-60f8-45a7-9e11-c789fc87fe2e
>> >> > 14/07/04 06:29:36 INFO HttpServer: Starting HTTP Server
>> >> > 14/07/04 06:29:36 ERROR CoarseGrainedExecutorBackend: Driver
>> >> > Disassociated
>> >> > [akka.tcp://sparkExecutor@slave1:54782] ->
>> >> > [akka.tcp://spark@master:45172]
>> >> > disassociated! Shutting down.
>> >> >
>> >> > In case of the initial executor that successfully started, the
>> >> > corresponding
>> >> > log messages (from
>> >> > spark-1.0.1-rc1/work/app-20140704062238-0033/1/stderr) on
>> >> > the executor were:
>> >> > log4j:WARN No appenders could be found for logger
>> >> > (org.apache.hadoop.conf.Configuration).
>> >> > log4j:WARN Please initialize the log4j system properly.
>> >> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>> >> > for
>> >> > more info.
>> >> > 14/07/04 06:22:39 INFO SparkHadoopUtil: Using Spark's default log4j
>> >> > profile:
>> >> > org/apache/spark/log4j-defaults.properties
>> >> > 14/07/04 06:22:39 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:22:39 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:22:39 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:22:39 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:22:39 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://sparkExecutor@slave1:50806]
>> >> > 14/07/04 06:22:39 INFO Remoting: Remoting now listens on addresses:
>> >> > [akka.tcp://sparkExecutor@slave1:50806]
>> >> > 14/07/04 06:22:39 INFO CoarseGrainedExecutorBackend: Connecting to
>> >> > driver:
>> >> > akka.tcp://spark@master:45172/user/CoarseGrainedScheduler
>> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Connecting to worker
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:22:39 INFO WorkerWatcher: Successfully connected to
>> >> > akka.tcp://sparkWorker@slave1:40182/user/Worker
>> >> > 14/07/04 06:22:40 INFO CoarseGrainedExecutorBackend: Successfully
>> >> > registered
>> >> > with driver
>> >> > 14/07/04 06:22:40 INFO SecurityManager: Changing view acls to: user1
>> >> > 14/07/04 06:22:40 INFO SecurityManager: SecurityManager:
>> >> > authentication
>> >> > disabled; ui acls disabled; users with view permissions: Set(user1)
>> >> > 14/07/04 06:22:40 INFO Slf4jLogger: Slf4jLogger started
>> >> > 14/07/04 06:22:40 INFO Remoting: Starting remoting
>> >> > 14/07/04 06:22:40 INFO Remoting: Remoting started; listening on
>> >> > addresses
>> >> > :[akka.tcp://spark@slave1:38558]
>> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to MapOutputTracker:
>> >> > akka.tcp://spark@master:45172/user/MapOutputTracker
>> >> > 14/07/04 06:22:40 INFO SparkEnv: Connecting to BlockManagerMaster:
>> >> > akka.tcp://spark@master:45172/user/BlockManagerMaster
>> >> > 14/07/04 06:22:40 INFO DiskBlockManager: Created local directory at
>> >> > /tmp/spark-local-20140704062240-6a65
>> >> > 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity
>> >> > 6.7
>> >> > GB.
>> >> > 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901
>> >> > with id
>> >> > = ConnectionManagerId(slave1,46901)
>> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register
>> >> > BlockManager
>> >> > 14/07/04 06:22:40 INFO BlockManagerMaster: Registered BlockManager
>> >> > 14/07/04 06:22:40 INFO HttpFileServer: HTTP File server directory is
>> >> > /tmp/spark-9eba78f9-8ae9-477c-9338-7222ae6fe306
>> >> > 14/07/04 06:22:40 INFO HttpServer: Starting HTTP Server
>> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task 0
>> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 0
>> >> > 14/07/04 06:22:42 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> > task 2
>> >> > 14/07/04 06:22:42 INFO Executor: Running task ID 2
>> >> > ...
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Jul 4, 2014 at 5:52 AM, Xiangrui Meng <men...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> The feature dimension is small. You don't need a big akka.frameSize.
>> >> >> The default one (10M) should be sufficient. Did you cache the data
>> >> >> before calling LRWithSGD? -Xiangrui
>> >> >>
>> >> >> On Thu, Jul 3, 2014 at 10:02 AM, Bharath Ravi Kumar
>> >> >> <reachb...@gmail.com>
>> >> >> wrote:
>> >> >> > I tried another run after setting the driver memory to 8G (and
>> >> >> > spark.akka.frameSize = 500 on the executors and the driver). In
>> >> >> > addition, I
>> >> >> > also tried to reduce the amount of data that a single task
>> >> >> > processes,
>> >> >> > by
>> >> >> > increasing the number of partitions (of the labeled points) to 120
>> >> >> > (instead
>> >> >> > of 2 used earlier), and then setting max cores to 2. That made no
>> >> >> > difference
>> >> >> > since, at the end of 120 tasks, the familiar error message
>> >> >> > appeared
>> >> >> > on a
>> >> >> > slave:
>> >> >> >
>> >> >> > <snipped earlier logs>
>> >> >> > 14/07/03 16:18:48 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> >> > task
>> >> >> > 1436
>> >> >> > 14/07/03 16:18:48 INFO Executor: Running task ID 1436
>> >> >> > 14/07/03 16:18:53 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00014:0+2215337
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00014:2215337+2215338
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00003:0+2196429
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00003:2196429+2196430
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00010:0+2186751
>> >> >> > 14/07/03 16:18:54 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00010:2186751+2186751
>> >> >> > 14/07/03 16:18:54 INFO Executor: Serialized size of result for
>> >> >> > 1436
>> >> >> > is
>> >> >> > 5958822
>> >> >> > 14/07/03 16:18:54 INFO Executor: Sending result for 1436 directly
>> >> >> > to
>> >> >> > driver
>> >> >> > 14/07/03 16:18:54 INFO Executor: Finished task ID 1436
>> >> >> > 14/07/03 16:18:54 INFO CoarseGrainedExecutorBackend: Got assigned
>> >> >> > task
>> >> >> > 1438
>> >> >> > 14/07/03 16:18:54 INFO Executor: Running task ID 1438
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00004:0+2209615
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00004:2209615+2209616
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00011:0+2202240
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00011:2202240+2202240
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00009:0+2194423
>> >> >> > 14/07/03 16:19:00 INFO HadoopRDD: Input split:
>> >> >> > file:~//2014-05-24-02/part-r-00009:2194423+2194424
>> >> >> > 14/07/03 16:19:00 INFO Executor: Serialized size of result for
>> >> >> > 1438
>> >> >> > is
>> >> >> > 5958822
>> >> >> > 14/07/03 16:19:00 INFO Executor: Sending result for 1438 directly
>> >> >> > to
>> >> >> > driver
>> >> >> > 14/07/03 16:19:00 INFO Executor: Finished task ID 1438
>> >> >> > 14/07/03 16:19:14 ERROR CoarseGrainedExecutorBackend: Driver
>> >> >> > Disassociated
>> >> >> > [akka.tcp://sparkExecutor@slave1:51099] ->
>> >> >> > [akka.tcp://spark@master:58272]
>> >> >> > disassociated! Shutting down.
>> >> >> >
>> >> >> >
>> >> >> > The corresponding master logs were:
>> >> >> >
>> >> >> > 4/07/03 16:02:14 INFO Master: Registering app LogRegExp
>> >> >> > 14/07/03 16:02:14 INFO Master: Registered app LogRegExp with ID
>> >> >> > app-20140703160214-0028
>> >> >> > 14/07/03 16:02:14 INFO Master: Launching executor
>> >> >> > app-20140703160214-0028/1
>> >> >> > on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor
>> >> >> > app-20140703160214-0028/1
>> >> >> > because it is EXITED
>> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor
>> >> >> > app-20140703160214-0028/2
>> >> >> > on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:15 INFO Master: Removing executor
>> >> >> > app-20140703160214-0028/0
>> >> >> > because it is EXITED
>> >> >> > 14/07/03 16:19:15 INFO Master: Launching executor
>> >> >> > app-20140703160214-0028/3
>> >> >> > on worker worker-20140630102913-slave2-44735
>> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor
>> >> >> > app-20140703160214-0028/2
>> >> >> > because it is EXITED
>> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor
>> >> >> > app-20140703160214-0028/4
>> >> >> > on worker worker-20140630124441-slave1-40182
>> >> >> > 14/07/03 16:19:18 INFO Master: Removing executor
>> >> >> > app-20140703160214-0028/3
>> >> >> > because it is EXITED
>> >> >> > 14/07/03 16:19:18 INFO Master: Launching executor
>> >> >> > app-20140703160214-0028/5
>> >> >> > on worker worker-20140630102913-slave2-44735
>> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> >> >> > disassociated, removing it.
>> >> >> > 14/07/03 16:19:20 INFO Master: Removing app
>> >> >> > app-20140703160214-0028
>> >> >> > 14/07/03 16:19:20 INFO Master: akka.tcp://spark@master:58272 got
>> >> >> > disassociated, removing it.
>> >> >> >
>> >> >> >
>> >> >> > Throughout the execution, I confirmed in the UI that driver memory
>> >> >> > used
>> >> >> > was
>> >> >> > 0.0 B / 6.9 GB and each executor's memory showed 0.0 B / 12.1 GB
>> >> >> > even
>> >> >> > when
>> >> >> > aggregate was being executed. On a related note, I noticed in the
>> >> >> > executors
>> >> >> > tab that just before the entire job terminated, executors on
>> >> >> > slave1,
>> >> >> > slave2
>> >> >> > and the driver "disappeared" momentarily from the active executors
>> >> >> > list.
>> >> >> > The
>> >> >> > replacement  executors on slave1 and slave2 were re-spawned a
>> >> >> > couple
>> >> >> > of
>> >> >> > times and appeared on the executors list again before they too
>> >> >> > died
>> >> >> > and
>> >> >> > the
>> >> >> > job failed.
>> >> >> > So it appears that no matter what the task input-result size, the
>> >> >> > execution
>> >> >> > fails at the end of the stage corresponding to
>> >> >> > GradientDescent.aggregate
>> >> >> > (and the preceding count() in GradientDescent goes through fine).
>> >> >> > Let
>> >> >> > me
>> >> >> > know if you need any additional information.
>> >> >> >
>> >> >> >
>> >> >> > On Thu, Jul 3, 2014 at 12:27 PM, Xiangrui Meng <men...@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Could you check the driver memory in the executor tab of the
>> >> >> >> Spark
>> >> >> >> UI
>> >> >> >> when the job is running? If it is too small, please set
>> >> >> >> --driver-memory with spark-submit, e.g. 10g. Could you also
>> >> >> >> attach
>> >> >> >> the
>> >> >> >> master log under spark/logs as well? -Xiangrui
>> >> >> >>
>> >> >> >> On Wed, Jul 2, 2014 at 9:34 AM, Bharath Ravi Kumar
>> >> >> >> <reachb...@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi Xiangrui,
>> >> >> >> >
>> >> >> >> > The issue with aggergating/counting over large feature vectors
>> >> >> >> > (as
>> >> >> >> > part
>> >> >> >> > of
>> >> >> >> > LogisticRegressionWithSGD) continues to exist, but now in
>> >> >> >> > another
>> >> >> >> > form:
>> >> >> >> > while the execution doesn't freeze (due to SPARK-1112), it now
>> >> >> >> > fails
>> >> >> >> > at
>> >> >> >> > the
>> >> >> >> > second or third gradient descent iteration consistently with an
>> >> >> >> > error
>> >> >> >> > level
>> >> >> >> > log message, but no stacktrace. I'm running against 1.0.1-rc1,
>> >> >> >> > and
>> >> >> >> > have
>> >> >> >> > tried setting spark.akka.frameSize as high as 500. When the
>> >> >> >> > execution
>> >> >> >> > fails,
>> >> >> >> > each of the two executors log the following message
>> >> >> >> > (corresponding
>> >> >> >> > to
>> >> >> >> > aggregate at GradientDescent.scala:178) :
>> >> >> >> >
>> >> >> >> > 14/07/02 14:09:09 INFO
>> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> >> > maxBytesInFlight: 50331648, targetRequestSize: 10066329
>> >> >> >> > 14/07/02 14:09:09 INFO
>> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> >> > Getting 2 non-empty blocks out of 2 blocks
>> >> >> >> > 14/07/02 14:09:09 INFO
>> >> >> >> > BlockFetcherIterator$BasicBlockFetcherIterator:
>> >> >> >> > Started 1 remote fetches in 0 ms
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Serialized size of result for
>> >> >> >> > 737
>> >> >> >> > is
>> >> >> >> > 5959086
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Sending result for 737
>> >> >> >> > directly
>> >> >> >> > to
>> >> >> >> > driver
>> >> >> >> > 14/07/02 14:09:11 INFO Executor: Finished task ID 737
>> >> >> >> > 14/07/02 14:09:18 ERROR CoarseGrainedExecutorBackend: Driver
>> >> >> >> > Disassociated
>> >> >> >> > [akka.tcp://sparkExecutor@(slave1,slave2):51941] ->
>> >> >> >> > [akka.tcp://spark@master:59487] disassociated! Shutting down.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > There is no separate stacktrace on the driver side.
>> >> >> >> >
>> >> >> >> > Each input record is of the form p1, p2, (p1,p2) where p1, p2 &
>> >> >> >> > (p1,p2)
>> >> >> >> > are
>> >> >> >> > categorical features with large cardinality, and X is the
>> >> >> >> > double
>> >> >> >> > label
>> >> >> >> > with
>> >> >> >> > a continuous value. The categorical variables are converted to
>> >> >> >> > binary
>> >> >> >> > variables which results in a feature vector of size 741092
>> >> >> >> > (composed
>> >> >> >> > of
>> >> >> >> > all
>> >> >> >> > unique categories across p1, p2 and (p1,p2)). Thus, the labeled
>> >> >> >> > point
>> >> >> >> > for
>> >> >> >> > input record is a sparse vector of size 741092 with only 3
>> >> >> >> > variables
>> >> >> >> > set
>> >> >> >> > in
>> >> >> >> > the record. The total number of records is 683233 after
>> >> >> >> > aggregating
>> >> >> >> > the
>> >> >> >> > input data on (p1, p2). When attempting to train on the
>> >> >> >> > unaggregated
>> >> >> >> > records
>> >> >> >> > (1337907 in number spread across 455 files), the execution
>> >> >> >> > fails
>> >> >> >> > at
>> >> >> >> > count,
>> >> >> >> > GradientDescent.scala:161 with the following log
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > (Snipped lines corresponding to other input files)
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00012:2834590+2834590
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00005:0+2845559
>> >> >> >> > 14/07/02 16:02:03 INFO HadoopRDD: Input split:
>> >> >> >> > file:~/part-r-00005:2845559+2845560
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Serialized size of result for
>> >> >> >> > 726
>> >> >> >> > is
>> >> >> >> > 615
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Sending result for 726
>> >> >> >> > directly
>> >> >> >> > to
>> >> >> >> > driver
>> >> >> >> > 14/07/02 16:02:03 INFO Executor: Finished task ID 726
>> >> >> >> > 14/07/02 16:02:12 ERROR CoarseGrainedExecutorBackend: Driver
>> >> >> >> > Disassociated
>> >> >> >> > [akka.tcp://sparkExecutor@slave1:48423] ->
>> >> >> >> > [akka.tcp://spark@master:55792]
>> >> >> >> > disassociated! Shutting down.
>> >> >> >> >
>> >> >> >> > A count() attempted on the input RDD before beginning training
>> >> >> >> > has
>> >> >> >> > the
>> >> >> >> > following metrics:
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Metric            Min        25th    Median    75th     Max
>> >> >> >> >
>> >> >> >> > Result
>> >> >> >> > serialization
>> >> >> >> > time            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >> >
>> >> >> >> > Duration        33 s    33 s    35 s    35 s    35 s
>> >> >> >> >
>> >> >> >> > Time spent
>> >> >> >> > fetching task
>> >> >> >> > results            0 ms    0 ms    0 ms    0 ms    0 ms
>> >> >> >> >
>> >> >> >> > Scheduler
>> >> >> >> > delay            0.1 s    0.1 s    0.3 s    0.3 s    0.3 s
>> >> >> >> >
>> >> >> >> > Aggregated Metrics by Executor
>> >> >> >> >
>> >> >> >> > ID     Address Task             Time Total Failed Succeeded
>> >> >> >> > Shuffle
>> >> >> >> > Read
>> >> >> >> > Shuffle Write     Shuf Spill (Mem)     Shuf Spill (Disk)
>> >> >> >> > 0     CANNOT FIND ADDRESS     34 s     1     0         1
>> >> >> >> > 0.0
>> >> >> >> > B
>> >> >> >> > 0.0 B             0.0 B                 0.0 B
>> >> >> >> > 1     CANNOT FIND ADDRESS     36 s     1     0         1
>> >> >> >> > 0.0
>> >> >> >> > B
>> >> >> >> > 0.0 B             0.0 B                 0.0 B
>> >> >> >> >
>> >> >> >> > Tasks
>> >> >> >> >
>> >> >> >> > Task Index    Task ID    Status    Locality Level    Executor
>> >> >> >> > Launch
>> >> >> >> > Time
>> >> >> >> > Duration    GC Time    Result Ser Time    Errors
>> >> >> >> > 0     726     SUCCESS         PROCESS_LOCAL     slave1
>> >> >> >> > 2014/07/02
>> >> >> >> > 16:01:28 35 s         0.1 s
>> >> >> >> > 1     727     SUCCESS         PROCESS_LOCAL     slave2
>> >> >> >> > 2014/07/02
>> >> >> >> > 16:01:28 33 s         99 ms
>> >> >> >> >
>> >> >> >> > Any pointers / diagnosis please?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
>> >> >> >> > <reachb...@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> Thanks. I'll await the fix to re-run my test.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng
>> >> >> >> >> <men...@gmail.com>
>> >> >> >> >> wrote:
>> >> >> >> >>>
>> >> >> >> >>> Hi Bharath,
>> >> >> >> >>>
>> >> >> >> >>> This is related to SPARK-1112, which we already found the
>> >> >> >> >>> root
>> >> >> >> >>> cause.
>> >> >> >> >>> I will let you know when this is fixed.
>> >> >> >> >>>
>> >> >> >> >>> Best,
>> >> >> >> >>> Xiangrui
>> >> >> >> >>>
>> >> >> >> >>> On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
>> >> >> >> >>> <reachb...@gmail.com>
>> >> >> >> >>> wrote:
>> >> >> >> >>> > Couple more points:
>> >> >> >> >>> > 1)The inexplicable stalling of execution with large feature
>> >> >> >> >>> > sets
>> >> >> >> >>> > appears
>> >> >> >> >>> > similar to that reported with the news-20 dataset:
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> > http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3c53a03542.1010...@gmail.com%3E
>> >> >> >> >>> >
>> >> >> >> >>> > 2) The NPE trying to call mapToPair convert an RDD<Long,
>> >> >> >> >>> > Long,
>> >> >> >> >>> > Integer,
>> >> >> >> >>> > Integer> into a JavaPairRDD<Tuple2<Long,Long>,
>> >> >> >> >>> > Tuple2<Integer,Integer>>
>> >> >> >> >>> > is
>> >> >> >> >>> > unrelated to mllib.
>> >> >> >> >>> >
>> >> >> >> >>> > Thanks,
>> >> >> >> >>> > Bharath
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>> > On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
>> >> >> >> >>> > <reachb...@gmail.com>
>> >> >> >> >>> > wrote:
>> >> >> >> >>> >>
>> >> >> >> >>> >> Hi  Xiangrui ,
>> >> >> >> >>> >>
>> >> >> >> >>> >> I'm using 1.0.0.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Thanks,
>> >> >> >> >>> >> Bharath
>> >> >> >> >>> >>
>> >> >> >> >>> >> On 18-Jun-2014 1:43 am, "Xiangrui Meng" <men...@gmail.com>
>> >> >> >> >>> >> wrote:
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Hi Bharath,
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Thanks for posting the details! Which Spark version are
>> >> >> >> >>> >>> you
>> >> >> >> >>> >>> using?
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> Best,
>> >> >> >> >>> >>> Xiangrui
>> >> >> >> >>> >>>
>> >> >> >> >>> >>> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar
>> >> >> >> >>> >>> <reachb...@gmail.com>
>> >> >> >> >>> >>> wrote:
>> >> >> >> >>> >>> > Hi,
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > (Apologies for the long mail, but it's necessary to
>> >> >> >> >>> >>> > provide
>> >> >> >> >>> >>> > sufficient
>> >> >> >> >>> >>> > details considering the number of issues faced.)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I'm running into issues testing
>> >> >> >> >>> >>> > LogisticRegressionWithSGD
>> >> >> >> >>> >>> > a
>> >> >> >> >>> >>> > two
>> >> >> >> >>> >>> > node
>> >> >> >> >>> >>> > cluster
>> >> >> >> >>> >>> > (each node with 24 cores and 16G available to slaves
>> >> >> >> >>> >>> > out
>> >> >> >> >>> >>> > of
>> >> >> >> >>> >>> > 24G
>> >> >> >> >>> >>> > on
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > system). Here's a description of the application:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > The model is being trained based on categorical
>> >> >> >> >>> >>> > features
>> >> >> >> >>> >>> > x,
>> >> >> >> >>> >>> > y,
>> >> >> >> >>> >>> > and
>> >> >> >> >>> >>> > (x,y).
>> >> >> >> >>> >>> > The categorical features are mapped to binary features
>> >> >> >> >>> >>> > by
>> >> >> >> >>> >>> > converting
>> >> >> >> >>> >>> > each
>> >> >> >> >>> >>> > distinct value in the category enum into a binary
>> >> >> >> >>> >>> > feature
>> >> >> >> >>> >>> > by
>> >> >> >> >>> >>> > itself
>> >> >> >> >>> >>> > (i.e
>> >> >> >> >>> >>> > presence of that value in a record implies
>> >> >> >> >>> >>> > corresponding
>> >> >> >> >>> >>> > feature
>> >> >> >> >>> >>> > =
>> >> >> >> >>> >>> > 1,
>> >> >> >> >>> >>> > else
>> >> >> >> >>> >>> > feature = 0. So, there'd be as many distinct features
>> >> >> >> >>> >>> > as
>> >> >> >> >>> >>> > enum
>> >> >> >> >>> >>> > values) .
>> >> >> >> >>> >>> > The
>> >> >> >> >>> >>> > training vector is laid out as
>> >> >> >> >>> >>> > [x1,x2...xn,y1,y2....yn,(x1,y1),(x2,y2)...(xn,yn)].
>> >> >> >> >>> >>> > Each
>> >> >> >> >>> >>> > record
>> >> >> >> >>> >>> > in
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > training data has only one combination (Xk,Yk) and a
>> >> >> >> >>> >>> > label
>> >> >> >> >>> >>> > appearing in
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > record. Thus, the corresponding labeledpoint sparse
>> >> >> >> >>> >>> > vector
>> >> >> >> >>> >>> > would
>> >> >> >> >>> >>> > only
>> >> >> >> >>> >>> > have 3
>> >> >> >> >>> >>> > values Xk, Yk, (Xk,Yk) set for a record. The total
>> >> >> >> >>> >>> > length
>> >> >> >> >>> >>> > of
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > vector
>> >> >> >> >>> >>> > (though parse) would be nearly 614000.  The number of
>> >> >> >> >>> >>> > records
>> >> >> >> >>> >>> > is
>> >> >> >> >>> >>> > about
>> >> >> >> >>> >>> > 1.33
>> >> >> >> >>> >>> > million. The records have been coalesced into 20
>> >> >> >> >>> >>> > partitions
>> >> >> >> >>> >>> > across
>> >> >> >> >>> >>> > two
>> >> >> >> >>> >>> > nodes. The input data has not been cached.
>> >> >> >> >>> >>> > (NOTE: I do realize the records & features may seem
>> >> >> >> >>> >>> > large
>> >> >> >> >>> >>> > for
>> >> >> >> >>> >>> > a
>> >> >> >> >>> >>> > two
>> >> >> >> >>> >>> > node
>> >> >> >> >>> >>> > setup, but given the memory & cpu, and the fact that
>> >> >> >> >>> >>> > I'm
>> >> >> >> >>> >>> > willing
>> >> >> >> >>> >>> > to
>> >> >> >> >>> >>> > give up
>> >> >> >> >>> >>> > some turnaround time, I don't see why tasks should
>> >> >> >> >>> >>> > inexplicably
>> >> >> >> >>> >>> > fail)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Additional parameters include:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > spark.executor.memory = 14G
>> >> >> >> >>> >>> > spark.default.parallelism = 1
>> >> >> >> >>> >>> > spark.cores.max=20
>> >> >> >> >>> >>> > spark.storage.memoryFraction=0.8 //No cache space
>> >> >> >> >>> >>> > required
>> >> >> >> >>> >>> > (Trying to set spark.akka.frameSize to a larger number,
>> >> >> >> >>> >>> > say,
>> >> >> >> >>> >>> > 20
>> >> >> >> >>> >>> > didn't
>> >> >> >> >>> >>> > help
>> >> >> >> >>> >>> > either)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > The model training was initialized as : new
>> >> >> >> >>> >>> > LogisticRegressionWithSGD(1,
>> >> >> >> >>> >>> > maxIterations, 0.0, 0.05)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > However, after 4 iterations of gradient descent, the
>> >> >> >> >>> >>> > entire
>> >> >> >> >>> >>> > execution
>> >> >> >> >>> >>> > appeared to stall inexplicably. The corresponding
>> >> >> >> >>> >>> > executor
>> >> >> >> >>> >>> > details
>> >> >> >> >>> >>> > and
>> >> >> >> >>> >>> > details of the stalled stage (number 14) are as
>> >> >> >> >>> >>> > follows:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Metric                        Min        25th
>> >> >> >> >>> >>> > Median
>> >> >> >> >>> >>> > 75th
>> >> >> >> >>> >>> > Max
>> >> >> >> >>> >>> > Result serialization time    12 ms    13 ms    14 ms
>> >> >> >> >>> >>> > 16
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > 18
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > Duration                    4 s        4 s        5 s
>> >> >> >> >>> >>> > 5 s
>> >> >> >> >>> >>> > 5 s
>> >> >> >> >>> >>> > Time spent fetching task     0 ms    0 ms    0 ms    0
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > 0
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > results
>> >> >> >> >>> >>> > Scheduler delay                6 s        6 s        6
>> >> >> >> >>> >>> > s
>> >> >> >> >>> >>> > 6 s
>> >> >> >> >>> >>> > 12 s
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Stage Id
>> >> >> >> >>> >>> > 14 aggregate at GradientDescent.scala:178
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Task Index    Task ID    Status    Locality Level
>> >> >> >> >>> >>> > Executor
>> >> >> >> >>> >>> > Launch Time                Duration    GC     Result
>> >> >> >> >>> >>> > Ser
>> >> >> >> >>> >>> > Time
>> >> >> >> >>> >>> > Errors
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Time
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 0     600     RUNNING     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 1     601     RUNNING     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27         1.1 h
>> >> >> >> >>> >>> > 2     602     RUNNING     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 3     603     RUNNING     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27         1.1 h
>> >> >> >> >>> >>> > 4     604     RUNNING     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     1.1 h
>> >> >> >> >>> >>> > 5     605     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
>> >> >> >> >>> >>> > 6     606     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
>> >> >> >> >>> >>> > 7     607     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     2 s     12 ms
>> >> >> >> >>> >>> > 8     608     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 9     609     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     14 ms
>> >> >> >> >>> >>> > 10     610     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 11     611     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     13 ms
>> >> >> >> >>> >>> > 12     612     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     18 ms
>> >> >> >> >>> >>> > 13     613     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     13 ms
>> >> >> >> >>> >>> > 14     614     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     4 s     1 s     14 ms
>> >> >> >> >>> >>> > 15     615     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     12 ms
>> >> >> >> >>> >>> > 16     616     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     15 ms
>> >> >> >> >>> >>> > 17     617     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 5 s     1 s     18 ms
>> >> >> >> >>> >>> > 18     618     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > serious.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27     5 s     1 s     16 ms
>> >> >> >> >>> >>> > 19     619     SUCCESS     PROCESS_LOCAL
>> >> >> >> >>> >>> > casual.dataone.foo.bar.com
>> >> >> >> >>> >>> > 2014/06/17 10:32:27 4 s     1 s     18 ms
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor stats:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > RDD Blocks    Memory Used    Disk Used    Active Tasks
>> >> >> >> >>> >>> > Failed
>> >> >> >> >>> >>> > Tasks
>> >> >> >> >>> >>> > Complete Tasks    Total Tasks    Task Time    Shuffle
>> >> >> >> >>> >>> > Read
>> >> >> >> >>> >>> > Shuffle
>> >> >> >> >>> >>> > Write
>> >> >> >> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         2
>> >> >> >> >>> >>> > 0
>> >> >> >> >>> >>> > 307         309         23.2 m         0.0 B
>> >> >> >> >>> >>> > 0.0
>> >> >> >> >>> >>> > B
>> >> >> >> >>> >>> > 0     0.0 B / 6.7 GB         0.0 B         3
>> >> >> >> >>> >>> > 0
>> >> >> >> >>> >>> > 308         311         22.4 m         0.0 B
>> >> >> >> >>> >>> > 0.0
>> >> >> >> >>> >>> > B
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor jmap output:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Server compiler detected.
>> >> >> >> >>> >>> > JVM version is 24.55-b03
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > using thread-local object allocation.
>> >> >> >> >>> >>> > Parallel GC with 18 thread(s)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Heap Configuration:
>> >> >> >> >>> >>> >    MinHeapFreeRatio = 40
>> >> >> >> >>> >>> >    MaxHeapFreeRatio = 70
>> >> >> >> >>> >>> >    MaxHeapSize      = 10737418240 (10240.0MB)
>> >> >> >> >>> >>> >    NewSize          = 1310720 (1.25MB)
>> >> >> >> >>> >>> >    MaxNewSize       = 17592186044415 MB
>> >> >> >> >>> >>> >    OldSize          = 5439488 (5.1875MB)
>> >> >> >> >>> >>> >    NewRatio         = 2
>> >> >> >> >>> >>> >    SurvivorRatio    = 8
>> >> >> >> >>> >>> >    PermSize         = 21757952 (20.75MB)
>> >> >> >> >>> >>> >    MaxPermSize      = 134217728 (128.0MB)
>> >> >> >> >>> >>> >    G1HeapRegionSize = 0 (0.0MB)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Heap Usage:
>> >> >> >> >>> >>> > PS Young Generation
>> >> >> >> >>> >>> > Eden Space:
>> >> >> >> >>> >>> >    capacity = 2783969280 (2655.0MB)
>> >> >> >> >>> >>> >    used     = 192583816 (183.66223907470703MB)
>> >> >> >> >>> >>> >    free     = 2591385464 (2471.337760925293MB)
>> >> >> >> >>> >>> >    6.917598458557704% used
>> >> >> >> >>> >>> > From Space:
>> >> >> >> >>> >>> >    capacity = 409993216 (391.0MB)
>> >> >> >> >>> >>> >    used     = 1179808 (1.125152587890625MB)
>> >> >> >> >>> >>> >    free     = 408813408 (389.8748474121094MB)
>> >> >> >> >>> >>> >    0.2877628102022059% used
>> >> >> >> >>> >>> > To Space:
>> >> >> >> >>> >>> >    capacity = 385351680 (367.5MB)
>> >> >> >> >>> >>> >    used     = 0 (0.0MB)
>> >> >> >> >>> >>> >    free     = 385351680 (367.5MB)
>> >> >> >> >>> >>> >    0.0% used
>> >> >> >> >>> >>> > PS Old Generation
>> >> >> >> >>> >>> >    capacity = 7158628352 (6827.0MB)
>> >> >> >> >>> >>> >    used     = 4455093024 (4248.707794189453MB)
>> >> >> >> >>> >>> >    free     = 2703535328 (2578.292205810547MB)
>> >> >> >> >>> >>> >    62.2338918146983% used
>> >> >> >> >>> >>> > PS Perm Generation
>> >> >> >> >>> >>> >    capacity = 90701824 (86.5MB)
>> >> >> >> >>> >>> >    used     = 45348832 (43.248016357421875MB)
>> >> >> >> >>> >>> >    free     = 45352992 (43.251983642578125MB)
>> >> >> >> >>> >>> >    49.99770677158598% used
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 8432 interned Strings occupying 714672 bytes.
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Executor GC log snippet:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > 168.778: [GC [PSYoungGen: 2702831K->578545K(2916864K)]
>> >> >> >> >>> >>> > 9302453K->7460857K(9907712K), 0.3193550 secs] [Times:
>> >> >> >> >>> >>> > user=5.13
>> >> >> >> >>> >>> > sys=0.39,
>> >> >> >> >>> >>> > real=0.32 secs]
>> >> >> >> >>> >>> > 169.097: [Full GC [PSYoungGen: 578545K->0K(2916864K)]
>> >> >> >> >>> >>> > [ParOldGen:
>> >> >> >> >>> >>> > 6882312K->1073297K(6990848K)]
>> >> >> >> >>> >>> > 7460857K->1073297K(9907712K)
>> >> >> >> >>> >>> > [PSPermGen:
>> >> >> >> >>> >>> > 44248K->44201K(88576K)], 4.5521090 secs] [Times:
>> >> >> >> >>> >>> > user=24.22
>> >> >> >> >>> >>> > sys=0.18,
>> >> >> >> >>> >>> > real=4.55 secs]
>> >> >> >> >>> >>> > 174.207: [GC [PSYoungGen: 2338304K->81315K(2544128K)]
>> >> >> >> >>> >>> > 3411653K->1154665K(9534976K), 0.0966280 secs] [Times:
>> >> >> >> >>> >>> > user=1.66
>> >> >> >> >>> >>> > sys=0.00,
>> >> >> >> >>> >>> > real=0.09 secs]
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I tried to map partitions to cores on the nodes.
>> >> >> >> >>> >>> > Increasing
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > number
>> >> >> >> >>> >>> > of
>> >> >> >> >>> >>> > partitions (say to 80 or 100) would result in progress
>> >> >> >> >>> >>> > till
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > 6th
>> >> >> >> >>> >>> > iteration or so, but the next stage would stall as
>> >> >> >> >>> >>> > before
>> >> >> >> >>> >>> > with
>> >> >> >> >>> >>> > apparent
>> >> >> >> >>> >>> > root
>> >> >> >> >>> >>> > cause / logs. With increased partitions, the last stage
>> >> >> >> >>> >>> > that
>> >> >> >> >>> >>> > completed
>> >> >> >> >>> >>> > had
>> >> >> >> >>> >>> > the following task times:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Metric                        Min        25th
>> >> >> >> >>> >>> > Median
>> >> >> >> >>> >>> > 75th
>> >> >> >> >>> >>> > Max
>> >> >> >> >>> >>> > Result serialization time    11 ms    12 ms    13 ms
>> >> >> >> >>> >>> > 15
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > 0.4 s
>> >> >> >> >>> >>> > Duration                    0.5 s    0.9 s    1 s
>> >> >> >> >>> >>> > 3
>> >> >> >> >>> >>> > s
>> >> >> >> >>> >>> > 7 s
>> >> >> >> >>> >>> > Time spent fetching            0 ms    0 ms    0 ms
>> >> >> >> >>> >>> > 0
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > 0
>> >> >> >> >>> >>> > ms
>> >> >> >> >>> >>> > task results
>> >> >> >> >>> >>> > Scheduler delay                5 s        6 s        6
>> >> >> >> >>> >>> > s
>> >> >> >> >>> >>> > 7 s
>> >> >> >> >>> >>> > 12 s
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > My hypothesis is that as the coefficient array becomes
>> >> >> >> >>> >>> > less
>> >> >> >> >>> >>> > sparse
>> >> >> >> >>> >>> > (with
>> >> >> >> >>> >>> > successive iterations), the cost of the aggregate goes
>> >> >> >> >>> >>> > up
>> >> >> >> >>> >>> > to
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > point
>> >> >> >> >>> >>> > that
>> >> >> >> >>> >>> > it stalls (which I failed to explain). Reducing the
>> >> >> >> >>> >>> > batch
>> >> >> >> >>> >>> > fraction
>> >> >> >> >>> >>> > to a
>> >> >> >> >>> >>> > very
>> >> >> >> >>> >>> > low number like 0.01 saw the iterations progress
>> >> >> >> >>> >>> > further,
>> >> >> >> >>> >>> > but
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > model
>> >> >> >> >>> >>> > failed to converge in that case after a small number of
>> >> >> >> >>> >>> > iterations.
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I also tried reducing the number of records by
>> >> >> >> >>> >>> > aggregating
>> >> >> >> >>> >>> > on
>> >> >> >> >>> >>> > (x,y)
>> >> >> >> >>> >>> > as
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > key (i.e. using aggregations instead of training on
>> >> >> >> >>> >>> > every
>> >> >> >> >>> >>> > raw
>> >> >> >> >>> >>> > record),
>> >> >> >> >>> >>> > but
>> >> >> >> >>> >>> > encountered by the following exception:
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Loss was due to java.lang.NullPointerException
>> >> >> >> >>> >>> > java.lang.NullPointerException
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> > org.apache.spark.scheduler.Task.run(Task.scala:51)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> >> >> >>> >>> >         at
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> >> >> >>> >>> >         at java.lang.Thread.run(Thread.java:745)
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > I'd appreciate any insights/comments about what may be
>> >> >> >> >>> >>> > causing
>> >> >> >> >>> >>> > the
>> >> >> >> >>> >>> > execution
>> >> >> >> >>> >>> > to stall.
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > If logs/tables appear poorly indented in the email,
>> >> >> >> >>> >>> > here's
>> >> >> >> >>> >>> > a
>> >> >> >> >>> >>> > gist
>> >> >> >> >>> >>> > with
>> >> >> >> >>> >>> > relevant details:
>> >> >> >> >>> >>> > https://gist.github.com/reachbach/a418ab2f01b639b624c1
>> >> >> >> >>> >>> >
>> >> >> >> >>> >>> > Thanks,
>> >> >> >> >>> >>> > Bharath
>> >> >> >> >>> >
>> >> >> >> >>> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

Re: Execution stalls in LogisticRegressionWithSGD

Reply via email to