If it matters, I have servers running at http://hivecluster2:4040/stages/ and http://hivecluster2:4041/stages/
When I run rdd.first, I see an item at http://hivecluster2:4041/stages/, but no tasks are running: Stage ID 1, first at <console>:46, Tasks: Succeeded/Total 0/16.

On Mon, Jun 2, 2014 at 10:09 AM, Russell Jurney <russell.jur...@gmail.com> wrote:
> Looks like just worker and master processes are running:
>
> [hivedata@hivecluster2 ~]$ jps
> 10425 Jps
>
> [hivedata@hivecluster2 ~]$ ps aux|grep spark
> hivedata 10424 0.0 0.0 103248 820 pts/3 S+ 10:05 0:00 grep spark
> root 10918 0.5 1.4 4752880 230512 ? Sl May27 41:43 java -cp :/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/conf:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib/jline.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path=/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip hivecluster2 --port 7077 --webui-port 18080
> root 12715 0.0 0.0 148028 656 ? S May27 0:00 sudo /opt/cloudera/parcels/SPARK/lib/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077
> root 12716 0.3 1.1 4155884 191340 ? Sl May27 30:21 java -cp :/opt/cloudera/parcels/SPARK/lib/spark/conf:/opt/cloudera/parcels/SPARK/lib/spark/core/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/repl/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/examples/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/bagel/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/mllib/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/streaming/lib/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/*:/etc/hadoop/conf:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop/../hadoop-mapreduce/*:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-library.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/scala-compiler.jar:/opt/cloudera/parcels/SPARK/lib/spark/lib/jline.jar -Dspark.akka.logLifecycleEvents=true -Djava.library.path=/opt/cloudera/parcels/SPARK/lib/spark/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://hivecluster2:7077
>
> On Sun, Jun 1, 2014 at 7:41 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> Sounds like you have two shells running, and the first one is taking all your resources. Do a "jps" and kill the other guy, then try again.
>>
>> By the way, you can look at http://localhost:8080 (replace localhost with the server your Spark Master is running on) to see what applications are currently started, and what resource allocations they have.
>>
>>
>> On Sun, Jun 1, 2014 at 6:47 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>
>>> Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
>>>
>>> This time I get a port-already-in-use exception on 4040, but it isn't fatal. Then when I run rdd.first, I get this over and over:
>>>
>>> 14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
>>>
>>>
>>> On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>> You can avoid that by using the constructor that takes a SparkConf, a la
>>>>
>>>> val conf = new SparkConf()
>>>> conf.setJars(Seq("avro.jar", ...))
>>>> val sc = new SparkContext(conf)
>>>>
>>>>
>>>> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>
>>>>> Followup question: the docs for making a new SparkContext say I need to know where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
>>>>>
>>>>>
>>>>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>>>
>>>>>> Gotcha. The easiest way to get your dependencies to your Executors would probably be to construct your SparkContext with all the necessary jars passed in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is a "necessary jar", but it's possible your application also needs to distribute other ones to the cluster.
>>>>>>
>>>>>> An easy way to make sure all your dependencies get shipped to the cluster is to create an assembly jar of your application; then you just need to tell Spark about that one jar, which includes all of your application's transitive dependencies. Maven and sbt both have pretty straightforward ways of producing assembly jars.
>>>>>>
>>>>>>
>>>>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks for the fast reply.
>>>>>>>
>>>>>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in standalone mode.
>>>>>>>
>>>>>>>
>>>>>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> First issue was because your cluster was configured incorrectly. You could probably read one file because that was done on the driver node, but when it tried to run a job on the cluster, it failed.
>>>>>>>>
>>>>>>>> Second issue: it seems that the jar containing Avro is not getting propagated to the Executors. What version of Spark are you running? What deployment mode (YARN, standalone, Mesos)?
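
For reference, a fuller sketch of the SparkConf route Aaron describes, for a Spark 0.9 standalone cluster like this one. The app name and jar paths below are hypothetical placeholders, not values taken from this thread; the master URL and parcel location come from the process listing above.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: app name and jar paths are placeholders.
val conf = new SparkConf()
  .setMaster("spark://hivecluster2:7077")        // standalone master from the worker command line above
  .setAppName("avro-read-test")                  // hypothetical application name
  .setJars(Seq("/path/to/avro.jar",              // jars to ship to the executors
               "/path/to/my-app-assembly.jar"))
// If the docs ask for SPARK_HOME, the Cloudera parcel's install appears to be
// /opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark (see the classpath above),
// so something like this may work:
// conf.setSparkHome("/opt/cloudera/parcels/SPARK-0.9.0-1.cdh4.6.0.p0.98/lib/spark")

val sc = new SparkContext(conf)

If the application is packaged as an assembly jar, setJars() can list just that single jar, since it already bundles the transitive dependencies Aaron mentions.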
>>>>>>>>
>>>>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Now I get this:
>>>>>>>>
>>>>>>>> scala> rdd.first
>>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>>>>
>>>>>>> --
>>>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>>>>
>>>>> --
>>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
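
For context, a rough sketch of how the rdd in the logs above could have been constructed, since the original shell command isn't shown in this thread: the HadoopRDD log lines point at Avro files under hdfs://hivecluster2/securityx/web_proxy_mef/, and the old-style org.apache.avro.mapred input format is one plausible way to read them in Spark 0.9. The input path glob and partition count below are assumptions, not values from the thread.

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Sketch only; run in the spark-shell, where sc is the shell's SparkContext.
val jobConf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.setInputPaths(jobConf, "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/*.avro")  // assumed glob

val rdd = sc.hadoopRDD(
  jobConf,
  classOf[AvroInputFormat[GenericRecord]],   // yields (AvroWrapper[GenericRecord], NullWritable) pairs
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable],
  16)                                        // minimum splits; the logs above show 16 tasks

rdd.map(_._1.datum).first                    // first GenericRecord; triggers the "first at <console>" job seen above

If the Avro jar isn't on the executors' classpath, tasks like these are where the failure would surface, which is why Aaron suggests passing the jar via setJars() or shipping a single assembly jar.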