Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
This time I get a "port already in use" exception on 4040, but it isn't
fatal. Then when I run rdd.first, I get this over and over:

14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory

On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> You can avoid that by using the constructor that takes a SparkConf, a la
>
>   val conf = new SparkConf()
>   conf.setJars(Seq("avro.jar", ...))
>   val sc = new SparkContext(conf)
>
> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> Followup question: the docs for creating a new SparkContext require that
>> I know where $SPARK_HOME is. However, I have no idea. Any idea where that
>> might be?
>>
>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Gotcha. The easiest way to get your dependencies to your Executors would
>>> probably be to construct your SparkContext with all necessary jars passed
>>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro
>>> is a "necessary jar", but it's possible your application also needs to
>>> distribute other ones to the cluster.
>>>
>>> An easy way to make sure all your dependencies get shipped to the
>>> cluster is to create an assembly jar of your application; then you just
>>> need to tell Spark about that one jar, which includes all of your
>>> application's transitive dependencies. Maven and sbt both have pretty
>>> straightforward ways of producing assembly jars.
>>>
>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>
>>>> Thanks for the fast reply.
>>>>
>>>> I am running CDH 4.4 with the Cloudera parcel of Spark 0.9.0, in
>>>> standalone mode.
>>>>
>>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>>> The first issue was because your cluster was configured incorrectly.
>>>>> You could probably read one file because that was done on the driver
>>>>> node, but when it tried to run a job on the cluster, it failed.
>>>>>
>>>>> The second issue is that the jar containing Avro does not seem to be
>>>>> getting propagated to the Executors. What version of Spark are you
>>>>> running, and in what deployment mode (YARN, standalone, Mesos)?
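For reference, here is a fuller sketch of the SparkConf route Aaron
describes, with the port and memory symptoms from the top of the thread
folded in. The master URL, app name, jar paths, and memory value are
placeholders, not details from this thread; note that setJars in the Scala
API takes a Seq:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://hivecluster2:7077")  // assumption: standalone master URL; use your own
    .setAppName("avro-reader")               // hypothetical app name
    .setJars(Seq("/path/to/avro.jar",        // jars are shipped to every executor
                 "/path/to/your-app-assembly.jar"))
    .set("spark.executor.memory", "2g")      // executors need enough free memory to accept tasks
    .set("spark.ui.port", "4041")            // optional: sidestep the port-4040 collision when
                                             // another driver's UI is already bound to 4040
  val sc = new SparkContext(conf)

This constructor needs no $SPARK_HOME at all. The "Initial job has not
accepted any resources" warning usually means no worker has registered
with the master, or no registered worker has enough free memory or cores
for the job, so the standalone master's web UI (port 8080 by default) is
the first place to check.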
>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>
>>>>> Now I get this:
>>>>>
>>>>> scala> rdd.first
>>>>>
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>
>>>> --
>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
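Regarding Aaron's assembly-jar suggestion earlier in the thread, a minimal
sbt sketch of one way to do it, assuming the sbt-assembly plugin (the
plugin and library versions and the project name below are assumptions;
Maven users can get the same effect with the shade plugin). Marking
spark-core as "provided" keeps Spark itself out of the assembly, since the
cluster already has it:

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

  // build.sbt (sbt-assembly 0.11.x style)
  import AssemblyKeys._

  assemblySettings

  name := "my-spark-app"            // hypothetical project name

  scalaVersion := "2.10.3"          // Spark 0.9.x is built for Scala 2.10

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided",
    "org.apache.avro"  %  "avro"       % "1.7.6"
  )

Running `sbt assembly` then produces a single jar under target/scala-2.10/,
and passing that one path to setJars() (or the SparkContext "jars"
parameter) ships every transitive dependency, Avro included, to the
executors.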