You can avoid that by using the constructor that takes a SparkConf, à la:

  val conf = new SparkConf()
  conf.setJars(Seq("avro.jar", ...))
  val sc = new SparkContext(conf)
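Spelled out as a self-contained sketch (the master URL, app name, and second jar below are placeholders for illustration, not values taken from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder values -- substitute your own master URL, app name, and jar paths.
  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")       // assumed standalone master URL
    .setAppName("AvroReadExample")               // hypothetical application name
    .setJars(Seq("avro.jar", "avro-mapred.jar")) // the Scala API takes a Seq; list every jar the executors need
  val sc = new SparkContext(conf)

The Seq should list every jar your job needs on the executors, which is why a single assembly jar (mentioned further down in the thread) is often the simplest thing to pass here.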
On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
> Followup question: the docs to make a new SparkContext require that I know
> where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
>
>
> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Gotcha. The easiest way to get your dependencies to your Executors would
>> probably be to construct your SparkContext with all necessary jars passed
>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
>> a "necessary jar", but it's possible your application also needs to
>> distribute other ones to the cluster.
>>
>> An easy way to make sure all your dependencies get shipped to the cluster
>> is to create an assembly jar of your application, and then you just need to
>> tell Spark about that jar, which includes all your application's transitive
>> dependencies. Maven and sbt both have pretty straightforward ways of
>> producing assembly jars.
>>
>>
>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Thanks for the fast reply.
>>>
>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>>> standalone mode.
>>>
>>>
>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> First issue was because your cluster was configured incorrectly. You
>>>> could probably read 1 file because that was done on the driver node, but
>>>> when it tried to run a job on the cluster, it failed.
>>>>
>>>> Second issue, it seems that the jar containing avro is not getting
>>>> propagated to the Executors. What version of Spark are you running on?
>>>> What deployment mode (YARN, standalone, Mesos)?
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>
>>>> Now I get this:
>>>>
>>>> scala> rdd.first
>>>>
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>
>>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>>
>>
>>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>
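For completeness, the other route Aaron mentions above -- passing the jars (and a Spark home for the workers) straight to the SparkContext constructor instead of going through a SparkConf -- looks roughly like this in Spark 0.9.x. Every value below is an assumption for illustration: the master URL, the parcel-style Spark home, and the assembly jar path are not taken from this thread.

  import org.apache.spark.SparkContext

  // Pre-SparkConf style constructor: SparkContext(master, appName, sparkHome, jars)
  val sc = new SparkContext(
    "spark://master-host:7077",                       // assumed standalone master URL
    "AvroReadExample",                                 // hypothetical application name
    "/opt/cloudera/parcels/SPARK/lib/spark",           // assumed Spark home on the workers (parcel install)
    Seq("target/scala-2.10/myapp-assembly-0.1.jar"))   // hypothetical assembly jar with all transitive deps

An assembly jar built with sbt-assembly or the Maven shade/assembly plugins is usually the single artifact worth listing here, since it already bundles Avro and the application's other transitive dependencies, as suggested earlier in the thread.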