You can avoid that by using the constructor that takes a SparkConf, à la:

  val conf = new SparkConf()
  conf.setJars(Seq("avro.jar", ...))
  val sc = new SparkContext(conf)
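Spelled out as a self-contained sketch (the master URL, app name, and second jar below are placeholders for illustration, not values taken from this thread):

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder values -- substitute your own master URL, app name, and jar paths.
  val conf = new SparkConf()
    .setMaster("spark://master-host:7077")       // assumed standalone master URL
    .setAppName("AvroReadExample")               // hypothetical application name
    .setJars(Seq("avro.jar", "avro-mapred.jar")) // the Scala API takes a Seq; list every jar the executors need
  val sc = new SparkContext(conf)

The Seq should list every jar your job needs on the executors, which is why a single assembly jar (mentioned further down in the thread) is often the simplest thing to pass here.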
On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
> Followup question: the docs to make a new SparkContext require that I know
> where $SPARK_HOME is. However, I have no idea. Any idea where that might be?
>
>
> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Gotcha. The easiest way to get your dependencies to your Executors would
>> probably be to construct your SparkContext with all necessary jars passed
>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro is
>> a "necessary jar", but it's possible your application also needs to
>> distribute other ones to the cluster.
>>
>> An easy way to make sure all your dependencies get shipped to the cluster
>> is to create an assembly jar of your application, and then you just need to
>> tell Spark about that jar, which includes all your application's transitive
>> dependencies. Maven and sbt both have pretty straightforward ways of
>> producing assembly jars.
>>
>>
>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>
>>> Thanks for the fast reply.
>>>
>>> I am running CDH 4.4 with the Cloudera Parcel of Spark 0.9.0, in
>>> standalone mode.
>>>
>>>
>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>
>>>> First issue was because your cluster was configured incorrectly. You
>>>> could probably read 1 file because that was done on the driver node, but
>>>> when it tried to run a job on the cluster, it failed.
>>>>
>>>> Second issue, it seems that the jar containing avro is not getting
>>>> propagated to the Executors. What version of Spark are you running on?
>>>> What deployment mode (YARN, standalone, Mesos)?
>>>>
>>>>
>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>
>>>> Now I get this:
>>>>
>>>> scala> rdd.first
>>>>
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>
>>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>>
>>
>>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>
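For completeness, the other route Aaron mentions above -- passing the jars (and a Spark home for the workers) straight to the SparkContext constructor instead of going through a SparkConf -- looks roughly like this in Spark 0.9.x. Every value below is an assumption for illustration: the master URL, the parcel-style Spark home, and the assembly jar path are not taken from this thread.

  import org.apache.spark.SparkContext

  // Pre-SparkConf style constructor: SparkContext(master, appName, sparkHome, jars)
  val sc = new SparkContext(
    "spark://master-host:7077",                       // assumed standalone master URL
    "AvroReadExample",                                 // hypothetical application name
    "/opt/cloudera/parcels/SPARK/lib/spark",           // assumed Spark home on the workers (parcel install)
    Seq("target/scala-2.10/myapp-assembly-0.1.jar"))   // hypothetical assembly jar with all transitive deps

An assembly jar built with sbt-assembly or the Maven shade/assembly plugins is usually the single artifact worth listing here, since it already bundles Avro and the application's other transitive dependencies, as suggested earlier in the thread.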