Thanks again. Run results here: https://gist.github.com/rjurney/dc0efae486ba7d55b7d5
This time I get a "port already in use" exception on 4040, but it isn't
fatal. Then when I run rdd.first, I get this over and over:

14/06/01 18:35:40 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient memory

On Sun, Jun 1, 2014 at 3:09 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> You can avoid that by using the constructor that takes a SparkConf, a la
>
>   val conf = new SparkConf()
>   conf.setJars(Seq("avro.jar", ...))
>   val sc = new SparkContext(conf)
>
> On Sun, Jun 1, 2014 at 2:32 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> Followup question: the docs for creating a new SparkContext require that
>> I know where $SPARK_HOME is. However, I have no idea. Any idea where that
>> might be?
>>
>> On Sun, Jun 1, 2014 at 10:28 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Gotcha. The easiest way to get your dependencies to your Executors would
>>> probably be to construct your SparkContext with all necessary jars passed
>>> in (as the "jars" parameter), or inside a SparkConf with setJars(). Avro
>>> is a "necessary jar", but it's possible your application also needs to
>>> distribute other ones to the cluster.
>>>
>>> An easy way to make sure all your dependencies get shipped to the
>>> cluster is to create an assembly jar of your application; then you just
>>> need to tell Spark about that one jar, which includes all of your
>>> application's transitive dependencies. Maven and sbt both have pretty
>>> straightforward ways of producing assembly jars.
>>>
>>> On Sat, May 31, 2014 at 11:23 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>
>>>> Thanks for the fast reply.
>>>>
>>>> I am running CDH 4.4 with the Cloudera parcel of Spark 0.9.0, in
>>>> standalone mode.
>>>>
>>>> On Saturday, May 31, 2014, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>>> The first issue was because your cluster was configured incorrectly.
>>>>> You could probably read one file because that was done on the driver
>>>>> node, but when it tried to run a job on the cluster, it failed.
>>>>>
>>>>> The second issue is that the jar containing Avro does not seem to be
>>>>> getting propagated to the Executors. What version of Spark are you
>>>>> running, and in what deployment mode (YARN, standalone, Mesos)?
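For reference, here is a fuller sketch of the SparkConf route Aaron
describes, with the port and memory symptoms from the top of the thread
folded in. The master URL, app name, jar paths, and memory value are
placeholders, not details from this thread; note that setJars in the Scala
API takes a Seq:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://hivecluster2:7077")  // assumption: standalone master URL; use your own
    .setAppName("avro-reader")               // hypothetical app name
    .setJars(Seq("/path/to/avro.jar",        // jars are shipped to every executor
                 "/path/to/your-app-assembly.jar"))
    .set("spark.executor.memory", "2g")      // executors need enough free memory to accept tasks
    .set("spark.ui.port", "4041")            // optional: sidestep the port-4040 collision when
                                             // another driver's UI is already bound to 4040
  val sc = new SparkContext(conf)

This constructor needs no $SPARK_HOME at all. The "Initial job has not
accepted any resources" warning usually means no worker has registered
with the master, or no registered worker has enough free memory or cores
for the job, so the standalone master's web UI (port 8080 by default) is
the first place to check.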
>>>>> On Sat, May 31, 2014 at 9:37 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>>
>>>>> Now I get this:
>>>>>
>>>>> scala> rdd.first
>>>>>
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 4 (first at <console>:41) with 1 output partitions (allowLocal=true)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 4 (first at <console>:41)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Computing the requested partition locally
>>>>> 14/05/31 21:36:28 INFO rdd.HadoopRDD: Input split: hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00000.avro:0+3864
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Job finished: first at <console>:41, took 0.037371256 s
>>>>> 14/05/31 21:36:28 INFO spark.SparkContext: Starting job: first at <console>:41
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Got job 5 (first at <console>:41) with 16 output partitions (allowLocal=true)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Final stage: Stage 5 (first at <console>:41)
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Missing parents: List()
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37), which has no missing parents
>>>>> 14/05/31 21:36:28 INFO scheduler.DAGScheduler: Submitting 16 missing tasks from Stage 5 (HadoopRDD[0] at hadoopRDD at <console>:37)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 16 tasks
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:0 as TID 92 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:0 as 1294 bytes in 1 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:3 as TID 93 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:3 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:1 as TID 94 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:1 as 1294 bytes in 1 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:2 as TID 95 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:2 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:4 as TID 96 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:4 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:6 as TID 97 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:6 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:5 as TID 98 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:5 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:8 as TID 99 on executor 4: hivecluster4 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:8 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:7 as TID 100 on executor 0: hivecluster6.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:7 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:10 as TID 101 on executor 3: hivecluster1.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:10 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:14 as TID 102 on executor 2: hivecluster3 (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:14 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:9 as TID 103 on executor 1: hivecluster5.labs.lan (NODE_LOCAL)
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Serialized task 5.0:9 as 1294 bytes in 0 ms
>>>>> 14/05/31 21:36:28 INFO scheduler.TaskSetManager: Starting task 5.0:11 as TID 104 on executor 4: hivecluster4 (N
>>>>
>>>> --
>>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
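Regarding Aaron's assembly-jar suggestion earlier in the thread, a minimal
sbt sketch of one way to do it, assuming the sbt-assembly plugin (the
plugin and library versions and the project name below are assumptions;
Maven users can get the same effect with the shade plugin). Marking
spark-core as "provided" keeps Spark itself out of the assembly, since the
cluster already has it:

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

  // build.sbt (sbt-assembly 0.11.x style)
  import AssemblyKeys._

  assemblySettings

  name := "my-spark-app"            // hypothetical project name

  scalaVersion := "2.10.3"          // Spark 0.9.x is built for Scala 2.10

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided",
    "org.apache.avro"  %  "avro"       % "1.7.6"
  )

Running `sbt assembly` then produces a single jar under target/scala-2.10/,
and passing that one path to setJars() (or the SparkContext "jars"
parameter) ships every transitive dependency, Avro included, to the
executors.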