I was getting this error after upgrading my nodes to Python2.7. I suspected
the problem was due to conflicting Python versions, but my 2.7 install
seemed correct on my nodes.
I set the PYSPARK_PYTHON variable to my 2.7 install (as I still had 2.6
installed and linked to the 'python' executable), which resolved the error.
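In case it helps anyone else, this is the kind of setting involved (the interpreter path below is an example; adjust it to wherever your 2.7 lives, and it has to match on the driver and every worker before launching pyspark or spark-submit):

```shell
# Point PySpark at the 2.7 interpreter instead of whatever 'python' resolves to.
# The path is an example; use your own 2.7 location on every node.
export PYSPARK_PYTHON=/usr/bin/python2.7

# Sanity check on each node (uncomment once the path is right):
# $PYSPARK_PYTHON --version
```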
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> ...
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:...)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> The lambda passed to flatMap() returns a list of tuples, and take() works
> fine on the flatMap() output alone.
>
> Where would I start to troubleshoot this error?
>
> The error output includes m
I've done a whole bunch of things to this RDD, and now when I try to
sortByKey(), this is what I get:
>>> flattened_po.flatMap(lambda x: map_to_database_types(x)).sortByKey()
14/02/28 23:18:41 INFO spark.SparkContext: Starting job: sortByKey at :1
14/02/28 23:18:41 INFO scheduler.DAGScheduler: Got j
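One way to start troubleshooting this locally, before involving the cluster: sortByKey() needs every element to be a (key, value) 2-tuple with mutually comparable keys, while take() only materializes a few elements, so take() can pass where the full sort fails. A minimal pure-Python check of the lambda's output shape (map_to_database_types below is a hypothetical stand-in for the poster's function):

```python
# Hypothetical stand-in for the poster's map_to_database_types():
# emit (field_name, type_name) pairs for each field in a record.
def map_to_database_types(record):
    return [(name, type(value).__name__) for name, value in record.items()]

# Mimic flatMap() over a small sample of records locally.
records = [{"id": 1, "name": "a"}, {"price": 2.5}]
flattened = [pair for rec in records for pair in map_to_database_types(rec)]

# The shape sortByKey() depends on: every element a (key, value) 2-tuple.
assert all(isinstance(p, tuple) and len(p) == 2 for p in flattened)

# And the keys must actually sort; mixed or incomparable keys raise here,
# which is exactly the kind of element that take() on a few rows can miss.
print(sorted(flattened, key=lambda kv: kv[0]))
```

If this passes on a representative sample but the cluster job still dies, the mismatch is more likely environmental (e.g. different Python versions on the workers) than in the data.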