Re: Task not serializable: java.io.NotSerializableException: org.json4s.Serialization$$anon$1

2016-07-19 Thread joshuata
It looks like the problem is that the parse function is not serializable. This
is most likely because the formats value defined in the ParseJson object is not
serializable, so the closure that captures it cannot be shipped to the cluster.
Generally this kind of problem can be solved by moving the variable inside the
closure, so that it is created on each worker instead of being serialized from
the driver.
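
A minimal sketch of that fix, assuming the input is an RDD[String] of JSON
records (the names parseJson/lines and the Map[String, String] target type are
mine, not from the original code):

  import org.apache.spark.rdd.RDD
  import org.json4s.{DefaultFormats, Formats}
  import org.json4s.jackson.JsonMethods.parse

  // Illustrative only: extract[Map[String, String]] stands in for whatever
  // target type the real ParseJson object uses.
  def parseJson(lines: RDD[String]): RDD[Map[String, String]] =
    lines.mapPartitions { iter =>
      // Declared inside the closure, so formats is created on each worker
      // instead of being serialized from the driver.
      implicit val formats: Formats = DefaultFormats
      iter.map(line => parse(line).extract[Map[String, String]])
    }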

In this specific instance, it makes far more sense to use the JSON data source
provided by newer versions of Spark.
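
For reference, a rough sketch of the data source approach (the path and session
names are placeholders; on Spark 1.x you would go through sqlContext.read
instead of spark.read):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("json-example")
    .getOrCreate()

  // One JSON record per line; the schema is inferred automatically.
  val df = spark.read.json("path/to/records.json")
  df.printSchema()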






Re: how to setup the development environment of spark with IntelliJ on ubuntu

2016-07-19 Thread joshuata
I have found the easiest way to set up a development environment is to use the
Databricks sbt-spark-package plugin (assuming you are using Scala + sbt). You
simply add the plugin to your project/plugins.sbt file and add sparkVersion to
your build.sbt file; it automatically loads the packages needed to build your
application.

It also provides the sbt console command, which sets up a local Spark REPL to
prototype code against.
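
Roughly, the two additions look like the sketch below. The plugin coordinates
and version numbers here are from memory and may be out of date, so check the
plugin's README for the current ones:

  // project/plugins.sbt
  resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
  addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.5")

  // build.sbt
  sparkVersion := "2.0.0"
  sparkComponents ++= Seq("core", "sql")   // pulls in spark-core and spark-sql

  // After that, `sbt console` drops you into a REPL with Spark on the classpath.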






Execute function once on each node

2016-07-18 Thread joshuata
I am working on a spark application that requires the ability to run a
function on each node in the cluster. This is used to read data from a
directory that is not globally accessible to the cluster. I have tried
creating an RDD with n elements and n partitions so that it is evenly
distributed among the n nodes, and then mapping a function over the RDD.
However, the runtime makes no guarantee that each partition will be placed on a
separate node, so the function can run multiple times on the same node and
never run on another.
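
For concreteness, a sketch of the approach described above (the names are mine,
not from the actual application):

  import org.apache.spark.SparkContext

  // Attempt to run `setup` once per node by creating n single-element
  // partitions and touching each one. As noted above, Spark gives no guarantee
  // that the n partitions land on n distinct nodes, so this can fire twice on
  // one machine and zero times on another.
  def tryRunOnEachNode(sc: SparkContext, n: Int)(setup: () => Unit): Unit =
    sc.parallelize(1 to n, numSlices = n).foreachPartition(_ => setup())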

I have looked through the documentation and source code for both RDDs and
the scheduler, but I haven't found anything that will do what I need. Does
anybody know of a solution I could use?


