Variables outside of mapPartitions scope

2014-05-16 Thread pedro
I am working on some code which uses mapPartitions. It's working great, except
when the function passed to mapPartitions uses a variable that refers to
something outside its scope (for example, a variable declared immediately
before the mapPartitions call). When this happens, I get a "Task not
serializable" error. I wanted to reference a variable which had been
broadcast and was ready to use within that closure.

Seeing that, I attempted another solution: storing the broadcast variable
in an object (a singleton). It serialized fine, but when I ran it on a
cluster, any reference to it threw a NullPointerException. My presumption is
that the workers were not getting their objects updated for some reason,
despite the variable being set as a broadcast variable. My guess is that the
workers get the serialized function, but Spark doesn't know to serialize the
object, including the things it references, so the copied reference becomes
invalid.

What would be a good way to solve my problem? Is there a way to reference a
broadcast variable by name rather than through a variable?
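
For reference, here is a minimal sketch of the pattern being attempted, assuming the broadcast handle is held in a local val inside a method of an object, so that the closure captures only the handle itself. The names BroadcastInMapPartitions, addOffsets and offsets are hypothetical and not from the original post. A plausible reason for the null in the singleton attempt is that a Scala object is initialised separately in each executor JVM, so a field assigned on the driver is not shipped with the closure; the thread does not confirm this.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BroadcastInMapPartitions {
  // Hypothetical helper: looks up each key in a broadcast table.
  def addOffsets(sc: SparkContext,
                 data: RDD[String],
                 offsets: Map[String, Int]): RDD[Int] = {
    // Local val: the closure below captures only this Broadcast handle,
    // which is serializable, and not `sc` or any enclosing instance.
    val bc = sc.broadcast(offsets)
    data.mapPartitions { iter =>
      // Read the broadcast value once per partition, on the executor.
      val table = bc.value
      iter.map(key => table.getOrElse(key, 0))
    }
  }
}
```

Keeping the handle in a local val, rather than in a field of a class or singleton, avoids relying on any state that exists only in the driver JVM.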



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Variables outside of mapPartitions scope

2014-05-13 Thread ankurdave
In general, you can find out exactly what's not serializable by adding
-Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.

Since a this reference to the enclosing class is often what's causing the
problem, a general workaround is to move the mapPartitions call to a static
method where there is no this reference. This transforms this:

  class A {
    def f() = rdd.mapPartitions(iter => ...)
  }

into this:

  class A {
    def f() = A.helper(rdd)
  }
  object A {
    def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...)
  }
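
The effect of the this reference can be checked without a cluster. The sketch below uses plain Java serialization as a stand-in for what happens when a closure is shipped to the workers; Job, SerializationCheck, viaField and viaHelper are hypothetical names chosen for illustration, and the class is assumed not to be Serializable.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // Returns true if Java serialization accepts the object graph.
  def serializes(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}

// Deliberately NOT Serializable, like a typical driver-side class.
class Job {
  val offset = 10

  // Reads the field through `this`, so the closure captures the whole Job.
  def viaField: Int => Int = x => x + offset

  // Passes the value to the companion object; the closure built there
  // captures only the Int parameter.
  def viaHelper: Int => Int = Job.helper(offset)
}

object Job {
  def helper(offset: Int): Int => Int = x => x + offset
}
```

Under these assumptions, SerializationCheck.serializes(new Job().viaField) is false while the viaHelper variant is true, which matches the workaround described above.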





Re: Variables outside of mapPartitions scope

2014-05-13 Thread DB Tsai
Scala's for-loop is not just looping; it is not a native loop at the bytecode
level. The compiler rewrites it into method calls such as foreach that take a
closure, so it creates a couple of objects at runtime and performs a
truckload of method calls on them. As a result, if you refer to variables
outside the for-loop, the whole for-loop closure and any variable it uses
have to be serializable. Since the for-loop itself is serializable in Scala,
I guess you have something non-serializable inside the for-loop.

The while-loop in Scala is native, so you won't have this issue if you use a
while-loop.
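
To make the difference concrete, here is a small sketch of the three loop forms. LoopForms and its method names are hypothetical; the point is only that the for form is compiled as the foreach form, with the loop body turned into a closure, whereas the while form involves no closure.

```scala
object LoopForms {
  // A for-loop over a range...
  def withFor(n: Int): Int = {
    var sum = 0
    for (i <- 1 to n) sum += i
    sum
  }

  // ...is rewritten by the compiler into a foreach call taking a closure.
  def withForeach(n: Int): Int = {
    var sum = 0
    (1 to n).foreach(i => sum += i)
    sum
  }

  // A while-loop compiles to a plain jump-based loop with no closure object.
  def withWhile(n: Int): Int = {
    var sum = 0
    var i = 1
    while (i <= n) { sum += i; i += 1 }
    sum
  }
}
```

All three return the same result; they differ only in what the loop body can end up capturing.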


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, May 9, 2014 at 1:13 PM, pedro ski.rodrig...@gmail.com wrote:

 Right now I am not using any class variables (references to this). All my
 variables are created within the scope of the method I am running.

 I did more debugging and found this strange behavior:

 variables here
 for loop
   mapPartitions call
     use variables here
   end mapPartitions
 endfor

 This results in a serialization error, but the following does not:

 variables here
 for loop
   create new references to variables here
   mapPartitions call
     use new reference variables here
   end mapPartitions
 endfor






Re: Variables outside of mapPartitions scope

2014-05-12 Thread pedro
Right now I am not using any class variables (references to this). All my
variables are created within the scope of the method I am running.

I did more debugging and found this strange behavior:

variables here
for loop
  mapPartitions call
    use variables here
  end mapPartitions
endfor

This results in a serialization error, but the following does not:

variables here
for loop
  create new references to variables here
  mapPartitions call
    use new reference variables here
  end mapPartitions
endfor
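
One situation in which these two shapes would behave differently, offered as an assumption since the thread does not show the actual code, is when the outer variable is a non-serializable object and the new reference is a plain value copied out of it. The sketch below uses Java serialization in place of Spark's closure shipping; LocalCopyDemo, Settings, direct and copied are hypothetical names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object LocalCopyDemo {
  // Hypothetical stand-in for a driver-side object that is not Serializable.
  class Settings(val threshold: Int)

  // Returns true if Java serialization accepts the object graph.
  def serializes(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // First shape: each closure refers to `settings` directly and so
  // captures the whole non-serializable object.
  def direct(settings: Settings): Seq[Int => Boolean] =
    for (i <- 1 to 2) yield (x: Int) => x > settings.threshold + i

  // Second shape: a new local reference is created inside the loop, and
  // each closure captures only that Int.
  def copied(settings: Settings): Seq[Int => Boolean] =
    for (i <- 1 to 2) yield {
      val limit = settings.threshold + i
      (x: Int) => x > limit
    }
}
```

With these definitions, none of the closures from direct pass the serializes check, while all of those from copied do.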


