Hey Sandy,

Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to 
print the contents of the objects being serialized. One way to pass it (just a 
sketch -- adjust for however you launch your driver) is:
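  spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true"

In addition, something else that helps is to do the following: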

{
  val _arr = arr
  models.map(... _arr ...)
}

Basically, copy the global variable into a local one. Then the closure no 
longer needs a field access on the interpreter-generated object that contains 
the line initializing arr, so it no longer holds a reference to that object 
(or to anything else that object references).
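
For example, here's a rough sketch of the difference (the map body is just a 
stand-in; the important part is which variable the closure captures):

  // captures only the local _arr, so only the array gets serialized
  val lengths = {
    val _arr = arr
    models.map(m => _arr.length)
  }

  // whereas referencing arr directly captures the REPL line object that
  // holds arr, and everything that object points to comes along with it:
  // models.map(m => arr.length)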

I'm really confused as to why the Array.ofDim version would be so large, by 
the way, but are you sure you haven't flipped the dimensions around (e.g. 
should it be 5 x 1867 instead of 1867 x 5)? A 5-element double array will 
consume more than 5*8 bytes on the heap (probably something like 60 at least, 
once you count the object header), and an array of those also holds a pointer 
to each one, so I'd expect 1867 of them to take up somewhat more than 80 KB 
(80 KB is already very close to the raw 1867*5*8 bytes of data).
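
If you want to sanity-check that, something like the following should come out 
on the order of 100 KB, nowhere near 400 MB (just a quick sketch using plain 
Java serialization to gauge the size; serializedSize is a throwaway helper, not 
a Spark API):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}

  // throwaway helper: how many bytes does plain Java serialization produce?
  def serializedSize(o: AnyRef): Int = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(o)
    out.close()
    buf.size()
  }

  serializedSize(Array.ofDim[Double](1867, 5))  // roughly 80-130 KB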

Matei

> On Nov 10, 2014, at 1:01 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> 
> I'm experiencing some strange behavior with closure serialization that is 
> totally mind-boggling to me.  It appears that two arrays of equal size take 
> up vastly different amounts of space inside closures if they're generated in 
> different ways.
> 
> The basic flow of my app is to run a bunch of tiny regressions using Commons 
> Math's OLSMultipleLinearRegression and then reference a 2D array of the 
> results from a transformation.  I was running into OOME's and 
> NotSerializableExceptions and tried to get closer to the root issue by 
> calling the closure serializer directly.
>   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
> 
> The result array is 1867 x 5. Serialized, it comes to about 80K bytes, which 
> seems about right:
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
>   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 
> cap=80027]
> 
> If I reference it from a simple function:
>   scala> def func(x: Long) = arr.length
>   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
> I get a NotSerializableException.
> 
> If I take pains to create the array using a loop:
>   scala> val arr = Array.ofDim[Double](1867, 5)
>   scala> for (s <- 0 until models.length) {
>   | arr(s) = models(s).estimateRegressionParameters()
>   | }
> Serialization works, but the serialized closure for the function is a 
> whopping 400MB.
> 
> If I pass in an array of the same length that was created in a different way, 
> the size of the serialized closure is only about 90K, which seems about right.
> 
> Naively, it seems like somehow the history of how the array was created is 
> having an effect on what happens to it inside a closure.
> 
> Is this expected behavior?  Can anybody explain what's going on?
> 
> any insight very appreciated,
> Sandy


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
