I am hoping someone can confirm this is a bug and/or provide a solution. I
am trying to serialize an LDA model to disk for later use, but after
deserialization the model is not fully functional: calling transform() on
the reloaded model throws a NullPointerException. Here is a minimal
example (just run it in 'spark-shell') that exercises the behavior:

https://gist.github.com/bjedwards/14e9bb876381910bc525063bee342b41
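
In case the link rots, this is roughly what the gist exercises (the toy
data and path here are just placeholders; 'spark' is the SparkSession that
spark-shell provides):

  import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}
  import org.apache.spark.ml.linalg.Vectors

  // Tiny toy corpus with a 3-word vocabulary.
  val df = spark.createDataFrame(Seq(
    (0L, Vectors.dense(1.0, 0.0, 3.0)),
    (1L, Vectors.dense(0.0, 2.0, 1.0))
  )).toDF("id", "features")

  val model = new LDA().setK(2).setMaxIter(5).fit(df)
  model.transform(df).show()          // works fine before saving

  model.write.overwrite().save("/tmp/lda-model")
  val loaded = LocalLDAModel.load("/tmp/lda-model")
  loaded.transform(df).show()         // throws NullPointerException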

The problem is here:

https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L456
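
For context, the line at that location reads (give or take formatting):

  val transformer =
    oldLocalModel.getTopicDistributionMethod(sparkSession.sparkContext)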

The issue is that sparkSession is declared @transient, so it is not
serialized, and there is no check that it is non-null before use. Oddly, I
can't even find where it is set in the first place. I think that line
should read:

...
val transformer =
  oldLocalModel.getTopicDistributionMethod(dataset.sparkSession.sparkContext)
...

i.e. sparkSession -> dataset.sparkSession, as in all the other places
where the dataset's SparkSession is used.

As a workaround, the model becomes functional again if I patch up the
loaded instance via reflection (the last bit of the gist).
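
For anyone who hits this before a fix lands, the reflection patch looks
roughly like this (the field name is taken from the branch-2.1 source and
may differ in other builds; 'df' is reused from the example above):

  import org.apache.spark.ml.clustering.{LDAModel, LocalLDAModel}
  import org.apache.spark.sql.SparkSession

  // Restore the @transient sparkSession field on the loaded model.
  def patchSparkSession(model: LDAModel, spark: SparkSession): Unit = {
    val field = classOf[LDAModel].getDeclaredField("sparkSession")
    field.setAccessible(true)
    field.set(model, spark)
  }

  val loaded = LocalLDAModel.load("/tmp/lda-model")
  patchSparkSession(loaded, spark)   // 'spark' from spark-shell
  loaded.transform(df).show()        // no longer throws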

Is this a bug? Or are LocalLDAModel instances not really meant to be
serializable?

Ben Edwards
Postdoctoral Researcher
IBM Research
