The byte array turns out to be ObjectOutputStream-serialized data: it deserializes to a Tuple2[ParallelCollectionRDD, Function2].
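One way to verify that kind of claim is to deserialize the bytes by hand. The helper below is a purely illustrative diagnostic, not from the thread; the name inspectBytes is hypothetical:

    import java.io.{ByteArrayInputStream, ObjectInputStream}

    // Hypothetical diagnostic: deserialize an opaque Array[Byte] and report
    // the runtime class of whatever object was serialized into it.
    def inspectBytes(bytes: Array[Byte]): Unit = {
      val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
      try {
        val obj = in.readObject()
        println(obj.getClass.getName) // e.g. scala.Tuple2
        obj match {
          case (a, b) => println(s"${a.getClass.getName}, ${b.getClass.getName}")
          case _      => ()
        }
      } finally in.close()
    }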
What, then, should be done differently in the broadcast code (which follows the structure of an example taken from mllib)?

    assert(crows.isInstanceOf[Array[MVector]])
    val bcRows = sc.broadcast(crows)
    ..
    val arrayVect = bcRows.value

2014-10-30 7:42 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> As a template for creating a broadcast variable, the following code
> snippet within mllib was used:
>
>     val bcIdf = dataset.context.broadcast(idf)
>     dataset.mapPartitions { iter =>
>       val thisIdf = bcIdf.value
>
> The new code follows that model:
>
>     import org.apache.spark.mllib.linalg.{Vector => MVector}
>     ..
>     assert(crows.isInstanceOf[Array[MVector]])
>     val bcRows = sc.broadcast(crows)
>     val GU = mat.rows.zipWithIndex.mapPartitions { case dataIter =>
>       val arrayVect = bcRows.value // seen in the debugger to be of type Array[Byte] .. ??
>
> That last line is unhappy:
>
>     java.lang.ClassCastException: [B cannot be cast to
>     [Lorg.apache.spark.mllib.linalg.Vector;
>
> So the compiler is aware that the return type of the broadcast "value"
> method should be an Array[Vector] (which it should be). However, the actual
> type is Array[Byte]. Any insights on this?
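For comparison, here is a minimal self-contained sketch of the same broadcast pattern that round-trips an Array[MVector] cleanly. All names and data are illustrative, not from the original job; it assumes the mllib-era Spark API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.{Vector => MVector, Vectors}

    // Illustrative sketch, not the original code: broadcast a local array of
    // mllib vectors and read it back inside a task, as the idf example does.
    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

        val crows: Array[MVector] =
          Array(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))
        assert(crows.isInstanceOf[Array[MVector]])

        // sc.broadcast infers Broadcast[Array[MVector]] here.
        val bcRows = sc.broadcast(crows)

        val sums = sc.parallelize(0 until 4).mapPartitions { iter =>
          // .value should hand back the Array[MVector], not Array[Byte].
          val arrayVect: Array[MVector] = bcRows.value
          iter.map(i => arrayVect(i % arrayVect.length).toArray.sum)
        }
        println(sums.collect().mkString(", "))
        sc.stop()
      }
    }

If a sketch like this round-trips cleanly while the real job still sees Array[Byte], the difference presumably lies in how crows is produced or in the serializer configuration rather than in the broadcast call itself.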