Hi, I need to batch the values in my final RDD before writing out to hdfs. The idea is to batch multiple "rows" in a protobuf and write those batches out - mostly to save some space as a lot of metadata is the same. e.g. 1,2,3,4,5,6 just batch them (1,2), (3,4),(5,6) and save three records instead of 6
What I"m doing is that I'm using mapPartitions by using the grouped function of the iterator by giving it a groupSize. val protoRDD:RDD[MyProto] = rdd.mapPartitions[Profiles](_.grouped(groupSize).map(seq =>{ val profiles = MyProto(...) seq.foreach(x =>{ val row = new Row(x._1.toString) row.setFloatValue(x._2) profiles.addRow(row) }) profiles }) ) I haven't been able to test it out because of a separate issue (protobuf version mismatch - in a different thread) - but i'm hoping it will work. Is there a better/straight-forward way of doing this? Thanks Vipul