I have done this kind of thing successfully using Hadoop serialization, e.g. having SessionContainer extend Writable and override write/readFields. I didn't try Kryo.
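In case it's useful, the pattern looks roughly like this (a minimal sketch; the two fields here are made up, your SessionContainer will have its own):

    import java.io.{DataInput, DataOutput}
    import org.apache.hadoop.io.Writable

    class SessionContainer(var sessionId: String, var eventCount: Int)
        extends Writable {

      def this() = this("", 0) // Hadoop needs a no-arg constructor

      // write and read the fields in the same order
      override def write(out: DataOutput): Unit = {
        out.writeUTF(sessionId)
        out.writeInt(eventCount)
      }

      override def readFields(in: DataInput): Unit = {
        sessionId = in.readUTF()
        eventCount = in.readInt()
      }
    }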
It's fairly straightforward; I'll see if I can dig up the code if you really need it. I remember that I had to add a map transformation, or something to that effect, because Hadoop's record reader sometimes hands you a mutated reference to a previously read object rather than a new one :-( (see the P.S. below for a sketch).

Also, I don't think you need to parallelize sampledSessions in your code snippet. If you use sample (which returns an RDD) rather than takeSample (which returns a plain Array on the driver), you can save the result directly:

    val sampledSessions = sc.sequenceFile[Text, SessionContainer](inputPath)
      .sample(false, 0.001, 0) // a fraction, not an absolute count; tune it
                               // to get roughly the 1000 records you want
    sampledSessions.saveAsSequenceFile("sampledSessions")

How many small files are you getting? You should get one output file per partition, which is usually not that high.

-----
Madhu
https://www.linkedin.com/in/msiddalingaiah
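P.S. The map transformation I mentioned looked roughly like this (a sketch from memory, not the code I actually used; it uses mapPartitions so the Hadoop Configuration isn't shipped in the closure, and WritableUtils.clone makes the defensive copy):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{Text, WritableUtils}

    val sessions = sc.sequenceFile[Text, SessionContainer](inputPath)
      .mapPartitions { iter =>
        // create the Configuration per partition; it isn't serializable
        val conf = new Configuration()
        iter.map { case (key, value) =>
          // copy each record, since Hadoop reuses the same Writable instances
          (new Text(key), WritableUtils.clone(value, conf))
        }
      }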