Hello, I'm looking at the Dataset API in 1.6.1 and also in the upcoming 2.0. However, it does not seem to work with Avro data types:
```scala
object Datasets extends App {
  val conf = new SparkConf()
  conf.setAppName("Dataset")
  conf.setMaster("local[2]")
  conf.setIfMissing("spark.serializer", classOf[KryoSerializer].getName)
  conf.setIfMissing("spark.kryo.registrator", classOf[DatasetKryoRegistrator].getName)

  val sc = new SparkContext(conf)
  val sql = new SQLContext(sc)
  import sql.implicits._

  implicit val encoder = Encoders.kryo[MyAvroType]
  val data = sql.read.parquet("path/to/data").as[MyAvroType]

  var c = 0
  // BUG here
  val sizes = data.mapPartitions { iter =>
    List(iter.size).iterator
  }.collect().toList
  println(c)
}

class DatasetKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(
      classOf[MyAvroType],
      AvroSerializer.SpecificRecordBinarySerializer[MyAvroType])
  }
}
```

I'm using chill-avro's Kryo serializer for Avro types, and I've tried `Encoders.kryo` as well as `bean` and `javaSerialization`, but none of them works. The error seems to be that the generated code does not compile with Janino. Tested in 1.6.1 and the 2.0.0-preview. Any idea?

--
*JU Han*
Software Engineer @ Teads.tv
+33 0619608888
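P.S. For reference, here is a minimal, Spark-free sketch of the per-partition counting pattern I'm attempting (the data is hypothetical stand-in strings rather than `MyAvroType`). It only shows the shape of the `mapPartitions` body; note that `Iterator.size` consumes the iterator, so nothing else can be read from it afterwards:

```scala
object PartitionSizeSketch extends App {
  // Stand-in for partitions of records; in the real job these would be
  // Iterators of MyAvroType handed to mapPartitions.
  val partitions = Seq(Iterator("a", "b", "c"), Iterator("d", "e"))

  // Same shape as the mapPartitions body: emit one count per partition.
  // iter.size exhausts the iterator, which is fine here because the
  // count is the only thing we need from each partition.
  val sizes = partitions
    .map { iter => List(iter.size).iterator }
    .flatMap(_.toList)

  println(sizes) // List(3, 2)
}
```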