I have a bit of a strange situation:
*****************
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper, AvroKey}
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.{NullWritable, WritableUtils}
val path = "/path/to/data.avro"
val rdd = sc.newAPIHadoopFile(path,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]], classOf[NullWritable])
rdd.take(10).foreach(x => println(x._1.datum()))
*****************
This returns the right number of records, and if I look at the contents
of the RDD I see the individual records as Tuple2s. However, when I
println each one as shown above, I get the same record printed every
time.
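To check my suspicion that they're all literally one instance, I figure
something like this should print true (assuming local mode, where take()
hands the references straight back without serializing them):
*****************
// Sanity check: if the record reader is recycling one object, every
// datum returned by take() should be the very same instance.
val datums = rdd.take(10).map(_._1.datum())
println(datums.forall(_ eq datums.head))
*****************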
Apparently this happens because the Hadoop record reader reuses a single
object for every record it returns, so the RDD ends up holding many
references to the same mutated instance. It seems I need to clone the
object before I use it. However, if I try to clone it (from the
spark-shell console), I get:
*****************
rdd.take(10).foreach( x => {
  val clonedDatum = x._1.datum().clone()
  println(clonedDatum)
})
<console>:37: error: method clone in class Object cannot be accessed in
org.apache.avro.generic.GenericRecord
Access to protected method clone not permitted because
prefix type org.apache.avro.generic.GenericRecord does not conform to
class $iwC where the access take place
val clonedDatum = x._1.datum().clone()
*****************
So, how can I clone the datum?
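The closest thing I've found in the Avro API is GenericData.deepCopy. My
guess is the copy has to happen inside a map over the RDD, before the
reader recycles the object, so something like this (untested):
*****************
val copies = rdd.map { case (k, _) =>
  val datum = k.datum()
  // Copy while still inside the iterator, before the reader
  // overwrites the object with the next record.
  GenericData.get().deepCopy(datum.getSchema, datum)
}
copies.take(10).foreach(println)
*****************
I'm not sure the copies will survive being shipped back to the driver,
though, since GenericData.Record isn't Serializable.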
It seems I'm not the only one who has run into this problem:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't
figure out how to fix it in my case without hacking away like the person in
the linked issue did.
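In the meantime, the one workaround that does seem safe for debugging is
to render each record to a String before the object gets reused:
*****************
// Crude workaround: toString is evaluated per record inside the
// iterator, so the output is captured before the object is recycled.
rdd.map(_._1.datum().toString).take(10).foreach(println)
*****************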
Suggestions?
--
Chris Miller