Hi,
I have a file containing Avro GenericRecords; for debugging purposes, let's read
one particular field, "date_time", and print it to the screen:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "My Spark Context")
val job = new org.apache.hadoop.mapreduce.Job
// input data:
val avrofile = "debug-data/records.avro"
// Load the Avro file; the key of each input pair carries the record.
val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration).map(x => x._1.datum)
rdd.foreach(x => println(x.get("date_time")))
This is all jolly good; the output is:
2013-10-21T00:19:25-04:00
2013-10-21T00:12:39-04:00
2013-10-21T00:08:09-04:00
2013-10-21T00:12:54-04:00
[...]
When I change the loading statement to cache the RDD, like this:
val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration).map(x => x._1.datum).cache
rdd.foreach(x => println(x.get("date_time")))
then all records have the same date_time; in fact, they are all identical
records:
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
[...]
Any idea what's going on here?
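My only guess so far: Hadoop input formats typically reuse a single mutable
record object across rows, so after caching, the RDD may be holding many
references to one object that was last overwritten. If that is what's
happening, copying each datum before caching should restore the expected
behaviour. A minimal sketch, assuming the datum really is a
GenericData.Record (so its copy constructor applies):
import org.apache.avro.generic.{GenericData, GenericRecord}

val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration)
  // Deep-copy each datum so the cache stores distinct records
  // instead of repeated references to the reader's reused object.
  .map(x => new GenericData.Record(x._1.datum.asInstanceOf[GenericData.Record], true))
  .cache
rdd.foreach(x => println(x.get("date_time")))
If the copy makes the duplicates go away, I suppose the input format's
object reuse is the culprit, but I'd appreciate confirmation.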
Best,
Robert