Hi,
I have a file containing Avro GenericRecords; for debugging purposes, let's read
one particular field, "date_time", and print it to the screen:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "My Spark Context")
val job = new org.apache.hadoop.mapreduce.Job
// input data:
val avrofile = "debug-data/records.avro"
// Load the Avro file; the key of each input pair carries the record.
val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration).map(x => x._1.datum)
rdd.foreach(x => println(x.get("date_time")))
This is all jolly good; the output is:
2013-10-21T00:19:25-04:00
2013-10-21T00:12:39-04:00
2013-10-21T00:08:09-04:00
2013-10-21T00:12:54-04:00
[...]
When I change the loading statement to cache the RDD, like this:
val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration).map(x => x._1.datum).cache
rdd.foreach(x => println(x.get("date_time")))
then all records have the same date_time; in fact, they are all identical
records:
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
2013-10-21T00:01:29-04:00
[...]
Any idea what's going on here?
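My only guess so far: Hadoop input formats typically reuse a single mutable
record object across rows, so after caching, the RDD may be holding many
references to one object that was last overwritten. If that is what's
happening, copying each datum before caching should restore the expected
behaviour. A minimal sketch, assuming the datum really is a
GenericData.Record (so its copy constructor applies):
import org.apache.avro.generic.{GenericData, GenericRecord}

val rdd = sc.newAPIHadoopFile(
  avrofile,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  job.getConfiguration)
  // Deep-copy each datum so the cache stores distinct records
  // instead of repeated references to the reader's reused object.
  .map(x => new GenericData.Record(x._1.datum.asInstanceOf[GenericData.Record], true))
  .cache
rdd.foreach(x => println(x.get("date_time")))
If the copy makes the duplicates go away, I suppose the input format's
object reuse is the culprit, but I'd appreciate confirmation.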
Best,
Robert