The hadoopFile method reuses the Writable object between records that it reads 
by default, so you get back the same object. You should clone them if you need 
to cache them. This is kind of an unintuitive behavior that we’ll probably need 
to turn off by default; it’s helpful when you don’t need to cache the objects 
because it reduces allocation.

Matei

On Dec 11, 2013, at 8:51 PM, Robert Fink <[email protected]> wrote:

> Hi,
> 
> I have a file containing avro GenericRecords; for debug purposes, let' read 
> one particular field, "date_time" and print it to the screen:
> 
>      def sc = new SparkContext("local", "My Spark Context")
>      val job = new org.apache.hadoop.mapreduce.Job
> 
>      // input data:
>      def avrofile = "debug-data/records.avro"
> 
>      // Load
>      val rdd = sc.newAPIHadoopFile(
>        avrofile,
>        classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[GenericRecord]],
>        classOf[org.apache.avro.mapred.AvroKey[GenericRecord]],
>        classOf[org.apache.hadoop.io.NullWritable],
>        job.getConfiguration).map( x => x._1.datum )
> 
>      rdd.foreach( x => println( x.get("date_time")))
> 
> This is all jolly good, the output is:
> 2013-10-21T00:19:25-04:00
> 2013-10-21T00:12:39-04:00
> 2013-10-21T00:08:09-04:00
> 2013-10-21T00:12:54-04:00
> [...]
> 
> When I change the loading statement to use caching like this
> 
>        val rdd = sc.newAPIHadoopFile(
>        avrofile,
>        classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[GenericRecord]],
>        classOf[org.apache.avro.mapred.AvroKey[GenericRecord]],
>        classOf[org.apache.hadoop.io.NullWritable],
>        job.getConfiguration).map( x => x._1.datum ).cache
> 
>        rdd.foreach( x => println( x.get("date_time")))
> 
> , then all records have the same date_time, in fact, they are all identical 
> records:
> 2013-10-21T00:01:29-04:00
> 2013-10-21T00:01:29-04:00
> 2013-10-21T00:01:29-04:00
> 2013-10-21T00:01:29-04:00
> [...]
> 
> Any idea what's going on here?
> 
> Best,
>   Robert

Reply via email to