Thanks, Matei. I expected something along these lines.
  Robert

On Fri, Dec 13, 2013 at 5:28 AM, Matei Zaharia <[email protected]> wrote:

> The hadoopFile method reuses the Writable object between records that it
> reads by default, so you get back the same object. You should clone them if
> you need to cache them. This is kind of an unintuitive behavior that we’ll
> probably need to turn off by default; it’s helpful when you don’t need to
> cache the objects because it reduces allocation.
>
> Matei
>
> On Dec 11, 2013, at 8:51 PM, Robert Fink <[email protected]> wrote:
>
> > Hi,
> >
> > I have a file containing Avro GenericRecords; for debug purposes, let's
> > read one particular field, "date_time", and print it to the screen:
> >
> >      val sc = new SparkContext("local", "My Spark Context")
> >      val job = new org.apache.hadoop.mapreduce.Job
> >
> >      // input data:
> >      val avrofile = "debug-data/records.avro"
> >
> >      // Load
> >      val rdd = sc.newAPIHadoopFile(
> >        avrofile,
> >        classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[GenericRecord]],
> >        classOf[org.apache.avro.mapred.AvroKey[GenericRecord]],
> >        classOf[org.apache.hadoop.io.NullWritable],
> >        job.getConfiguration).map( x => x._1.datum )
> >
> >      rdd.foreach( x => println( x.get("date_time")))
> >
> > This is all jolly good, the output is:
> > 2013-10-21T00:19:25-04:00
> > 2013-10-21T00:12:39-04:00
> > 2013-10-21T00:08:09-04:00
> > 2013-10-21T00:12:54-04:00
> > [...]
> >
> > When I change the loading statement to use caching like this
> >
> >        val rdd = sc.newAPIHadoopFile(
> >        avrofile,
> >        classOf[org.apache.avro.mapreduce.AvroKeyInputFormat[GenericRecord]],
> >        classOf[org.apache.avro.mapred.AvroKey[GenericRecord]],
> >        classOf[org.apache.hadoop.io.NullWritable],
> >        job.getConfiguration).map( x => x._1.datum ).cache
> >
> >        rdd.foreach( x => println( x.get("date_time")))
> >
> > then all records have the same date_time; in fact, they are all
> > identical records:
> > 2013-10-21T00:01:29-04:00
> > 2013-10-21T00:01:29-04:00
> > 2013-10-21T00:01:29-04:00
> > 2013-10-21T00:01:29-04:00
> > [...]
> >
> > Any idea what's going on here?
> >
> > Best,
> >   Robert
>
>
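
The object reuse Matei describes can be reproduced without Spark or Avro. The following is only a minimal sketch: MutableRecord, copyRecord, and reusingReader are hypothetical stand-ins for a reused record object and the reader that refills it, not names from any of those libraries. It shows why printing records as they stream by works, why keeping references to them does not, and how copying each record before storing it avoids the problem.

```scala
// Hypothetical stand-in for a mutable record that a reader refills in place.
final class MutableRecord(var dateTime: String) {
  // Returns an independent snapshot of the current state.
  def copyRecord(): MutableRecord = new MutableRecord(dateTime)
}

object ReuseDemo {
  // Mimics a record reader that allocates one object and overwrites it per record.
  def reusingReader(values: Seq[String]): Iterator[MutableRecord] = {
    val shared = new MutableRecord("")
    values.iterator.map { v =>
      shared.dateTime = v
      shared // the same instance is handed out every time
    }
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      "2013-10-21T00:19:25-04:00",
      "2013-10-21T00:12:39-04:00",
      "2013-10-21T00:08:09-04:00")

    // Streaming: each record is read before the next overwrite, so values differ.
    val streamed = reusingReader(input).map(_.dateTime).toList

    // Keeping references: all elements are one object holding the last value written.
    val keptRefs = reusingReader(input).toList.map(_.dateTime)

    // Copying first: each stored element is its own snapshot.
    val keptCopies = reusingReader(input).map(_.copyRecord()).toList.map(_.dateTime)

    println(streamed)   // three distinct timestamps
    println(keptRefs)   // the last timestamp, three times
    println(keptCopies) // three distinct timestamps
  }
}
```

In the job quoted above, the analogous change would be to copy each datum inside the map, before calling cache. Avro's GenericData.get().deepCopy(datum.getSchema, datum) is one way to do that, assuming an Avro version that provides deepCopy.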
