Hi all,
I have a simple schema:

{
  "name": "Record",
  "type": "record",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"}
  ]
}
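For context, the Record class in the code below is the one the Avro compiler
generates from this schema. Roughly, its relevant shape is the following
(a hand-written sketch, not the real generated code, which also implements
SpecificRecord and exposes a SCHEMA$ field):

```java
// Hand-written sketch of the generated class's shape (an assumption, for
// illustration only; the real avro-compiler output also implements
// SpecificRecord and carries a public static SCHEMA$ field).
class Record {
  public CharSequence name;  // generated string fields are CharSequence/Utf8
  public int id;
}
```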
which I use to write 2 records to an Avro file with the following code:
public static Record createRecord(String name, int id) {
  Record record = new Record();
  record.name = name;
  record.id = id;
  return record;
}

public static void writeToAvro(OutputStream outputStream)
    throws IOException {
  DataFileWriter<Record> writer =
      new DataFileWriter<Record>(new SpecificDatumWriter<Record>());
  writer.create(Record.SCHEMA$, outputStream);
  writer.append(createRecord("r1", 1));
  writer.append(createRecord("r2", 2));
  writer.close();
  outputStream.close();
}
I also have some reader code which reads the file back and dumps each
Record:
DataFileStream<Record> reader = new DataFileStream<Record>(
    is, new SpecificDatumReader<Record>(Record.SCHEMA$));
for (Record a : reader) {
  System.out.println(ToStringBuilder.reflectionToString(a));
}
reader.close();
Its output is:
Record@1e9e5c73[name=r1,id=1]
Record@ed42d08[name=r2,id=2]
When I use this file with Pig and AvroStorage, Pig seems to think
there are 4 records:
grunt> REGISTER /app/hadoop/lib/avro-1.5.4.jar;
grunt> REGISTER /app/pig-0.9.0/contrib/piggybank/java/piggybank.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/json-simple-1.1.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-core-asl-1.6.0.jar;
grunt> REGISTER /app/pig-0.9.0/build/ivy/lib/Pig/jackson-mapper-asl-1.6.0.jar;
grunt> raw = LOAD 'test.v1.avro' USING
org.apache.pig.piggybank.storage.avro.AvroStorage;
grunt> dump raw;
..
Input(s):
Successfully read 4 records (825 bytes) from:
"hdfs://localhost:9000/user/aholmes/test.v1.avro"
Output(s):
Successfully stored 4 records (46 bytes) in:
"hdfs://localhost:9000/tmp/temp2039109003/tmp1924774585"
Counters:
Total records written : 4
Total bytes written : 46
..
(r1,1)
(r2,2)
(r1,1)
(r2,2)
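For what it's worth, here is the kind of stdlib-only check I'd use to count
the records straight from the container file's block headers, without going
through the Avro or Pig jars. It assumes the standard object container layout
("Obj" + 0x01 magic, a metadata map, a 16-byte sync marker, then repeated
blocks of record count, byte size, data, sync); the class name
AvroBlockCounter is mine:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Counts records in an Avro object container file by summing the per-block
// record counts, without using the Avro jars. Sketch under the assumption
// of the standard container layout described above.
public class AvroBlockCounter {

  /** Decode one zigzag varint-encoded Avro long. */
  static long readLong(InputStream in) throws IOException {
    long accum = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) throw new EOFException();
      accum |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) break;
      shift += 7;
    }
    return (accum >>> 1) ^ -(accum & 1);  // zigzag decode
  }

  /** Skip exactly n bytes, even if InputStream.skip() is lazy. */
  static void skipFully(InputStream in, long n) throws IOException {
    while (n > 0) {
      long skipped = in.skip(n);
      if (skipped <= 0) {
        if (in.read() < 0) throw new EOFException();
        skipped = 1;
      }
      n -= skipped;
    }
  }

  /** Skip the file-metadata map (blocks of key/value pairs, 0-terminated). */
  static void skipMetadata(InputStream in) throws IOException {
    for (long n = readLong(in); n != 0; n = readLong(in)) {
      if (n < 0) {          // negative count: total byte size follows
        readLong(in);
        n = -n;
      }
      for (long i = 0; i < n; i++) {
        skipFully(in, readLong(in));  // key: length-prefixed string
        skipFully(in, readLong(in));  // value: length-prefixed bytes
      }
    }
  }

  /** Sum the per-block record counts of an Avro container file. */
  public static long countRecords(InputStream in) throws IOException {
    byte[] magic = new byte[4];
    new DataInputStream(in).readFully(magic);
    if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1)
      throw new IOException("not an Avro container file");
    skipMetadata(in);
    skipFully(in, 16);                // 16-byte sync marker
    long total = 0;
    while (true) {
      long count;
      try {
        count = readLong(in);         // records in this block
      } catch (EOFException e) {
        return total;                 // clean end of file
      }
      skipFully(in, readLong(in));    // skip the block's serialized data
      skipFully(in, 16);              // trailing sync marker
      total += count;
    }
  }
}
```

If this reports 2 while Pig reports 4, the file itself would seem to be fine.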
I'm sure I'm doing something wrong, but would appreciate any help.
Many thanks,
Alex