Folks,
I am noticing odd behavior: loading and iterating over Avro records via
AvroStorage takes much longer than iterating over the same records in a
MapReduce job. Are there any known issues, or any clues as to why AvroStorage
would take so long?
Example: the schema I am using:
{
  "type": "record",
  "name": "Timber",
  "namespace": "com.timber.avro",
  "fields": [
    { "name": "identifier", "type": "string",
      "doc": "Identifier. NonNull." },
    { "name": "reservation",
      "type": [
        "null",
        { "type": "array",
          "items": {
            "name": "Reservation",
            "type": "record",
            "fields": [
              { "name": "bookingDate", "type": "long",
                "doc": "Timestamp in UTC. NonNull" },
              { "name": "code", "type": [ "null", "string" ],
                "doc": "Code.", "default": null }
            ]
          }
        }
      ],
      "default": null,
      "doc": "array of segment id which this urn belongs."
    }
  ]
}
---> Pig
Using Pig AvroStorage, it takes more than 30 minutes to simply iterate. I have
been adding more optional fields (like code) to the Reservation record above.
Does that affect how I should be using AvroStorage?
register /json-simple-1.1.jar
register /piggybank.jar

records = LOAD '/data/*/one.avro' USING
    org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check');

reservation = FOREACH records {
    selectHotelAtt = FOREACH reservation GENERATE bookingDate;
    GENERATE FLATTEN(selectHotelAtt.bookingDate) as bookingDate;
};
DUMP reservation;
---> MapReduce
When I use a MapReduce job to iterate through all the records, it completes in
less than 2 minutes for about a million records.
Mapper interface:
    @Override
    public void map(final AvroKey<Timber> key, final NullWritable value,
            final Context context) throws IOException, InterruptedException {
Thanks,
Jaikit