Hi Everyone,
I'm a bit new to the Avro format. I'm trying to process a fairly large Avro
file (300 columns, 200K rows) with Python 3.
However, it's slow, and I would like to try processing individual parts of the
file with 5 processes.
I wonder if there is an easy way to seek within an Avro file to a position
that is safe to read from (without misaligning the decoder), rather than
looping through each record sequentially?
I believe Avro is a splittable format, since it can be processed in parallel
via MapReduce/Spark, but I'm not sure whether the Python avro module supports
jumping within a file to find a safe position to start reading from. (My best
guess at a block-level workaround is sketched after the snippet below.)
Currently, all I can do is process it row by row, which doesn't help with
parallelisation:
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
i = 0
for user in reader:  # sequential scan over every record
    i += 1
    if i > 10000:
        break
reader.close()
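The closest I've come to a sketch of what I mean uses fastavro (a separate,
C-accelerated Avro library) instead of the avro package above. Its
block_reader iterates over the file's blocks, and my (unverified) assumption
is that a block's records are only decoded when you actually iterate that
block, so skipped blocks stay cheap; handle_record is just a placeholder name
I made up. Each of the 5 processes would then decode only every 5th block:

from multiprocessing import Pool

from fastavro import block_reader

NUM_WORKERS = 5  # the 5 processes mentioned above

def handle_record(record):
    pass  # hypothetical placeholder for the real per-row work

def worker(worker_id):
    count = 0
    with open("users.avro", "rb") as fo:
        for index, block in enumerate(block_reader(fo)):
            if index % NUM_WORKERS != worker_id:
                continue  # skip this block; its records are (I assume) never decoded
            for record in block:  # decoding happens here
                handle_record(record)
                count += 1
    return count

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        print(sum(pool.map(worker, range(NUM_WORKERS))), "records processed")

Each worker still streams through the whole file, so this isn't a true split;
the "proper" approach I'm asking about would presumably let each worker seek
straight to its own byte range via the sync markers, the way MapReduce does.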
Or should I switch to C or Java to process bigger files, if not
Spark/MapReduce?
Thanks,