Hi Everyone,
I'm a bit new to the Avro format. I'm trying to process a fairly large Avro
file (300 columns, 200K rows) with Python 3.
However, it's slow, and I would like to try processing individual parts of the
file with 5 processes.
I wonder if there is an easy way to seek within an Avro file to a position
that is safe to read from (without misaligning the decoder), rather than
looping through each record sequentially?
I believe Avro is a splittable format, since it can be processed in parallel
via MapReduce/Spark, but I'm not sure whether the Python avro module supports
jumping within a file to find a safe position to start reading from. (My best
guess at a block-level workaround is sketched after the snippet below.)
Currently, all I can do is process it row by row, which doesn't help with
parallelisation:
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
i = 0
for user in reader:  # sequential scan over every record
    i += 1
    if i > 10000:
        break
reader.close()
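The closest I've come to a sketch of what I mean uses fastavro (a separate,
C-accelerated Avro library) instead of the avro package above. Its
block_reader iterates over the file's blocks, and my (unverified) assumption
is that a block's records are only decoded when you actually iterate that
block, so skipped blocks stay cheap; handle_record is just a placeholder name
I made up. Each of the 5 processes would then decode only every 5th block:

from multiprocessing import Pool

from fastavro import block_reader

NUM_WORKERS = 5  # the 5 processes mentioned above

def handle_record(record):
    pass  # hypothetical placeholder for the real per-row work

def worker(worker_id):
    count = 0
    with open("users.avro", "rb") as fo:
        for index, block in enumerate(block_reader(fo)):
            if index % NUM_WORKERS != worker_id:
                continue  # skip this block; its records are (I assume) never decoded
            for record in block:  # decoding happens here
                handle_record(record)
                count += 1
    return count

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        print(sum(pool.map(worker, range(NUM_WORKERS))), "records processed")

Each worker still streams through the whole file, so this isn't a true split;
the "proper" approach I'm asking about would presumably let each worker seek
straight to its own byte range via the sync markers, the way MapReduce does.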
Or should I switch to C or Java to process bigger files, if not
Spark/MapReduce?
Thanks,