Avro files embed a special 16-byte sync marker between blocks, so you can seek to an arbitrary offset and scan forward for the marker to find the beginning of the next block, though handling that yourself adds some complexity. My understanding is that the Python Avro libraries are quite slow, so you may want to try a prototype in Java and see if that meets your performance needs.
Alternatively, you can try writing several small files and processing those in parallel.

Joshua

On Thu, Jul 5, 2018 at 5:11 PM Troy X <[email protected]> wrote:
>
> Hi Everyone,
>
> I'm a bit new to the Avro format, trying to process a slightly large Avro
> file with 300 columns and 200K rows using Python 3.
>
> However, it's a bit slow and I would like to try processing individual
> parts of the file with 5 processes.
>
> I wonder if there is any easy way to seek within an Avro file without
> causing data corruption, rather than looping through each record
> sequentially?
>
> I believe it is a splittable format since it can be processed via
> MapReduce/Spark in parallel, but I'm not sure if the Python avro module
> supports jumping within a file to find a safe position to start reading
> from.
>
> Currently all I can do is process it row by row, which doesn't help with
> parallelisation:
>
>     reader = DataFileReader(open("users.avro", "rb"), DatumReader())
>     i = 0
>     for user in reader:
>         i += 1
>         if i > 10000:
>             break
>
> Or should I switch to C or Java to process bigger files, if not
> Spark/MapReduce?
>
> Thanks,
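To sketch the "several small files" alternative mentioned above: fan one file out to each worker process and let the standard library handle the pool. `process_all` and its parameters are illustrative names of mine; the per-file Avro reading is assumed to live in whatever `worker` function you pass in (e.g. one that opens a file with `DataFileReader` and aggregates its rows):

```python
from concurrent.futures import ProcessPoolExecutor

def process_all(paths, worker, max_workers=5):
    """Apply `worker` to each path in its own process and return the
    per-file results in input order. With 5 workers and 5 roughly
    equal-sized files, each process handles one file."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths))
```

Note that `worker` must be a picklable top-level function for `ProcessPoolExecutor` to ship it to child processes.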
