How to read ORC file in chunks

Jeyhun Karimov Wed, 27 Oct 2021 01:18:20 -0700

Hi guys,

I am trying to read ORC files in chunks. I have a very large ORC file on
disk (say 100GB) and very limited memory (e.g., I can buffer at most 1MB of
data in memory). I want to scan the ORC file intelligently:


   1. read footer
   2. get addresses of stripes
   3. read first stripe's metadata (footer) and apply some filters
   4. read first stripe's index
   5. read first stripe's data (chunk by chunk - 1MB at a time)
   6. move to the next stripe
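
To make step 5 concrete, here is a rough sketch of how a stripe's byte range could be cut into pieces of at most 1MB. ByteRange and splitIntoChunks are hypothetical names used only for illustration, not part of the ORC library, and the stripe offset and length are assumed to come from the file footer (steps 1 and 2). It only plans the byte ranges; whether the reader can decode such pieces independently is part of what I am asking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical descriptor for a byte range in the file,
// e.g. the data section of one stripe.
struct ByteRange {
  uint64_t offset;  // absolute position in the file
  uint64_t length;  // number of bytes in the range
};

// Split one range into consecutive pieces of at most maxChunk bytes.
// The last piece may be shorter; an empty range yields no pieces.
std::vector<ByteRange> splitIntoChunks(const ByteRange& range,
                                       uint64_t maxChunk) {
  assert(maxChunk > 0);
  std::vector<ByteRange> chunks;
  uint64_t done = 0;
  while (done < range.length) {
    uint64_t n = std::min(maxChunk, range.length - done);
    chunks.push_back({range.offset + done, n});
    done += n;
  }
  return chunks;
}
```

For example, splitIntoChunks({stripeOffset, stripeLength}, 1 << 20) would give the list of 1MB reads for one stripe, where stripeOffset and stripeLength are placeholders for values taken from the footer.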

I have tried to use MemoryInputStream.hh from the ORC repo:

https://github.com/apache/orc/blob/main/c++/test/MemoryInputStream.hh

However, while reading the data, its read method is asked for large
amounts of data at once (beyond 1MB):

    virtual void read(void* buf, uint64_t length, uint64_t offset) override {
      memcpy(buf, buffer + offset, length);
    }
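
What I have in mind instead is roughly the following: a stream backed by the file on disk, so that nothing has to be held in memory up front, with each disk request capped at a fixed size. This is only a sketch under assumptions: ChunkedFileStream is a hypothetical class that mirrors the read(buf, length, offset) signature above but is deliberately not derived from any ORC type, so it stands alone. It also does not solve the whole problem, since the caller's buf must still be able to hold length bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical file-backed stream: same read(buf, length, offset) shape as
// the snippet above, but it pulls bytes from disk on demand.
class ChunkedFileStream {
 public:
  ChunkedFileStream(const std::string& path, uint64_t maxChunk)
      : file_(path, std::ios::binary), maxChunk_(maxChunk) {
    assert(maxChunk > 0);
    if (!file_) throw std::runtime_error("cannot open " + path);
  }

  // Copy `length` bytes starting at `offset` into `buf`,
  // issuing disk reads of at most maxChunk_ bytes each.
  void read(void* buf, uint64_t length, uint64_t offset) {
    char* out = static_cast<char*>(buf);
    file_.clear();  // reset any state left by an earlier call
    file_.seekg(static_cast<std::streamoff>(offset));
    uint64_t done = 0;
    while (done < length) {
      uint64_t n = std::min(maxChunk_, length - done);
      file_.read(out + done, static_cast<std::streamsize>(n));
      if (static_cast<uint64_t>(file_.gcount()) != n) {
        throw std::runtime_error("short read");
      }
      done += n;
      ++diskReads_;
    }
  }

  // Number of disk requests issued so far (for inspection only).
  uint64_t diskReads() const { return diskReads_; }

 private:
  std::ifstream file_;
  uint64_t maxChunk_;
  uint64_t diskReads_ = 0;
};
```

With maxChunk set to 1 << 20, a 3MB request would be served as three 1MB disk reads, but the destination buffer passed in by the reader would still be 3MB.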



So, is there a way to read/access different parts of the ORC file
incrementally and with a limited in-memory buffer? Or should I materialize
the whole ORC file on local disk or in memory?

Thanks!

Jeyhun
