Sure, will try to contribute. With the ORC adaptor we just have the columns; a typical case is an underlying schema made up of multiple columns of different data types (date, float, int, string). Is there any optimisation to read the data row-wise without actually reading the whole file as a Table?
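To make the question concrete, here is roughly the row-style interface I have in mind, sketched in plain C++ with std::variant columns standing in for Arrow arrays. The names here (Batch, ColumnData, RowCursor) are my own for illustration, not Arrow APIs — this is the kind of interface one would build on top of arrow::RecordBatch:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// One cell can hold any of the column types in my schema.
using Cell = std::variant<int32_t, double, std::string>;
using ColumnData = std::vector<Cell>;

// Columnar batch: all columns have equal length (like a RecordBatch).
struct Batch {
  std::vector<ColumnData> columns;
  size_t num_rows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Presents the batch one row at a time: row i is the i-th element of
// every column.
struct RowCursor {
  const Batch& batch;
  size_t row = 0;
  bool next(std::vector<Cell>* out) {
    if (row >= batch.num_rows()) return false;
    out->clear();
    for (const auto& col : batch.columns) out->push_back(col[row]);
    ++row;
    return true;
  }
};
```

The point being: the columnar data stays where it is, and the "row" is just a gather of index i across the columns.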
I looked into the API below: ORCFileReader::Read(..) gives a Table, and ORCFileReader::ReadStripe gives a RecordBatch on which I can operate at column level. Is there a way I can get something similar to a RecordBatch, but as a row?

> On 29-Mar-2019, at 8:23 PM, Wes McKinney <[email protected]> wrote:
>
> hi,
>
> On Fri, Mar 29, 2019 at 9:49 AM Nirmala S <[email protected]> wrote:
>>
>> Thanks Wes. I do have a couple more questions,
>> - When a table is read using the ORC adaptor, it gets read into a memory
>> pool (in my case default_memory_pool). How to free this area once the
>> file is processed?
>
> With the default memory pool, the memory is freed automatically when
> the RecordBatch data structures are destructed.
>
>> - Is there any way to read the ORC file metadata from the adaptor?
>
> Doesn't look like it yet. This would be a nice contribution to the library.
>
>>> On 29-Mar-2019, at 7:18 AM, Wes McKinney <[email protected]> wrote:
>>>
>>> The Arrow APIs are batch-based, so if you want to go record-by-record
>>> you would need to develop an interface on top of the
>>> arrow::RecordBatch data structure
>>>
>>> On Wed, Mar 27, 2019 at 2:06 AM Nirmala S <[email protected]> wrote:
>>>>
>>>> Now I see there is an ORC adaptor for Arrow which can read an ORC
>>>> file as a table. With this in place, I intend to use TableBatchReader
>>>> to read it.
>>>>
>>>> How to get a single record from TableBatchReader?
>>>>
>>>>> On 22-Mar-2019, at 12:18 AM, Wes McKinney <[email protected]> wrote:
>>>>>
>>>>> hi Nirmala,
>>>>>
>>>>> There aren't any tools in the libraries to help you "out of the box",
>>>>> so you'll probably have to devise your own metadata storage and state
>>>>> management scheme for such a system.
>>>>>
>>>>> best
>>>>> Wes
>>>>>
>>>>> On Thu, Mar 21, 2019 at 9:53 AM Nirmala S <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to build a caching layer using Arrow on top of ORC
>>>>>> files. The application will ask for a column (which can be of any
>>>>>> data type - fixed or variable length) from the cache, and the cache
>>>>>> needs to check its metadata to see if the column is already present.
>>>>>> If yes, it can return the data to the application. If not, the data
>>>>>> needs to be fetched from the ORC files, cached, and then returned to
>>>>>> the application. The application is multi-threaded and is based on
>>>>>> C++. The application has a read-only workload.
>>>>>>
>>>>>> This being the case, what is the best method to maintain the
>>>>>> metadata and the data in Arrow? Is there any good practice?
>>>>>>
>>>>>> If the cache size is smaller than the ORC file size, should I be
>>>>>> putting in logic to swap the data using some algorithm like LRU, or
>>>>>> is this already present in Arrow?
>>>>>>
>>>>>> Thanks in advance
>>>>>> Nirmala
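Separately, on the caching question from my earlier mail: since Arrow ships no cache of its own, I have started prototyping the eviction bookkeeping myself. A minimal LRU sketch follows — LruCache is my own name, and plain strings stand in for the cached Arrow column data:

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Fixed-capacity LRU cache: keys are column identifiers, values stand in
// for cached column data. Most-recently-used entries sit at the front of
// the list; eviction removes from the back.
class LruCache {
 public:
  explicit LruCache(size_t capacity) : capacity_(capacity) {}

  // Returns nullptr on a miss; on a hit, marks the entry most-recently-used.
  const std::string* Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);  // move to front
    return &it->second->second;
  }

  void Put(const std::string& key, std::string value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    order_.emplace_front(key, std::move(value));
    index_[key] = order_.begin();
    if (index_.size() > capacity_) {  // evict least-recently-used entry
      index_.erase(order_.back().first);
      order_.pop_back();
    }
  }

 private:
  using Entry = std::pair<std::string, std::string>;
  size_t capacity_;
  std::list<Entry> order_;  // MRU at front, LRU at back
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

This is single-threaded; for my multi-threaded, read-only workload I would still need to add locking around Get/Put.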
