hi,

On Fri, Mar 29, 2019 at 9:49 AM Nirmala S <[email protected]> wrote:
>
> Thanks Wes. I have a couple more questions:
> - When a table is read using the ORC adaptor, it gets read into a memory
> pool (in my case default_memory_pool). How do I free this memory once the
> file is processed?

With the default memory pool, the memory is freed automatically when
the RecordBatch data structures are destructed.

> - Is there any way to read the ORC file metadata from the adaptor?

Doesn't look like it yet. This would be a nice contribution to the library.

>
>
> > On 29-Mar-2019, at 7:18 AM, Wes McKinney <[email protected]> wrote:
> >
> > The Arrow APIs are batch-based, so if you want to go record-by-record
> > you would need to develop an interface on top of the
> > arrow::RecordBatch data structure.
> >
> > On Wed, Mar 27, 2019 at 2:06 AM Nirmala S <[email protected]> 
> > wrote:
> >>
> >> Now I see there is an ORC adaptor for Arrow which can read an ORC file 
> >> as a table. With this in place, I intend to use TableBatchReader to read it.
> >>
> >> How do I get a single record from TableBatchReader?
> >>
> >>
> >>> On 22-Mar-2019, at 12:18 AM, Wes McKinney <[email protected]> wrote:
> >>>
> >>> hi Nirmala,
> >>>
> >>> There aren't any tools in the libraries to help you "out of the box",
> >>> so you'll probably have to devise your own metadata storage and state
> >>> management scheme for such a system.
> >>>
> >>> best
> >>> Wes
> >>>
> >>> On Thu, Mar 21, 2019 at 9:53 AM Nirmala S <[email protected]> 
> >>> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>>       I am trying to build a caching layer using Arrow on top of ORC 
> >>>> files. The application will ask the cache for a column of data (which 
> >>>> can be of any data type, fixed or variable length); the cache needs to 
> >>>> check its metadata to see if the column is already present. If yes, it 
> >>>> can return the data to the application. If not, the data needs to be 
> >>>> fetched from the ORC files, cached, and then returned to the 
> >>>> application. The application is multi-threaded, is based on C++, and 
> >>>> has a read-only workload.
> >>>>
> >>>>       This being the case, what is the best method to maintain the 
> >>>> metadata and the data in Arrow? Is there any good practice?
> >>>>
> >>>>       If the cache size is smaller than the ORC file size, should I 
> >>>> add logic to swap data out using an algorithm like LRU, or is this 
> >>>> already present in Arrow?
> >>>>
> >>>>
> >>>> Thanks in advance
> >>>> Nirmala
> >>>>
> >>>>
> >>>>
> >>>>
> >>
>
