Sure, will try to contribute. With the ORC adaptor we just have the columns; a typical case is an underlying schema made up of multiple columns of different data types (date, float, int, string). Is there any optimisation to read the data row-wise without actually reading the whole file as a Table?
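To make the question concrete, here is roughly the row-style interface I have in mind, sketched in plain C++ with std::variant columns standing in for Arrow arrays. The names here (Batch, ColumnData, RowCursor) are my own for illustration, not Arrow APIs — this is the kind of interface one would build on top of arrow::RecordBatch:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// One cell can hold any of the column types in my schema.
using Cell = std::variant<int32_t, double, std::string>;
using ColumnData = std::vector<Cell>;

// Columnar batch: all columns have equal length (like a RecordBatch).
struct Batch {
  std::vector<ColumnData> columns;
  size_t num_rows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Presents the batch one row at a time: row i is the i-th element of
// every column.
struct RowCursor {
  const Batch& batch;
  size_t row = 0;
  bool next(std::vector<Cell>* out) {
    if (row >= batch.num_rows()) return false;
    out->clear();
    for (const auto& col : batch.columns) out->push_back(col[row]);
    ++row;
    return true;
  }
};
```

The point being: the columnar data stays where it is, and the "row" is just a gather of index i across the columns.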
I looked into the API below: ORCFileReader::Read(..) gives a Table, and ORCFileReader::ReadStripe gives a RecordBatch on which I can operate at column level. Is there a way I can get something similar to a RecordBatch, but as a row?

> On 29-Mar-2019, at 8:23 PM, Wes McKinney <[email protected]> wrote:
>
> hi,
>
> On Fri, Mar 29, 2019 at 9:49 AM Nirmala S <[email protected]> wrote:
>>
>> Thanks Wes. I do have a couple more questions,
>> - When a table is read using the ORC adaptor, it gets read into a memory
>> pool (in my case default_memory_pool). How to free this area once the
>> file is processed?
>
> With the default memory pool, the memory is freed automatically when
> the RecordBatch data structures are destructed.
>
>> - Is there any way to read the ORC file metadata from the adaptor?
>
> Doesn't look like it yet. This would be a nice contribution to the library.
>
>>> On 29-Mar-2019, at 7:18 AM, Wes McKinney <[email protected]> wrote:
>>>
>>> The Arrow APIs are batch-based, so if you want to go record-by-record
>>> you would need to develop an interface on top of the
>>> arrow::RecordBatch data structure
>>>
>>> On Wed, Mar 27, 2019 at 2:06 AM Nirmala S <[email protected]> wrote:
>>>>
>>>> Now I see there is an ORC adaptor for Arrow which can read an ORC
>>>> file as a table. With this in place, I intend to use TableBatchReader
>>>> to read it.
>>>>
>>>> How to get a single record from TableBatchReader?
>>>>
>>>>> On 22-Mar-2019, at 12:18 AM, Wes McKinney <[email protected]> wrote:
>>>>>
>>>>> hi Nirmala,
>>>>>
>>>>> There aren't any tools in the libraries to help you "out of the box",
>>>>> so you'll probably have to devise your own metadata storage and state
>>>>> management scheme for such a system.
>>>>>
>>>>> best
>>>>> Wes
>>>>>
>>>>> On Thu, Mar 21, 2019 at 9:53 AM Nirmala S <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to build a caching layer using Arrow on top of ORC
>>>>>> files. The application will ask for a column (which can be of any
>>>>>> data type - fixed or variable length) from the cache, and the cache
>>>>>> needs to check its metadata to see if the column is already present.
>>>>>> If yes, it can return the data to the application. If not, the data
>>>>>> needs to be fetched from the ORC files, cached, and then returned to
>>>>>> the application. The application is multi-threaded and is based on
>>>>>> C++. The application has a read-only workload.
>>>>>>
>>>>>> This being the case, what is the best method to maintain the
>>>>>> metadata and the data in Arrow? Is there any good practice?
>>>>>>
>>>>>> If the cache size is smaller than the ORC file size, should I be
>>>>>> putting in logic to swap the data using some algorithm like LRU, or
>>>>>> is this already present in Arrow?
>>>>>>
>>>>>> Thanks in advance
>>>>>> Nirmala
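Separately, on the caching question from my earlier mail: since Arrow ships no cache of its own, I have started prototyping the eviction bookkeeping myself. A minimal LRU sketch follows — LruCache is my own name, and plain strings stand in for the cached Arrow column data:

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Fixed-capacity LRU cache: keys are column identifiers, values stand in
// for cached column data. Most-recently-used entries sit at the front of
// the list; eviction removes from the back.
class LruCache {
 public:
  explicit LruCache(size_t capacity) : capacity_(capacity) {}

  // Returns nullptr on a miss; on a hit, marks the entry most-recently-used.
  const std::string* Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);  // move to front
    return &it->second->second;
  }

  void Put(const std::string& key, std::string value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = std::move(value);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    order_.emplace_front(key, std::move(value));
    index_[key] = order_.begin();
    if (index_.size() > capacity_) {  // evict least-recently-used entry
      index_.erase(order_.back().first);
      order_.pop_back();
    }
  }

 private:
  using Entry = std::pair<std::string, std::string>;
  size_t capacity_;
  std::list<Entry> order_;  // MRU at front, LRU at back
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

This is single-threaded; for my multi-threaded, read-only workload I would still need to add locking around Get/Put.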
