I'm pretty new to this area of the codebase myself, but it looks like this functionality isn't currently exposed in the R bindings.
In the C++ docs, it looks like the function you'd want is `arrow::ipc::MakeFileWriter()`[1]. This function allows you to create a RecordBatchWriter with extra options supplied, which is where you'd specify the compression. However, this isn't exposed in the R bindings - there, the writer is initialised with the default options[2]. Please do open an issue if it would be useful for us to expose this option, and we can take a look at doing so. In case they're useful in the meantime, I've put a couple of rough sketches at the bottom of this message: one of a row-at-a-time Feather writer, and one of the duckdb `pivot_longer()` round trip.

[1] https://arrow.apache.org/docs/cpp/api/ipc.html
[2] https://github.com/apache/arrow/blob/af23f6a2e8ece6211b087b0e4f24b9daaffbb8a9/r/src/recordbatchwriter.cpp#L47

On Sun, 30 Jul 2023 at 00:48, Richard Beare <[email protected]> wrote:

> Thanks again. Looks like my best option might be to keep the arrow side
> simple - just provide the chunking IO - and use standard tidyverse for the
> rest.
>
> One final thing - is it possible to set the feather compression type via
> RecordBatchFileWriter?
>
>
> On Fri, Jul 28, 2023 at 6:21 PM Nic Crane <[email protected]> wrote:
>
>> The equivalent for Feather:
>> https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
>>
>> Opening the file using `read_parquet()` will read it in as a tibble, so
>> you'll end up with the same RAM usage. If you use
>> `read_parquet(<filename>, as_data_frame = FALSE)`, then it reads it in as
>> an Arrow Table, which will probably use less RAM, but `open_dataset()` is
>> the best option for not loading things into memory before they're needed
>> (Arrow Tables are in-memory, but an Arrow Dataset is a bit more like a
>> database connection).
>>
>> Your error upon calling `pivot_longer()` is because it's not implemented
>> in arrow, but it is implemented in duckdb, so you can use `to_duckdb()` to
>> pass the data there (zero-copy), then call `pivot_longer()` and then
>> `to_arrow()` to pass it back to arrow.
>>
>> `tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm
>> aware, so that might be a stumbling block, depending on exactly what output
>> you need at the end of your workflow.
>>
>>
>> On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]>
>> wrote:
>>
>>> I have some basics working, now for the tricky questions.
>>>
>>> The workflow I'm hoping to duplicate is a fairly classic tidyverse
>>> approach to fitting many regression models:
>>>
>>> wide dataframe -> pivot_longer -> nest -> mutate to create a column
>>> containing fitted models
>>>
>>> At the moment I have the wide frame sitting in parquet format.
>>>
>>> The above approach does function if I open the parquet file using
>>> "read_parquet", but I'm not sure that I'm actually saving RAM over a
>>> standard dataframe approach.
>>>
>>> If I place the parquet in a dataset and use "open_dataset", I have the
>>> following issue:
>>>
>>> Error in UseMethod("pivot_longer") : no applicable method for
>>> 'pivot_longer' applied to an object of class "c('FileSystemDataset',
>>> 'Dataset', 'ArrowObject', 'R6')"
>>>
>>> Any recommendations on attacking this?
>>>
>>> Thanks
>>>
>>> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
>>> wrote:
>>>
>>>> Perfect! Thank you for that.
>>>>
>>>> I had not found the ParquetFileWriter class. Is there an equivalent
>>>> feather class?
>>>>
>>>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> It is possible - I've created an example in this gist showing how to
>>>>> loop through a list of files and write to a Parquet file one row at a
>>>>> time: https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>>>>>
>>>>> Does this solve your problem?
>>>>>
>>>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi arrow experts,
>>>>>>
>>>>>> I have what I think should be a standard problem, but I'm not seeing
>>>>>> the correct solution.
>>>>>>
>>>>>> I have data in a nonstandard form (nifti neuroimaging files) that I
>>>>>> can load into R and transform into a single-row dataframe (which is 30K
>>>>>> columns). In a small example I can load about 80 of these into a single
>>>>>> dataframe and save as feather or parquet without problem. I'd like to
>>>>>> address the problem where I have thousands.
>>>>>>
>>>>>> The approach of loading a collection (e.g. 10) into a dataframe,
>>>>>> saving with a hive-standard name, and repeating does work, but doesn't
>>>>>> seem like the right way to do it.
>>>>>>
>>>>>> Is there a way to stream data, one row at a time, into a feather or
>>>>>> parquet file? I've attempted to use write_feather with a
>>>>>> FileOutputStream sink, but without luck.
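
P.S. Here's the first rough, untested sketch mentioned above: the row-at-a-time loop on the Feather (IPC) side, using RecordBatchFileWriter. `load_one_nifti()` is just a placeholder for whatever turns one of your files into a one-row data frame, and the file paths are made up:

    library(arrow)

    files <- list.files("nifti_frames", full.names = TRUE)   # made-up directory

    # load_one_nifti() stands in for your own loader that returns a one-row
    # data frame; build the schema from the first file so that every batch
    # written afterwards matches it
    first_batch <- record_batch(load_one_nifti(files[1]))
    sch <- first_batch$schema

    sink <- FileOutputStream$create("all_subjects.arrow")
    writer <- RecordBatchFileWriter$create(sink, sch)
    writer$write(first_batch)

    for (f in files[-1]) {
      writer$write(record_batch(load_one_nifti(f), schema = sch))   # append one row at a time
    }

    writer$close()
    sink$close()

The compression caveat above applies here: until the option is exposed, this writer uses the default IPC options. The same pattern with the ParquetFileWriter class covers the Parquet side.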

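And for the earlier `pivot_longer()` question, a minimal sketch of the duckdb round trip - the dataset path and column names are invented, so adjust them to your data:

    library(arrow)
    library(dplyr)
    library(tidyr)

    ds <- open_dataset("wide_parquet/")            # invented path to the wide data

    long <- ds |>
      to_duckdb() |>                               # hand the data over to duckdb
      pivot_longer(
        cols = starts_with("voxel_"),              # invented column naming
        names_to = "voxel",
        values_to = "intensity"
      ) |>
      to_arrow()                                   # hand the result back to arrow

    # nothing lands in an R data frame until you collect()
    long_df <- collect(long)

As mentioned earlier in the thread, `tidyr::nest()` won't translate in either engine, so that step will probably need to happen after the `collect()`.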