I have some basics working; now for the tricky questions.
The workflow I'm hoping to duplicate is the fairly classic tidyverse
approach to fitting many regression models:
wide dataframe -> pivot_longer -> nest -> mutate to create a column
containing the fitted models.
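
In dplyr/tidyr/purrr terms the pattern is roughly this (wide_df and the
column names are just placeholders for my real data):

library(dplyr)
library(tidyr)
library(purrr)

models <- wide_df |>
  pivot_longer(cols = -c(subject_id, age),
               names_to = "voxel", values_to = "value") |>
  nest(data = -voxel) |>                          # one nested frame per voxel
  mutate(fit = map(data, ~ lm(value ~ age, data = .x)))
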
At the moment I have the wide frame sitting in parquet format.
The approach above does work if I open the parquet file using
read_parquet(), but I'm not sure that I'm actually saving any RAM over a
standard dataframe approach.
If I place the parquet file in a dataset and use open_dataset(), I hit
the following issue:
Error in UseMethod("pivot_longer") : no applicable method for
'pivot_longer' applied to an object of class "c('FileSystemDataset',
'Dataset', 'ArrowObject', 'R6')"
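
For reference, the sequence that triggers this is roughly (the path is a
placeholder):

library(arrow)
library(tidyr)

ds <- open_dataset("path/to/parquet_dir")        # FileSystemDataset, not a data frame
ds |> pivot_longer(cols = -c(subject_id, age))   # fails with the error above
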
Any recommendations on attacking this?
Thanks
On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
wrote:
> Perfect! Thank you for that.
>
> I had not found the ParquetFileWriter class. Is there an equivalent
> feather class?
>
> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>
>> Hi Richard,
>>
>> It is possible - I've created an example in this gist showing how to loop
>> through a list of files and write to a Parquet file one row at a time:
>> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf.
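>>
>> Roughly, the loop takes this shape (a condensed sketch rather than the
>> gist verbatim; the file list, schema and read_one_row() helper below are
>> placeholders, and the exact ParquetFileWriter$create() arguments can vary
>> a little between arrow versions):
>>
>> library(arrow)
>>
>> # stand-in for the real loader: turns one input file into a one-row data frame
>> read_one_row <- function(path) data.frame(id = path, v1 = rnorm(1), v2 = rnorm(1))
>>
>> files <- sprintf("subject_%03d.nii", 1:5)
>> sch <- schema(id = string(), v1 = float64(), v2 = float64())
>>
>> sink <- FileOutputStream$create("all_subjects.parquet")
>> writer <- ParquetFileWriter$create(
>>   schema = sch,
>>   sink = sink,
>>   properties = ParquetWriterProperties$create(names(sch))
>> )
>>
>> for (f in files) {
>>   tbl <- arrow_table(read_one_row(f), schema = sch)
>>   writer$WriteTable(tbl, chunk_size = 1)   # appends one row group per file
>> }
>>
>> writer$Close()   # finalises the parquet footer; the sink is closed separately
>> sink$close()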
>>
>> Does this solve your problem?
>>
>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>> wrote:
>>
>>> Hi arrow experts,
>>>
>>> I have what I think should be a standard problem, but I'm not seeing the
>>> correct solution.
>>>
>>> I have data in a nonstandard form (nifti neuroimaging files) that I can
>>> load into R and transform into a single-row dataframe (with about 30K
>>> columns). In a small example I can load about 80 of these into a single
>>> dataframe and save as feather or parquet without a problem. I'd like to
>>> do the same when I have thousands of files.
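>>>
>>> For the small case, what I do is roughly this (read_one_row() stands in
>>> for the nifti-to-one-row-dataframe step, and the paths are placeholders):
>>>
>>> library(arrow)
>>> library(dplyr)
>>>
>>> files <- list.files("nifti_dir", pattern = "\\.nii$", full.names = TRUE)
>>> wide <- bind_rows(lapply(files, read_one_row))   # ~80 rows x ~30K columns
>>> write_parquet(wide, "small_example.parquet")
>>> # write_feather(wide, "small_example.feather") works equally well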
>>>
>>> The approach of loading a collection (e.g. 10 files) into a dataframe,
>>> saving it with a Hive-standard name, and repeating does work, but it
>>> doesn't seem like the right way to do it.
>>>
>>> Is there a way to stream data, one row at a time, into a feather or
>>> parquet file?
>>> I've attempted to use write_feather with a FileOutputStream sink, but
>>> without luck.
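>>>
>>> The attempt looked roughly like this (file names are placeholders, and
>>> read_one_row() is the nifti loader mentioned above):
>>>
>>> library(arrow)
>>>
>>> sink <- FileOutputStream$create("all_subjects.feather")
>>> for (f in files) {
>>>   write_feather(read_one_row(f), sink)
>>> }
>>> sink$close()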
>>>
>>