The equivalent for Feather:
https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
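
In short, the class to look at is `RecordBatchFileWriter`, since Feather V2
is the Arrow IPC file format. A minimal sketch (the table and file names are
placeholders, not the exact gist contents):

    library(arrow)

    # tab1, tab2: hypothetical single-row Arrow Tables sharing one schema
    sink <- FileOutputStream$create("all_rows.feather")   # placeholder path
    writer <- RecordBatchFileWriter$create(sink, tab1$schema)
    writer$write_table(tab1)
    writer$write_table(tab2)
    writer$close()
    sink$close()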

Opening the file with `read_parquet()` reads it in as a tibble, so you'll
end up with the same RAM usage as a standard data frame. If you use
`read_parquet(<filename>, as_data_frame = FALSE)`, it's read in as an Arrow
Table instead, which will probably use less RAM. But `open_dataset()` is the
best option for not loading things into memory before they're needed: Arrow
Tables are in-memory, whereas an Arrow Dataset is a bit more like a database
connection.
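
For illustration, the three options look like this (file and column names
are placeholders):

    library(arrow)
    library(dplyr)

    # 1. Whole file materialised as a tibble (full RAM cost):
    df <- read_parquet("data.parquet")

    # 2. Materialised as an Arrow Table (in-memory, but often lighter):
    tab <- read_parquet("data.parquet", as_data_frame = FALSE)

    # 3. Opened lazily as a Dataset; rows are only read at collect():
    result <- open_dataset("data.parquet") %>%
      select(id, value) %>%   # 'id' and 'value' are placeholder columns
      collect()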

The error you get when calling `pivot_longer()` is because it isn't
implemented in arrow. It is implemented in duckdb, though, so you can use
`to_duckdb()` to pass the data there (zero-copy), call `pivot_longer()`, and
then use `to_arrow()` to pass the result back to arrow.
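
Something like this, assuming a placeholder `id` column:

    library(arrow)
    library(dplyr)
    library(tidyr)

    long <- open_dataset("data.parquet") %>%
      to_duckdb() %>%                  # zero-copy handoff to duckdb
      pivot_longer(-id,                # 'id' is a placeholder column
                   names_to = "variable",
                   values_to = "value") %>%
      to_arrow()                       # hand the result back to arrow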

`tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm
aware, so that might be a stumbling block, depending on exactly what output
you need at the end of your workflow.
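
If the reshaped data is small enough by that point, one workaround is to
`collect()` into an ordinary tibble right before nesting, e.g. (again with a
placeholder `id` column):

    library(arrow)
    library(dplyr)
    library(tidyr)

    nested <- open_dataset("data.parquet") %>%
      to_duckdb() %>%
      pivot_longer(-id, names_to = "variable", values_to = "value") %>%
      collect() %>%             # materialise as a tibble here
      nest(data = -variable)    # nest() runs on an ordinary data frame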


On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]> wrote:

> I have some basics working, now for the tricky questions.
>
> The workflow I'm hoping to duplicate is a fairly classic tidyverse
> approach to fitting many regression models:
>
> wide dataframe -> pivot_longer -> nest -> mutate to create a column
> containing fitted models
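>
> In code, the pattern is roughly (column names and the model formula are
> placeholders):
>
>     library(tidyverse)
>
>     models <- wide_df %>%   # wide_df: the wide data frame
>       pivot_longer(-id, names_to = "variable", values_to = "value") %>%
>       nest(data = -variable) %>%
>       mutate(fit = map(data, ~ lm(value ~ 1, data = .x)))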
>
> At the moment I have the wide frame sitting in Parquet format.
>
> The above approach does work if I open the Parquet file using
> "read_parquet", but I'm not sure that I'm actually saving RAM over a
> standard dataframe approach.
>
> If I place the parquet in a dataset and use "open_dataset", I have the
> following issue:
>
> Error in UseMethod("pivot_longer") : no applicable method for 'pivot_longer' 
> applied to an object of class "c('FileSystemDataset', 'Dataset', 
> 'ArrowObject', 'R6')"
> Any recommendations on attacking this?
>
> Thanks
>
> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
> wrote:
>
>> Perfect! Thank you for that.
>>
>> I had not found the ParquetFileWriter class. Is there an equivalent
>> feather class?
>>
>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>
>>> Hi Richard,
>>>
>>> It is possible - I've created an example in this gist showing how to
>>> loop through a list of files and write to a Parquet file one row at a time:
>>> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf.
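>>>
>>> In outline, the approach is something like this (the directory and the
>>> `load_one_row()` helper are placeholders for illustration, not the
>>> exact gist contents):
>>>
>>>     library(arrow)
>>>
>>>     files <- list.files("nifti_rows", full.names = TRUE)  # placeholder
>>>     # load_one_row(): hypothetical function returning a one-row data frame
>>>     first <- arrow_table(load_one_row(files[[1]]))
>>>     writer <- ParquetFileWriter$create(
>>>       schema = first$schema,
>>>       sink = FileOutputStream$create("all_rows.parquet")
>>>     )
>>>     for (f in files) {
>>>       writer$WriteTable(arrow_table(load_one_row(f)), chunk_size = 1)
>>>     }
>>>     writer$Close()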
>>>
>>> Does this solve your problem?
>>>
>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>> wrote:
>>>
>>>> Hi arrow experts,
>>>>
>>>> I have what I think should be a standard problem, but I'm not seeing
>>>> the correct solution.
>>>>
>>>> I have data in a nonstandard form (nifti neuroimaging files) that I can
>>>> load into R and transform into a single-row dataframe (with 30K
>>>> columns). In a small example I can load about 80 of these into a single
>>>> dataframe and save as Feather or Parquet without problem. I'd like to
>>>> handle the case where I have thousands of files.
>>>>
>>>> The approach of loading a collection (e.g. 10) into a dataframe, saving
>>>> with a Hive-style partition name, and repeating does work, but it
>>>> doesn't seem like the right way to do it.
>>>>
>>>> Is there a way to stream data, one row at a time, into a Feather or
>>>> Parquet file? I've attempted to use write_feather with a
>>>> FileOutputStream sink, but without luck.
>>>>
>>>
