Thanks again. Looks like my best option might be to keep the arrow side
simple - just use it for the chunked IO - and use standard tidyverse for
the rest.
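
Something like this sketch is what I mean (the directory name is a
placeholder):

    library(arrow)
    library(dplyr)

    ds <- open_dataset("wide_data/")
    reader <- Scanner$create(ds)$ToRecordBatchReader()

    while (!is.null(batch <- reader$read_next_batch())) {
      chunk <- as.data.frame(batch)   # one chunk in RAM at a time
      # ... standard tidyverse on `chunk` from here ...
    }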

One final thing - is it possible to set the feather compression type via
RecordBatchFileWriter?
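
(The high-level write_feather() does expose it, e.g.

    write_feather(df, "out.feather", compression = "zstd")

but I can't see an equivalent option on the writer class, hence the
question.)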


On Fri, Jul 28, 2023 at 6:21 PM Nic Crane <[email protected]> wrote:

> The equivalent for Feather:
> https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
>
> Opening the file using `read_parquet()` will read it in as a tibble, so
> you'll end up with the same RAM usage.  If you use
> `read_parquet(<filename>, as_data_frame = FALSE)`, then it reads it in as
> an Arrow Table, which will probably use less RAM but `open_dataset()` is
> the best option for not loading things into memory before they're needed
> (Arrow Tables are in-memory, but an Arrow Dataset is a bit more like a
> database connection).
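>
> In code, the three options look roughly like this (the filename is a
> placeholder):
>
>     library(arrow)
>
>     df  <- read_parquet("wide.parquet")                         # tibble, fully in RAM
>     tbl <- read_parquet("wide.parquet", as_data_frame = FALSE)  # Arrow Table, in RAM
>     ds  <- open_dataset("wide.parquet")                         # Dataset, scanned lazily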
>
> Your error upon calling `pivot_longer()` is because it's not implemented
> in arrow, but it's implemented in duckdb, so you can use `to_duckdb()` to
> pass the data there (zero-copy), then call `pivot_longer()` and then
> `to_arrow()` to pass it back to arrow.
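>
> A sketch of that round trip (column names are placeholders):
>
>     library(arrow)
>     library(dplyr)
>     library(tidyr)
>
>     open_dataset("wide.parquet") |>
>       to_duckdb() |>                 # zero-copy handoff to duckdb
>       pivot_longer(-id, names_to = "voxel", values_to = "value") |>
>       to_arrow()                     # hand the result back to arrow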
>
> `tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm
> aware, so that might be a stumbling block, depending on exactly what output
> you need at the end of your workflow.
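>
> If you only need the nesting at the very end, one option might be to
> collect() at that point and nest in R (a sketch; `longer` and `voxel`
> are placeholders):
>
>     longer |>             # the arrow/duckdb result from above
>       collect() |>        # materialise into R at the last possible moment
>       nest(data = -voxel)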
>
>
> On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]>
> wrote:
>
>> I have some basics working, now for the tricky questions.
>>
>> The workflow I'm hoping to duplicate is a fairly classic tidyverse
>> approach to fitting many regression models:
>>
>> wide dataframe -> pivot_longer -> nest -> mutate to create a column
>> containing fitted models
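>>
>> i.e. something like this (the column names are made up):
>>
>>     library(tidyverse)
>>
>>     wide |>
>>       pivot_longer(-c(id, age), names_to = "voxel", values_to = "value") |>
>>       nest(data = -voxel) |>
>>       mutate(fit = map(data, \(d) lm(value ~ age, data = d)))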
>>
>> At the moment I have the wide frame sitting in parquet format.
>>
>> The above approach does work if I open the parquet file using
>> "read_parquet", but I'm not sure I'm actually saving RAM over a
>> standard dataframe approach.
>>
>> If I place the parquet file in a dataset and use "open_dataset", I get
>> the following error:
>>
>> Error in UseMethod("pivot_longer") :
>>   no applicable method for 'pivot_longer' applied to an object of class
>>   "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
>>
>> Any recommendations on attacking this?
>>
>> Thanks
>>
>> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
>> wrote:
>>
>>> Perfect! Thank you for that.
>>>
>>> I had not found the ParquetFileWriter class. Is there an equivalent
>>> feather class?
>>>
>>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> It is possible - I've created an example in this gist showing how to
>>>> loop through a list of files and write to a Parquet file one row at a time:
>>>> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf.
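>>>>
>>>> In outline, the pattern is something like this (the schema and the
>>>> load_one_row() helper below are placeholders, not the gist's exact
>>>> code):
>>>>
>>>>     library(arrow)
>>>>
>>>>     sch <- schema(id = string(), v1 = float64())   # your ~30K columns
>>>>     sink <- FileOutputStream$create("all_rows.parquet")
>>>>     writer <- ParquetFileWriter$create(sch, sink)
>>>>
>>>>     for (f in files) {                             # files: your nifti paths
>>>>       row <- as_arrow_table(load_one_row(f))       # one single-row dataframe
>>>>       writer$WriteTable(row, chunk_size = 1L)
>>>>     }
>>>>     writer$Close()
>>>>     sink$close()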
>>>>
>>>> Does this solve your problem?
>>>>
>>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi arrow experts,
>>>>>
>>>>> I have what I think should be a standard problem, but I'm not seeing
>>>>> the correct solution.
>>>>>
>>>>> I have data in a nonstandard form (nifti neuroimaging files) that I
>>>>> can load into R and transform into a single-row dataframe (with ~30K
>>>>> columns). In a small example I can load about 80 of these into a single
>>>>> dataframe and save as feather or parquet without a problem. I'd like to
>>>>> handle the case where I have thousands.
>>>>>
>>>>> The approach of loading a collection (e.g. 10) into a dataframe,
>>>>> saving with a hive-style partition name, and repeating does work, but
>>>>> it doesn't seem like the right way to do it.
>>>>>
>>>>> Is there a way to stream data, one row at a time, into a feather or
>>>>> parquet file?
>>>>> I've attempted to use write_feather with a FileOutputStream sink,
>>>>> but without luck.
>>>>>
>>>>
