The equivalent for Feather: https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
Opening the file using `read_parquet()` will read it in as a tibble, so you'll end up with the same RAM usage as a standard data frame. If you use `read_parquet(<filename>, as_data_frame = FALSE)`, it is read in as an Arrow Table, which will probably use less RAM, but `open_dataset()` is the best option for not loading things into memory before they're needed (Arrow Tables are in-memory, whereas an Arrow Dataset is a bit more like a database connection).

Your error upon calling `pivot_longer()` is because it's not implemented in arrow, but it is implemented in duckdb, so you can use `to_duckdb()` to pass the data there (zero-copy), then call `pivot_longer()`, and then `to_arrow()` to pass it back to arrow.

`tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm aware, so that might be a stumbling block, depending on exactly what output you need at the end of your workflow.

On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]> wrote:

> I have some basics working, now for the tricky questions.
>
> The workflow I'm hoping to duplicate is a fairly classic tidyverse approach to
> fitting many regression models:
>
> wide dataframe -> pivot_longer -> nest -> mutate to create a column
> containing fitted models
>
> At the moment I have the wide frame sitting in parquet format.
>
> The above approach does function if I open the parquet file using
> "read_parquet", but I'm not sure that I'm actually saving RAM over a
> standard dataframe approach.
>
> If I place the parquet in a dataset and use "open_dataset", I have the
> following issue:
>
> Error in UseMethod("pivot_longer") : no applicable method for 'pivot_longer'
> applied to an object of class "c('FileSystemDataset', 'Dataset',
> 'ArrowObject', 'R6')"
>
> Any recommendations on attacking this?
>
> Thanks
>
> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
> wrote:
>
>> Perfect! Thank you for that.
>>
>> I had not found the ParquetFileWriter class. Is there an equivalent
>> feather class?
>>
>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>
>>> Hi Richard,
>>>
>>> It is possible - I've created an example in this gist showing how to
>>> loop through a list of files and write to a Parquet file one row at a time:
>>> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf.
>>>
>>> Does this solve your problem?
>>>
>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>> wrote:
>>>
>>>> Hi arrow experts,
>>>>
>>>> I have what I think should be a standard problem, but I'm not seeing
>>>> the correct solution.
>>>>
>>>> I have data in a nonstandard form (nifti neuroimaging files) that I can
>>>> load into R and transform into a single row dataframe (which is 30K
>>>> columns). In a small example I can load about 80 of these into a single
>>>> dataframe and save as feather or parquet without problem. I'd like to
>>>> address the problem where I have thousands.
>>>>
>>>> The approach of loading a collection (e.g. 10) into a dataframe and
>>>> saving with a hive standard name and repeating does work, but doesn't seem
>>>> like the right way to do it.
>>>>
>>>> Is there a way to stream data, one row at a time, into a feather or
>>>> parquet file?
>>>> I've attempted to use write_feather with a FileOutputStream sink,
>>>> but without luck
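
P.S. For reference, here is a rough sketch of the three ways of opening the file and the `to_duckdb()` -> `pivot_longer()` -> `to_arrow()` route described in the reply at the top of this message. It is only an illustration: the toy data frame, the column names (`id`, `v1`, `v2`, `voxel`, `value`) and the temporary file path are made up for the example, and it assumes the duckdb and dbplyr packages are installed alongside arrow, dplyr and tidyr.

```r
library(arrow)
library(dplyr)
library(tidyr)
# Assumption: the duckdb and dbplyr packages are installed;
# to_duckdb() and the lazy-table pivot_longer() method rely on them.

# Hypothetical stand-in for the wide frame (one row per image, one column per voxel)
wide <- data.frame(
  id = 1:3,
  v1 = c(0.1, 0.2, 0.3),
  v2 = c(1.1, 1.2, 1.3)
)
pq_file <- file.path(tempdir(), "wide_example.parquet")
write_parquet(wide, pq_file)

# 1. Default: materialised in R as a tibble
as_tibble_df <- read_parquet(pq_file)

# 2. Arrow Table: still in memory, but held by Arrow rather than as R vectors
as_table <- read_parquet(pq_file, as_data_frame = FALSE)

# 3. Dataset: nothing is read until a query is collected
ds <- open_dataset(pq_file)

# pivot_longer() has no method for a Dataset, so hand the data to duckdb,
# pivot there, then pass the result back to arrow and collect it into R
long <- ds |>
  to_duckdb() |>
  pivot_longer(cols = -id, names_to = "voxel", values_to = "value") |>
  to_arrow() |>
  collect()
```

The final `collect()` is only there so the result can be inspected; in a real workflow you would probably keep working on the lazy result for as long as possible and only collect what the model-fitting step needs. Row order coming back from duckdb should not be relied upon.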
