Thanks again. Looks like my best option might be to keep the arrow side simple - just use it for the chunked IO - and use standard tidyverse for the rest.
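
Roughly what I have in mind, as an untested sketch (the directory name, the batch/age/subject_id columns and the model formula are just placeholders for my real data):

library(arrow)
library(dplyr)
library(tidyr)
library(purrr)

# Sketch only: "wide_parquet_dir/", the batch/age/subject_id columns and
# the lm() formula are placeholders for the real data.
ds <- open_dataset("wide_parquet_dir/")

fits <- ds |>
  filter(batch == 1) |>        # arrow pushes this down and does the chunked IO
  collect() |>                 # ordinary tibble from here on
  pivot_longer(-c(subject_id, age),
               names_to = "voxel", values_to = "value") |>
  nest(data = -voxel) |>
  mutate(fit = map(data, ~ lm(value ~ age, data = .x)))

That way arrow only materialises one chunk at a time, and everything after collect() is plain tidyverse, so pivot_longer/nest/map behave as usual.
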
One final thing - is it possible to set the feather compression type via
RecordBatchFileWriter?

On Fri, Jul 28, 2023 at 6:21 PM Nic Crane <[email protected]> wrote:

> The equivalent for Feather:
> https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
>
> Opening the file using `read_parquet()` will read it in as a tibble, so
> you'll end up with the same RAM usage. If you use
> `read_parquet(<filename>, as_data_frame = FALSE)`, then it reads it in as
> an Arrow Table, which will probably use less RAM, but `open_dataset()` is
> the best option for not loading things into memory before they're needed
> (Arrow Tables are in-memory, but an Arrow Dataset is a bit more like a
> database connection).
>
> Your error upon calling `pivot_longer()` is because it's not implemented
> in arrow, but it's implemented in duckdb, so you can use `to_duckdb()` to
> pass the data there (zero-copy), then call `pivot_longer()` and then
> `to_arrow()` to pass it back to arrow.
>
> `tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm
> aware, so that might be a stumbling block, depending on exactly what
> output you need at the end of your workflow.
>
>
> On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]>
> wrote:
>
>> I have some basics working, now for the tricky questions.
>>
>> The workflow I'm hoping to duplicate is a fairly classic tidyverse
>> approach to fitting many regression models:
>>
>> wide dataframe -> pivot_longer -> nest -> mutate to create a column
>> containing fitted models
>>
>> At the moment I have the wide frame sitting in parquet format.
>>
>> The above approach does work if I open the parquet file using
>> "read_parquet", but I'm not sure that I'm actually saving RAM over a
>> standard dataframe approach.
>>
>> If I place the parquet in a dataset and use "open_dataset", I have the
>> following issue:
>>
>> Error in UseMethod("pivot_longer") : no applicable method for
>> 'pivot_longer' applied to an object of class "c('FileSystemDataset',
>> 'Dataset', 'ArrowObject', 'R6')"
>>
>> Any recommendations on attacking this?
>>
>> Thanks
>>
>> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
>> wrote:
>>
>>> Perfect! Thank you for that.
>>>
>>> I had not found the ParquetFileWriter class. Is there an equivalent
>>> feather class?
>>>
>>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>>
>>>> Hi Richard,
>>>>
>>>> It is possible - I've created an example in this gist showing how to
>>>> loop through a list of files and write to a Parquet file one row at a
>>>> time:
>>>> https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>>>>
>>>> Does this solve your problem?
>>>>
>>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi arrow experts,
>>>>>
>>>>> I have what I think should be a standard problem, but I'm not seeing
>>>>> the correct solution.
>>>>>
>>>>> I have data in a nonstandard form (nifti neuroimaging files) that I
>>>>> can load into R and transform into a single-row dataframe (which is
>>>>> 30K columns). In a small example I can load about 80 of these into a
>>>>> single dataframe and save as feather or parquet without problem. I'd
>>>>> like to address the problem where I have thousands.
>>>>>
>>>>> The approach of loading a collection (e.g. 10) into a dataframe and
>>>>> saving with a hive-standard name and repeating does work, but doesn't
>>>>> seem like the right way to do it.
>>>>>
>>>>> Is there a way to stream data, one row at a time, into a feather or
>>>>> parquet file?
>>>>> I've attempted to use write_feather with a FileOutputStream sink, but
>>>>> without luck.
