I'm pretty new to this area of the codebase myself, but it looks like this functionality isn't currently exposed in the R bindings.
In the C++ docs, it looks like the function you'd want is `arrow::ipc::MakeFileWriter()`[1]. This function allows you to create a RecordBatchWriter with extra options supplied, which is where you'd specify the compression. However, this isn't exposed in the R bindings - there, the writer is initialised with the default options[2]. Please do open an issue if it would be useful for us to expose this option, and we can take a look at doing so. In case they're useful in the meantime, I've put a couple of rough sketches at the bottom of this message: one of a row-at-a-time Feather writer, and one of the duckdb `pivot_longer()` round trip.

[1] https://arrow.apache.org/docs/cpp/api/ipc.html
[2] https://github.com/apache/arrow/blob/af23f6a2e8ece6211b087b0e4f24b9daaffbb8a9/r/src/recordbatchwriter.cpp#L47

On Sun, 30 Jul 2023 at 00:48, Richard Beare <[email protected]> wrote:

> Thanks again. Looks like my best option might be to keep the arrow side
> simple - just provide the chunking IO - and use standard tidyverse for the
> rest.
>
> One final thing - is it possible to set the feather compression type via
> RecordBatchFileWriter?
>
>
> On Fri, Jul 28, 2023 at 6:21 PM Nic Crane <[email protected]> wrote:
>
>> The equivalent for Feather:
>> https://gist.github.com/thisisnic/f73196d490a9f661269b5403292dddc3
>>
>> Opening the file using `read_parquet()` will read it in as a tibble, so
>> you'll end up with the same RAM usage. If you use
>> `read_parquet(<filename>, as_data_frame = FALSE)`, then it reads it in as
>> an Arrow Table, which will probably use less RAM, but `open_dataset()` is
>> the best option for not loading things into memory before they're needed
>> (Arrow Tables are in-memory, but an Arrow Dataset is a bit more like a
>> database connection).
>>
>> Your error upon calling `pivot_longer()` is because it's not implemented
>> in arrow, but it is implemented in duckdb, so you can use `to_duckdb()` to
>> pass the data there (zero-copy), then call `pivot_longer()` and then
>> `to_arrow()` to pass it back to arrow.
>>
>> `tidyr::nest()` isn't implemented in either arrow or duckdb as far as I'm
>> aware, so that might be a stumbling block, depending on exactly what output
>> you need at the end of your workflow.
>>
>>
>> On Fri, 28 Jul 2023 at 03:41, Richard Beare <[email protected]>
>> wrote:
>>
>>> I have some basics working, now for the tricky questions.
>>>
>>> The workflow I'm hoping to duplicate is a fairly classic tidyverse
>>> approach to fitting many regression models:
>>>
>>> wide dataframe -> pivot_longer -> nest -> mutate to create a column
>>> containing fitted models
>>>
>>> At the moment I have the wide frame sitting in parquet format.
>>>
>>> The above approach does function if I open the parquet file using
>>> "read_parquet", but I'm not sure that I'm actually saving RAM over a
>>> standard dataframe approach.
>>>
>>> If I place the parquet in a dataset and use "open_dataset", I have the
>>> following issue:
>>>
>>> Error in UseMethod("pivot_longer") : no applicable method for
>>> 'pivot_longer' applied to an object of class "c('FileSystemDataset',
>>> 'Dataset', 'ArrowObject', 'R6')"
>>>
>>> Any recommendations on attacking this?
>>>
>>> Thanks
>>>
>>> On Fri, Jul 28, 2023 at 10:29 AM Richard Beare <[email protected]>
>>> wrote:
>>>
>>>> Perfect! Thank you for that.
>>>>
>>>> I had not found the ParquetFileWriter class. Is there an equivalent
>>>> feather class?
>>>>
>>>> On Fri, Jul 28, 2023 at 7:59 AM Nic Crane <[email protected]> wrote:
>>>>
>>>>> Hi Richard,
>>>>>
>>>>> It is possible - I've created an example in this gist showing how to
>>>>> loop through a list of files and write to a Parquet file one row at a
>>>>> time: https://gist.github.com/thisisnic/5bdb85d2742bc318433f2f14b8bd77cf
>>>>>
>>>>> Does this solve your problem?
>>>>>
>>>>> On Thu, 27 Jul 2023 at 12:22, Richard Beare <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi arrow experts,
>>>>>>
>>>>>> I have what I think should be a standard problem, but I'm not seeing
>>>>>> the correct solution.
>>>>>>
>>>>>> I have data in a nonstandard form (nifti neuroimaging files) that I
>>>>>> can load into R and transform into a single-row dataframe (which is 30K
>>>>>> columns). In a small example I can load about 80 of these into a single
>>>>>> dataframe and save as feather or parquet without problem. I'd like to
>>>>>> address the problem where I have thousands.
>>>>>>
>>>>>> The approach of loading a collection (e.g. 10) into a dataframe,
>>>>>> saving with a hive-standard name, and repeating does work, but doesn't
>>>>>> seem like the right way to do it.
>>>>>>
>>>>>> Is there a way to stream data, one row at a time, into a feather or
>>>>>> parquet file? I've attempted to use write_feather with a
>>>>>> FileOutputStream sink, but without luck.
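
P.S. Here's the first rough, untested sketch mentioned above: the row-at-a-time loop on the Feather (IPC) side, using RecordBatchFileWriter. `load_one_nifti()` is just a placeholder for whatever turns one of your files into a one-row data frame, and the file paths are made up:

    library(arrow)

    files <- list.files("nifti_frames", full.names = TRUE)   # made-up directory

    # load_one_nifti() stands in for your own loader that returns a one-row
    # data frame; build the schema from the first file so that every batch
    # written afterwards matches it
    first_batch <- record_batch(load_one_nifti(files[1]))
    sch <- first_batch$schema

    sink <- FileOutputStream$create("all_subjects.arrow")
    writer <- RecordBatchFileWriter$create(sink, sch)
    writer$write(first_batch)

    for (f in files[-1]) {
      writer$write(record_batch(load_one_nifti(f), schema = sch))   # append one row at a time
    }

    writer$close()
    sink$close()

The compression caveat above applies here: until the option is exposed, this writer uses the default IPC options. The same pattern with the ParquetFileWriter class covers the Parquet side.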

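And for the earlier `pivot_longer()` question, a minimal sketch of the duckdb round trip - the dataset path and column names are invented, so adjust them to your data:

    library(arrow)
    library(dplyr)
    library(tidyr)

    ds <- open_dataset("wide_parquet/")            # invented path to the wide data

    long <- ds |>
      to_duckdb() |>                               # hand the data over to duckdb
      pivot_longer(
        cols = starts_with("voxel_"),              # invented column naming
        names_to = "voxel",
        values_to = "intensity"
      ) |>
      to_arrow()                                   # hand the result back to arrow

    # nothing lands in an R data frame until you collect()
    long_df <- collect(long)

As mentioned earlier in the thread, `tidyr::nest()` won't translate in either engine, so that step will probably need to happen after the `collect()`.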