Thank you for your interest! As you are discovering, it can be a slightly tricky topic. Disclaimer: I am neither an R expert nor a Windows expert, but I have done some work on the underlying C++ feature.

Some of the limitations you are running into are probably bugs and not intentional limitations. In particular, there was a regression in 4.0.0 and 5.0.0 [1] which meant that a slow consumer of data (writing a dataset is typically a slow consumer) could lead to Arrow using too much RAM (in such a case it needs to back off on the read, and it was not doing so) and eventually crashing out of memory. This issue has recently been fixed and, using a very recent build, I confirmed that I was able to read a 30GB CSV dataset and write it to parquet. The peak RAM usage of the Arrow process was 1GB. Note, this was on Linux and not Windows, but that shouldn't matter.

Some of this may be due to the way that the OS handles disk I/O [2]. As users, we are typically used to smallish disk writes (e.g. a couple of GB or less) and, from a user perspective, these writes are often non-blocking and very fast. In reality, the write only pushes the data into the OS' disk cache and then returns as soon as that memcpy is done. The actual write to the physical disk happens behind the scenes (and can even happen outside the lifetime of the process). By default, this disk cache is (I think; I'm not sure about Windows) allowed to consume all available RAM. Once that happens, additional writes (and possibly regular allocations) will be slowed down while the OS waits for the data to be persisted to disk so that the RAM can be reused. At the moment, Arrow does nothing to prevent this OS cache from filling up. This may be something we can investigate in future releases; it is an interesting question what the best behavior is.

> When working with dplyr & datasets, are there parameters that determine
> whether operations can be performed in a streaming/iterative form that is
> needed when data is much bigger than memory?

There are a few operations which are not implemented by Arrow, and I believe that when this situation is encountered Arrow will load the entire dataset into memory and apply the operation in R. This would definitely be a bad thing for your goals. I'm not sure of the exact details of which operations trigger this, but the basic select / rename operations you have should be ok. There are also a few operations which are implemented in Arrow but will force the data to be buffered in memory. These are arrange (or anything ordering data, like top-k) and join.
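To make that distinction concrete, here is a minimal sketch based on your nyc-taxi example (the paths and columns are taken from your script; this assumes arrange() on a Dataset query is available in your build, and I haven't benchmarked these exact calls):

```
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# select() / rename() become a streaming projection, so a repartitioning
# pipeline like this should not need to hold the full dataset in memory.
ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))

# An ordering step (or a join) has to see all of the data before it can
# emit any of it, so a query like this will buffer the dataset in memory
# when it is evaluated:
sorted <- ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  arrange(pickup_at)
```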
> I wasn't expecting write_dataset to continue consuming memory when finished.
> I don't think gc() or pryr functions are able to clear or measure memory used
> by Arrow. Are there different tools I should be using here? Maybe I need to
> be telling Arrow to limit usage somehow?

After write_dataset is finished, the process should not be holding on to any memory. If it is, then that is a bug. However, the OS may still be holding onto data in the disk cache that is waiting to be flushed to disk. A good quick test is to check arrow::default_memory_pool()$bytes_allocated, which reports how much memory Arrow believes it is using. If it is 0, then that is a good bet (though by no means a guarantee) that anything the Arrow library has allocated has been released. On Windows, the program RAMMap [3] might give you some more information on how much data is sitting in the disk cache.
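For example, a quick check right after the write returns might look like this (a minimal sketch; bytes_allocated is the field mentioned above, and the max_memory high-water mark is, if I remember correctly, also exposed, but treat that part as an assumption):

```
library(arrow)

# How much memory Arrow's default memory pool is currently holding.
# If this is 0 (or close to it) then Arrow has released its allocations,
# and any RAM still shown in Task Manager is likely the OS disk cache.
arrow::default_memory_pool()$bytes_allocated

# If your build exposes it, the peak allocation seen by the pool, which
# gives a rough sense of how much Arrow actually used during the write.
arrow::default_memory_pool()$max_memory
```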
> The current documentation for write_dataset says you can't rename while
> writing -- in my experience this did work. Is the reason for this note that
> in order to rename, Arrow will change the dataset to an in-memory Table?
> Based on my test, the memory usage didn't seem less, but this was one of my
> theories of what was going on.

The note here is quite old and the functions it describes have changed a lot in the last year. My guess is that it is a relic from the time when dplyr functions were handled differently. Maybe someone else can chime in to verify. From what I can tell, a rename in R is translated into a select, which is translated into a project in the C++ layer. A project operation should be able to operate in a streaming fashion and will not force the data to be buffered.

In summary, out-of-core processing is something I think many of us are interested in and want to support. Basic out-of-core manipulation (repartitioning, transforming from one format to another) should be pretty well supported in 6.0.0, but it might consume all available OS RAM as disk cache. Out-of-core work in general is still getting started, and you will hopefully continue to see improvements as we work on it. For example, I hope future releases will be able to support out-of-core joins and ordering statements by spilling to disk.

[1] https://issues.apache.org/jira/browse/ARROW-13611
[2] https://docs.microsoft.com/en-us/windows/win32/fileio/file-caching
[3] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap

On Mon, Oct 18, 2021 at 11:34 AM Jameel Alsalam <[email protected]> wrote:
>
> Hello,
>
> I am a (learning) user of the Arrow R package on Windows. I am currently
> focused on using Arrow to do data preparation on a bigger-than-my-memory set
> of csv files, transforming them into parquet files for further analysis with
> DuckDB. I have about 600 csv files, totaling about 200 GB, which had been
> dumped out of a database. I've had luck doing some of this, but for the
> biggest table I am struggling with understanding when Arrow may fill memory
> and grind to a halt, versus when I should expect that Arrow can iterate
> through.
>
> For reproducibility purposes, I did some work with the nyc-taxi dataset
> down below. These examples do not fill my memory, but they do use up more
> than I expected, and I don't know how to free it without restarting the R
> session.
>
> My questions:
> 1) When working with dplyr & datasets, are there parameters that determine
> whether operations can be performed in a streaming/iterative form that is
> needed when data is much bigger than memory?
> 2) I wasn't expecting write_dataset to continue consuming memory when
> finished. I don't think gc() or pryr functions are able to clear or measure
> memory used by Arrow. Are there different tools I should be using here? Maybe
> I need to be telling Arrow to limit usage somehow?
> 3) The current documentation for write_dataset says you can't rename while
> writing -- in my experience this did work. Is the reason for this note that
> in order to rename, Arrow will change the dataset to an in-memory Table?
> Based on my test, the memory usage didn't seem less, but this was one of my
> theories of what was going on.
>
> thanks,
> Jameel
>
> ```
> #### Read dataset -> write dataset ---------
>
> library(tidyverse)
> library(arrow)
> library(duckdb)
>
> # Do I understand the limitations of out-of-memory dataset manipulations?
>
> packageVersion("arrow")
> # [1] ‘5.0.0.20211016’
>
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
>
> # The documentation for write_dataset says you can't rename in the process
> # of writing. In @param dataset:
> # "Note that select()-ed columns may not be renamed."
>
> ds %>%
>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>   rename(
>     pickup_dttm = pickup_at,
>     dropoff_dttm = dropoff_at
>   ) %>%
>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
>
> # Starting memory usage: 420 MB (task manager - RStudio/R)
> # Ending memory usage: 12,100 MB (task manager - RStudio/R)
>
> # It does _work_, but a lot more memory is used. Task manager sees the memory
> # as used by the RStudio session, but RStudio sees the memory as used by the
> # system. I am assuming it is Arrow, but I'm not sure how to control this,
> # as e.g., there is no gc() for Arrow.
>
> # RESTART R SESSION HERE TO RECOVER MEMORY
>
> # It's possible that out-of-memory dataset operations can't use rename.
>
> # If you do not rename, and only select:
> ds %>%
>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
>
> # Starting memory usage: 425 MB (task manager - for RStudio/R)
> # End usage: 10,600 MB (task manager - for RStudio/R)
> ```
>
>
>
> --
> Jameel Alsalam
> (510) 717-9637
> [email protected]
