Weston: Thank you for the detailed reply. This gives me a much better sense of where to keep looking, and of where the behavior I'm seeing could be a bug. I tried `arrow::default_memory_pool()$bytes_allocated` and I get 1048 both before and after write_dataset, which seems to confirm that the memory is not being allocated by Arrow itself. I will keep investigating; this hint has also shown me that more of the underlying C++ functionality is accessible from R than I realized. I will certainly follow continuing progress in this area.
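For reference, a minimal sketch of the kind of check described above; the small built-in data frame and temporary output path are stand-ins, not the real dataset from the thread:

```r
# Check how much memory Arrow itself reports holding around a dataset write.
# mtcars and the temporary directory are hypothetical stand-ins.
library(arrow)

out_dir <- tempfile("arrow-write-test")

# Bytes Arrow believes it has allocated before the write
before <- default_memory_pool()$bytes_allocated

write_dataset(mtcars, out_dir, format = "parquet")

# Bytes still held by Arrow after the write; a value close to "before"
# suggests Arrow has released its own allocations, even if the OS still
# reports high memory use (e.g. data sitting in the disk cache).
after <- default_memory_pool()$bytes_allocated
c(before = before, after = after)
```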
Benson: Thanks for this idea. A bigger machine is certainly a possibility, though for the time being I need to make do.

best,
Jameel

On Tue, Oct 19, 2021 at 2:03 AM Benson Muite <[email protected]> wrote:

> Might you be able to run your analysis in batch scripts [1] rather than using RStudio? If so, this might make it easier to run in the cloud, where large shared-memory compute instances (up to 350 GB) are available. The script you have for the NYC taxi dataset should run using batch processing.
>
> Using batch scripts might also make it easier to use R with distributed memory.
>
> [1] https://rdrr.io/r/utils/BATCH.html
>
> On 10/19/21 8:13 AM, Weston Pace wrote:
> > Thank you for your interest! As you are discovering, it can be a slightly tricky topic. Disclaimer: I am neither an R expert nor a Windows expert, but I have worked some on the underlying C++ feature.
> >
> > Some of the limitations you are running into are probably bugs and not intentional limitations. In particular, there was a regression in 4.0.0 and 5.0.0 [1] which meant that a slow consumer of data (writing a dataset is typically a slow consumer) could lead to Arrow using too much RAM (in such a case it needs to back off on the read, and it was not doing this) and eventually crashing out of memory. This issue has recently been fixed and, using a very recent build, I confirmed that I was able to read a 30GB CSV dataset and write it to Parquet. The peak RAM usage of the Arrow process was 1GB. Note, this was on Linux and not Windows, but that shouldn't matter.
> >
> > Some of this may be due to the way the OS handles disk I/O [2]. As users, we are typically used to smallish disk writes (e.g. a couple of GB or less) and, from a user perspective, these writes are often non-blocking and very fast. In reality the write is only pushing the data into the OS's disk cache and then returning as soon as that memcpy is done. The actual write to the physical disk happens behind the scenes (and can even happen outside the lifetime of the process). By default, this disk cache is (I think; I'm not sure for Windows) allowed to consume all available RAM. Once that happens, additional writes (and possibly regular allocations) will be slowed down while the OS waits for the data to be persisted to disk so that the RAM can be reused.
> >
> > At the moment, Arrow does nothing to prevent this OS cache from filling up. This may be something we can investigate in future releases; it is an interesting question what the best behavior is.
> >
> >> When working with dplyr & datasets, are there parameters that determine whether operations can be performed in a streaming/iterative form that is needed when data is much bigger than memory?
> >
> > There are a few operations which are not implemented by Arrow, and I believe when this situation is encountered Arrow will load the entire dataset into memory and apply the operation in R. This would definitely be a bad thing for your goals. I'm not sure of the exact details of which operations will trigger this, but the basic select / rename operations you have should be OK.
> >
> > There are also a few operations which are implemented in Arrow but will force the data to be buffered in memory. These are arrange (or anything ordering data, like top-k) and join.
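To illustrate the distinction drawn above, here is a rough sketch contrasting a pipeline Arrow can stream with one that buffers. It reuses the nyc-taxi dataset layout from later in the thread; the output folder names and the use of compute() to materialize the sorted result are assumptions for illustration, not taken from the thread:

```r
# Sketch: streaming projection vs. an operation that buffers in memory.
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# select()/rename() become a streaming projection in the C++ layer, so this
# write should not need to hold the whole dataset in memory at once.
ds %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  rename(pickup_dttm = pickup_at) %>%
  write_dataset("nyc-taxi-projected", partitioning = c("year", "month"))

# arrange() (like joins) is handled by Arrow but forces the data to be
# buffered, so on a bigger-than-memory dataset a sort like this can exhaust
# RAM; compute() materializes the result as an in-memory Arrow Table.
sorted_tbl <- ds %>%
  select(vendor_id, pickup_at, year, month) %>%
  arrange(pickup_at) %>%
  compute()
```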
> >> I wasn't expecting write_dataset to continue consuming memory when finished. I don't think gc() or pryr functions are able to clear or measure memory used by Arrow. Are there different tools I should be using here? Maybe I need to be telling Arrow to limit usage somehow?
> >
> > After write_dataset is finished, the process should not be holding on to any memory. If it is doing so, then that is a bug. However, the OS may still be holding onto data that is in the disk cache waiting to be flushed to disk. A good quick test is to check "arrow::default_memory_pool()$bytes_allocated". This will report how much memory Arrow believes it is using. If this is 0, then that is a good bet (though by no means a guarantee) that anything the Arrow system library has allocated has been released. On Windows, the program RAMMap [3] might give you some more information on how much data is in the disk cache.
> >
> >> The current documentation for write_dataset says you can't rename while writing -- in my experience this did work. Is the reason for this note that in order to rename, Arrow will change the dataset to an in-memory Table? Based on my test, the memory usage didn't seem less, but this was one of my theories of what was going on.
> >
> > The note here is quite old, and the functions it describes have changed a lot in the last year. My guess is this is a relic from the time that dplyr functions were handled differently. Maybe someone else can chime in to verify. From what I can tell, a rename in R is translated into a select, which is translated into a project in the C++ layer. A project operation should be able to operate in a streaming fashion and will not force the data to buffer.
> >
> > In summary, out-of-core processing is something I think many of us are interested in and want to support. Basic out-of-core manipulation (repartitioning, transforming from one format to another) should be pretty well supported in 6.0.0, but it might consume all available OS RAM as disk cache. Out-of-core work in general is still getting started, and you will hopefully continue to see improvements as we work on it. For example, I hope future releases will be able to support out-of-core joins and ordering statements by spilling to disk.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-13611
> > [2] https://docs.microsoft.com/en-us/windows/win32/fileio/file-caching
> > [3] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap
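The out-of-core format conversion described above (CSV in, repartitioned Parquet out) might look roughly like the following sketch; the folder names and partitioning columns are hypothetical stand-ins, not taken from the thread:

```r
# Rough sketch of a bigger-than-memory CSV -> Parquet conversion.
# "csv-dumps", "parquet-out", and the partitioning columns are hypothetical.
library(arrow)
library(dplyr)

csv_ds <- open_dataset("csv-dumps", format = "csv")

csv_ds %>%
  write_dataset(
    "parquet-out",
    format = "parquet",
    partitioning = c("year", "month")  # assumes these columns exist in the CSVs
  )
```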
> > On Mon, Oct 18, 2021 at 11:34 AM Jameel Alsalam <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> I am a (learning) user of the Arrow R package on Windows. I am currently focused on using Arrow to do data preparation on a bigger-than-my-memory set of CSV files, transforming them into Parquet files for further analysis with DuckDB. I have about 600 CSV files, totaling about 200 GB, which had been dumped out of a database. I've had luck doing some of this, but for the biggest table I am struggling to understand when Arrow may fill memory and grind to a halt, versus when I should expect that Arrow can iterate through.
> >>
> >> For reproducibility purposes, I did some work with the nyc-taxi dataset below. These examples do not fill my memory, but they do use more than I expected, and I don't know how to free it without restarting the R session.
> >>
> >> My questions:
> >> 1) When working with dplyr & datasets, are there parameters that determine whether operations can be performed in a streaming/iterative form, which is needed when data is much bigger than memory?
> >> 2) I wasn't expecting write_dataset to continue consuming memory when finished. I don't think gc() or pryr functions are able to clear or measure memory used by Arrow. Are there different tools I should be using here? Maybe I need to be telling Arrow to limit usage somehow?
> >> 3) The current documentation for write_dataset says you can't rename while writing -- in my experience this did work. Is the reason for this note that in order to rename, Arrow will change the dataset to an in-memory Table? Based on my test, the memory usage didn't seem less, but this was one of my theories of what was going on.
> >>
> >> thanks,
> >> Jameel
> >>
> >> ```
> >> #### Read dataset -> write dataset ---------
> >>
> >> library(tidyverse)
> >> library(arrow)
> >> library(duckdb)
> >>
> >> # Do I understand the limitations of out-of-memory dataset manipulations?
> >>
> >> packageVersion("arrow")
> >> # [1] ‘5.0.0.20211016’
> >>
> >> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> >>
> >> # The documentation for write_dataset says you can't rename in the process of writing.
> >> # In @param dataset:
> >> # "Note that select()-ed columns may not be renamed."
> >>
> >> ds %>%
> >>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
> >>   rename(
> >>     pickup_dttm = pickup_at,
> >>     dropoff_dttm = dropoff_at
> >>   ) %>%
> >>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
> >>
> >> # Starting memory usage: 420 MB (Task Manager - RStudio/R)
> >> # Ending memory usage: 12,100 MB (Task Manager - RStudio/R)
> >>
> >> # It does _work_, but a lot more memory is used. Task Manager sees the memory as
> >> # used by the RStudio session, but RStudio sees the memory as used by the system.
> >> # I am assuming it is Arrow, but I'm not sure how to control this, as e.g. there
> >> # is no gc() for Arrow.
> >>
> >> # RESTART R SESSION HERE TO RECOVER MEMORY
> >>
> >> # It's possible that out-of-memory dataset operations can't use rename.
> >>
> >> # If you do not rename, and only select:
> >> ds %>%
> >>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
> >>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
> >>
> >> # Starting memory usage: 425 MB (Task Manager - RStudio/R)
> >> # Ending memory usage: 10,600 MB (Task Manager - RStudio/R)
> >> ```
> >>
> >> --
> >> Jameel Alsalam
> >> (510) 717-9637
> >> [email protected]

--
Jameel Alsalam
(510) 717-9637
[email protected]
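Benson's batch-script suggestion earlier in the thread could be applied to a pipeline like the one above; a minimal sketch, where convert.R is a hypothetical file name:

```r
# Contents of a hypothetical convert.R, run non-interactively instead of
# inside an RStudio session, e.g. with
#   Rscript convert.R
# or
#   R CMD BATCH convert.R convert.Rout
library(arrow)
library(dplyr)

open_dataset("nyc-taxi", partitioning = c("year", "month")) %>%
  select(vendor_id, pickup_at, dropoff_at, year, month) %>%
  write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
```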
