One other note: in 6.0.0 (scheduled to be released in a matter of days), you'll be able to query Arrow Datasets directly with DuckDB, without having to first write to disk and load the files into DuckDB.
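Roughly, the flow could look like this (only a sketch, not tested against the release; the `to_duckdb()` helper and the column/path names are my assumptions here, borrowed from the nyc-taxi example below):

```
library(arrow)
library(dplyr)
library(duckdb)

# Open the on-disk files as an Arrow Dataset; nothing is read into memory yet.
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# Hand the Dataset to DuckDB as a virtual table and query it lazily.
# Only the collected result is materialized in R.
ds %>%
  to_duckdb() %>%
  filter(year == 2019) %>%
  count(vendor_id) %>%
  collect()
```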
Neal

On Tue, Oct 19, 2021 at 8:40 AM Jameel Alsalam <[email protected]> wrote:

> Weston: Thank you for the detailed reply. This gives me a much better sense of where to keep looking, and where the behavior I'm seeing could be a bug. I have tried `arrow::default_memory_pool()$bytes_allocated` and I get 1048 both before and after write_dataset, which seems to confirm that it is not Arrow memory allocation. I will keep investigating, and this hint has given me a clue that more of the C++ functionality is accessible from R than I realized. And I will certainly follow continuing progress in this area.
>
> Benson: Thanks for this idea. A bigger machine is certainly a possibility, though for the time being I need to make do.
>
> best,
> Jameel
>
> On Tue, Oct 19, 2021 at 2:03 AM Benson Muite <[email protected]> wrote:
>
>> Might you be able to run your analysis in batch scripts [1] rather than using RStudio? If so, this might make it easier to run in the cloud, where large shared-memory compute instances (up to 350 GB) are available. The script you have for the NYC taxi dataset should run using batch processing.
>>
>> Using batch scripts might also make it easier to use R with distributed memory.
>>
>> [1] https://rdrr.io/r/utils/BATCH.html
>>
>> On 10/19/21 8:13 AM, Weston Pace wrote:
>> > Thank you for your interest! As you are discovering, it can be a slightly tricky topic. Disclaimer: I am neither an R expert nor a Windows expert, but I have worked some on the underlying C++ feature.
>> >
>> > Some of the limitations you are running into are probably bugs and not intentional limitations. In particular, there was a regression in 4.0.0 and 5.0.0 [1] which meant that a slow consumer of data (writing a dataset is typically a slow consumer) could lead to Arrow using too much RAM (in such a case it needs to back off on the read, and it was not doing this) and eventually crashing out of memory. This issue has recently been fixed and, using a very recent build, I confirmed that I was able to read a 30 GB CSV dataset and write it to Parquet. The peak RAM usage of the Arrow process was 1 GB. Note, this was on Linux and not Windows, but that shouldn't matter.
>> >
>> > Some of this may be due to the way the OS handles disk I/O [2]. As users, we are typically used to smallish disk writes (e.g. a couple of GB or less) and, from a user perspective, these writes are often non-blocking and very fast. In reality, the write only pushes the data into the OS's disk cache and returns as soon as that memcpy is done. The actual write to the physical disk happens behind the scenes (and can even happen outside the lifetime of the process). By default, this disk cache is (I think; I'm not sure for Windows) allowed to consume all available RAM. Once that happens, additional writes (and possibly regular allocations) will be slowed down while the OS waits for the data to be persisted to disk so that the RAM can be reused.
>> >
>> > At the moment, Arrow does nothing to prevent this OS cache from filling up. This may be something we can investigate in future releases; it is an interesting question what the best behavior is.
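>> >
>> > For what it's worth, a quick way to tell the two apart (memory held by Arrow itself vs. data sitting in the OS disk cache) is to check the memory pool around the write. This is only a rough, untested sketch using the nyc-taxi paths from this thread:
>> >
>> > ```
>> > library(arrow)
>> > library(dplyr)
>> >
>> > ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
>> >
>> > arrow::default_memory_pool()$bytes_allocated   # before the write
>> >
>> > ds %>%
>> >   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>> >   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
>> >
>> > # If this is back near zero, Arrow has released what it allocated, and any
>> > # remaining RAM use reported by Task Manager is likely the OS disk cache.
>> > arrow::default_memory_pool()$bytes_allocated   # after the write
>> > ```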
>> >
>> >> When working with dplyr & datasets, are there parameters that determine whether operations can be performed in a streaming/iterative form, which is needed when data is much bigger than memory?
>> >
>> > There are a few operations which are not implemented by Arrow, and I believe when this situation is encountered Arrow will load the entire dataset into memory and apply the operation in R. This would definitely be a bad thing for your goals. I'm not sure of the exact details of which operations will trigger this, but the basic select / rename operations you have should be ok.
>> >
>> > There are also a few operations which are implemented in Arrow but will force the data to be buffered in memory. These are arrange (or anything ordering data, like top-k) and join.
>> >
>> >> I wasn't expecting write_dataset to continue consuming memory when finished. I don't think gc() or pryr functions are able to clear or measure memory used by Arrow. Are there different tools I should be using here? Maybe I need to be telling Arrow to limit usage somehow?
>> >
>> > After write_dataset is finished, the process should not be holding on to any memory. If it is doing so then that is a bug. However, the OS may still be holding onto data in the disk cache waiting to be flushed to disk. A good quick test is to check `arrow::default_memory_pool()$bytes_allocated`. This will report how much memory Arrow believes it is using. If this is 0, then that is a good bet (though by no means a guarantee) that anything the Arrow library has allocated has been released. On Windows, the RAMMap program [3] might give you some more information on how much data is in the disk cache.
>> >
>> >> The current documentation for write_dataset says you can't rename while writing -- in my experience this did work. Is the reason for this note that in order to rename, Arrow will change the dataset to an in-memory Table? Based on my test, the memory usage didn't seem less, but this was one of my theories of what was going on.
>> >
>> > The note here is quite old and the functions it describes have changed a lot in the last year. My guess is this is a relic from the time when dplyr functions were handled differently. Maybe someone else can chime in to verify. From what I can tell, a rename in R is translated into a select, which is translated into a project in the C++ layer. A project operation should be able to operate in a streaming fashion and will not force the memory to buffer.
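>> >
>> > To make that concrete, here is a rough, untested sketch (the output paths are just placeholders) of a pipeline that should be able to stream versus one that will buffer:
>> >
>> > ```
>> > library(arrow)
>> > library(dplyr)
>> >
>> > ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
>> >
>> > # select/rename become a projection in the C++ layer, so this should stream
>> > # through write_dataset batch by batch without buffering the whole dataset.
>> > ds %>%
>> >   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>> >   rename(pickup_dttm = pickup_at, dropoff_dttm = dropoff_at) %>%
>> >   write_dataset("nyc-taxi-renamed", partitioning = c("year", "month"))
>> >
>> > # arrange (like join) has to see every row before it can emit any, so expect
>> > # it to buffer the data in memory.
>> > ds %>%
>> >   select(vendor_id, pickup_at) %>%
>> >   arrange(pickup_at) %>%
>> >   write_dataset("nyc-taxi-sorted")
>> > ```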
>> >
>> > In summary, out-of-core processing is something I think many of us are interested in and want to support. Basic out-of-core manipulation (repartitioning, transforming from one format to another) should be pretty well supported in 6.0.0, but it might consume all available OS RAM as disk cache. Out-of-core work in general is still getting started, and you will hopefully continue to see improvements as we work on it. For example, I hope future releases will be able to support out-of-core joins and ordering statements by spilling to disk.
>> >
>> > [1] https://issues.apache.org/jira/browse/ARROW-13611
>> > [2] https://docs.microsoft.com/en-us/windows/win32/fileio/file-caching
>> > [3] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap
>> >
>> > On Mon, Oct 18, 2021 at 11:34 AM Jameel Alsalam <[email protected]> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I am a (learning) user of the Arrow R package on Windows. I am currently focused on using Arrow to do data preparation on a bigger-than-my-memory set of csv files, transforming them into parquet files for further analysis with DuckDB. I have about 600 csv files, totaling about 200 GB, which had been dumped out of a database. I've had luck doing some of this, but for the biggest table I am struggling to understand when Arrow may fill memory and grind to a halt, versus when I should expect that Arrow can iterate through.
>> >>
>> >> For reproducibility purposes, I did some work with the nyc-taxi dataset, shown below. These examples do not fill my memory, but they do use up more than I expected, and I don't know how to free it without restarting the R session.
>> >>
>> >> My questions:
>> >> 1) When working with dplyr & datasets, are there parameters that determine whether operations can be performed in a streaming/iterative form, which is needed when data is much bigger than memory?
>> >> 2) I wasn't expecting write_dataset to continue consuming memory when finished. I don't think gc() or pryr functions are able to clear or measure memory used by Arrow. Are there different tools I should be using here? Maybe I need to be telling Arrow to limit usage somehow?
>> >> 3) The current documentation for write_dataset says you can't rename while writing -- in my experience this did work. Is the reason for this note that in order to rename, Arrow will change the dataset to an in-memory Table? Based on my test, the memory usage didn't seem less, but this was one of my theories of what was going on.
>> >>
>> >> thanks,
>> >> Jameel
>> >>
>> >> ```
>> >> #### Read dataset -> write dataset ---------
>> >>
>> >> library(tidyverse)
>> >> library(arrow)
>> >> library(duckdb)
>> >>
>> >> # Do I understand the limitations of out-of-memory dataset manipulations?
>> >>
>> >> packageVersion("arrow")
>> >> # [1] ‘5.0.0.20211016’
>> >>
>> >> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
>> >>
>> >> # The documentation for write_dataset says you can't rename in the process of writing.
>> >> # In @param dataset:
>> >> # "Note that select()-ed columns may not be renamed."
>> >>
>> >> ds %>%
>> >>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>> >>   rename(
>> >>     pickup_dttm = pickup_at,
>> >>     dropoff_dttm = dropoff_at
>> >>   ) %>%
>> >>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
>> >>
>> >> # Starting memory usage: 420 MB (Task Manager - RStudio/R)
>> >> # Ending memory usage: 12,100 MB (Task Manager - RStudio/R)
>> >>
>> >> # It does _work_, but a lot more memory is used. Task Manager sees the memory as
>> >> # used by the RStudio session, but RStudio sees the memory as used by the system.
>> >> # I am assuming it is Arrow, but I'm not sure how to control this, as e.g. there
>> >> # is no gc() for Arrow.
>> >>
>> >> # RESTART R SESSION HERE TO RECOVER MEMORY
>> >>
>> >> # It's possible that out-of-memory dataset operations can't use rename.
>> >>
>> >> # If you do not rename, and only select:
>> >> ds %>%
>> >>   select(vendor_id, pickup_at, dropoff_at, year, month) %>%
>> >>   write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
>> >>
>> >> # Starting memory usage: 425 MB (Task Manager - RStudio/R)
>> >> # Ending memory usage: 10,600 MB (Task Manager - RStudio/R)
>> >> ```
>> >>
>> >> --
>> >> Jameel Alsalam
>> >> (510) 717-9637
>> >> [email protected]
>
> --
> Jameel Alsalam
> (510) 717-9637
> [email protected]
