Weston: Thank you for the detailed reply. This gives me a much better sense
of where to keep looking, and of where the behavior I'm seeing could be a
bug. I have tried `arrow::default_memory_pool()$bytes_allocated` and I get
1048 both before and after write_dataset, which seems to confirm that it is
not Arrow memory allocation. I will keep investigating, and this hint has
shown me that more of the underlying C++ functionality is accessible from R
than I realized. I will certainly follow the continuing progress in this
area.
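
For reference, the check was along these lines (a minimal sketch only; the
paths are placeholders for my real pipeline):

```
library(arrow)

# Open the source files as a dataset without reading them into memory.
ds <- open_dataset("csv-dir", format = "csv")         # placeholder path

default_memory_pool()$bytes_allocated                 # before the write: 1048

write_dataset(ds, "parquet-dir", format = "parquet")  # placeholder path

default_memory_pool()$bytes_allocated                 # after the write: still 1048
```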

Benson: Thanks for this idea. A bigger machine is certainly a possibility,
though for the time being I need to make do.

best,
Jameel

On Tue, Oct 19, 2021 at 2:03 AM Benson Muite <[email protected]>
wrote:

> Might you be able to run your analysis in batch scripts [1] rather than
> using RStudio? If so, this might make it easier to run in the cloud,
> where large shared-memory compute instances (up to 350GB) are available.
> The script you have for the NYC taxi dataset should run fine using batch
> processing.
>
> Using batch scripts might also make it easier to run R with distributed
> memory.
>
> [1] https://rdrr.io/r/utils/BATCH.html
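> 
> For example, if the script is saved as, say, convert.R (the name here is
> just for illustration), it can be run non-interactively with either of:
> 
> ```
> R CMD BATCH convert.R convert.Rout
> # or, equivalently:
> Rscript convert.R
> ```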
>
> On 10/19/21 8:13 AM, Weston Pace wrote:
> > Thank you for your interest!  As you are discovering it can be a
> > slightly tricky topic.  Disclaimer: I am neither an R expert nor a
> > Windows expert, but I have worked some on the underlying C++ feature.
> >
> > Some of the limitations you are running into are probably bugs and not
> > intentional limitations.  In particular, there was a regression in
> > 4.0.0 and 5.0.0 [1] which meant that a slow consumer of data (writing
> > a dataset is typically a slow consumer) could lead to Arrow using too
> > much RAM (in such a case it needs to backoff on the read and it was
> > not doing this) and eventually crashing out of memory.  This issue has
> > been recently fixed and, using a very recent build, I confirmed that I
> > was able to read a 30GB CSV dataset and write it to parquet.  The peak
> > RAM usage of the Arrow process was 1GB.  Note that this was on Linux
> > and not Windows, but that shouldn't matter.
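> >
> > (For anyone wanting to reproduce that kind of test, a minimal sketch is
> > below; the directory names are placeholders.)
> >
> > ```
> > library(arrow)
> >
> > # Open the CSV files as a dataset without reading them into memory,
> > # then stream them back out as a parquet dataset.
> > ds <- open_dataset("csv-dir", format = "csv")
> > write_dataset(ds, "parquet-dir", format = "parquet")
> > ```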
> >
> > Some of this may be due to the way that the OS handles disk I/O.  As
> > users, we are typically used to smallish disk writes (e.g. a couple of
> > GB or less) and, from a user perspective, these writes are often
> > non-blocking and very fast.  In reality the write is only pushing the
> > data into the OS' disk cache and then returning as soon as that memcpy
> > is done.  The actual write to the physical disk happens behind the
> > scenes (and can even happen outside the lifetime of the process).  By
> > default, this disk cache is (I think, not sure for Windows) allowed to
> > consume all available RAM.  Once that happens, additional writes (and
> > possibly regular allocations) will be slowed down while the OS waits
> > for the data to be persisted to disk so that the RAM can be reused.
> >
> > At the moment, Arrow does nothing to prevent this OS cache from
> > filling up.  This may be something we can investigate in future
> > releases; it is an interesting question what the best behavior would be.
> >
> >> When working with dplyr & datasets, are there parameters that determine
> >> whether operations can be performed in a streaming/iterative form that
> >> is needed when data is much bigger than memory?
> >
> > There are a few operations which are not implemented by Arrow, and I
> > believe that when this situation is encountered, Arrow will load the
> > entire dataset into memory and apply the operation in R.  This would
> > definitely be a bad thing for your goals.  I'm not sure of the exact
> > details of which operations will trigger this, but the basic select /
> > rename operations you have should be ok.
> >
> > There are also a few operations which are implemented in arrow but
> > will force the data to be buffered in memory.  These are arrange (or
> > anything ordering data like top-k) and join.
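> >
> > (As an illustration only, reusing the dataset and columns from your
> > script: a query along these lines would stream the select but buffer in
> > memory because of the arrange.)
> >
> > ```
> > ds %>%
> >   select(vendor_id, pickup_at) %>%
> >   arrange(pickup_at) %>%              # ordering forces buffering
> >   write_dataset("nyc-taxi-sorted")    # placeholder output path
> > ```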
> >
> >> I wasn't expecting write_dataset to continue consuming memory when
> >> finished. I don't think gc() or pryr functions are able to clear or
> >> measure memory used by Arrow. Are there different tools I should be
> >> using here? Maybe I need to be telling Arrow to limit usage somehow?
> >
> > After write_dataset is finished the process should not be holding on
> > to any memory.  If it is doing so then that is a bug.  However, the OS
> > may still be holding onto data that is in the disk cache waiting to be
> > flushed to disk.  A good quick test is to check
> > "arrow::default_memory_pool()$bytes_allocated".  This will report how
> > much memory Arrow believes it is using.  If this is 0 then that is a
> > good bet (though by no means a guarantee) that anything the Arrow
> > system library has allocated has been released.  On Windows, the
> > program RAMMap [3] might give you some more information on how much
> > data is in the disk cache.
> >
> >> The current documentation for write_dataset says you can't rename
> >> while writing -- in my experience this did work. Is the reason for this
> >> note that in order to rename, Arrow will change the dataset to an
> >> in-memory Table? Based on my test, the memory usage didn't seem less,
> >> but this was one of my theories of what was going on.
> >
> > The note here is quite old and the functions it describes have been
> > changed a lot in the last year.  My guess is this is a relic from the
> > time that dplyr functions were handled differently.  Maybe someone
> > else can chime in to verify.  From what I can tell, a rename in R is
> > translated into a select, which is translated into a project in the C++
> > layer.  A project operation should be able to operate in a streaming
> > fashion and will not force the data to be buffered in memory.
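> >
> > (For illustration, using the column names from your script: in dplyr
> > terms the two queries below describe the same projection, which is why
> > the rename should not add any extra buffering.)
> >
> > ```
> > ds %>% select(pickup_dttm = pickup_at, dropoff_dttm = dropoff_at)
> >
> > ds %>%
> >   select(pickup_at, dropoff_at) %>%
> >   rename(pickup_dttm = pickup_at, dropoff_dttm = dropoff_at)
> > ```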
> >
> > In summary, out-of-core processing is something I think many of us are
> > interested in and want to support.  Basic out-of-core manipulation
> > (repartitioning, transforming from one format to another) should be
> > pretty well supported in 6.0.0, but it might consume all available OS
> > RAM as disk cache.  Out-of-core work in general is still getting
> > started, and you will hopefully continue to see improvements as we work
> > on it.  For example, I hope future releases will be able to support
> > out-of-core joins and ordering statements by spilling to disk.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-13611
> > [2] https://docs.microsoft.com/en-us/windows/win32/fileio/file-caching
> > [3] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap
> >
> > On Mon, Oct 18, 2021 at 11:34 AM Jameel Alsalam <[email protected]>
> > wrote:
> >>
> >> Hello,
> >>
> >> I am a (learning) user of the Arrow R package on Windows. I am
> >> currently focused on using Arrow to do data preparation on a
> >> bigger-than-my-memory set of csv files, transforming them into parquet
> >> files for further analysis with DuckDB. I have about 600 csv files,
> >> totaling about 200 GB, which had been dumped out of a database. I've had
> >> luck doing some of this, but for the biggest table I am struggling to
> >> understand when Arrow may fill memory and grind to a halt, versus when I
> >> should expect that Arrow can iterate through.
> >>
> >> For reproducibility purposes, I did some work with the nyc-taxi dataset
> >> down below. These runs do not fill my memory, but they do use more than
> >> I expected, and I don't know how to free it without restarting the R
> >> session.
> >>
> >> My questions:
> >> 1) When working with dplyr & datasets, are there parameters that
> >> determine whether operations can be performed in a streaming/iterative
> >> form that is needed when data is much bigger than memory?
> >> 2) I wasn't expecting write_dataset to continue consuming memory when
> >> finished. I don't think gc() or pryr functions are able to clear or
> >> measure memory used by Arrow. Are there different tools I should be
> >> using here? Maybe I need to be telling Arrow to limit usage somehow?
> >> 3) The current documentation for write_dataset says you can't rename
> >> while writing -- in my experience this did work. Is the reason for this
> >> note that in order to rename, Arrow will change the dataset to an
> >> in-memory Table? Based on my test, the memory usage didn't seem less,
> >> but this was one of my theories of what was going on.
> >>
> >> thanks,
> >> Jameel
> >>
> >> ```
> >> #### Read dataset -> write dataset ---------
> >>
> >> library(tidyverse)
> >> library(arrow)
> >> library(duckdb)
> >>
> >> # Do I understand the limitations of out-of-memory dataset
> >> # manipulations?
> >>
> >> packageVersion("arrow")
> >> # [1] ‘5.0.0.20211016’
> >>
> >> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> >>
> >> # The documentation for write_dataset says you can't rename in the
> >> # process of writing.
> >> # In @param dataset:
> >> # "Note that select()-ed columns may not be renamed."
> >>
> >> ds %>%
> >>    select(vendor_id, pickup_at, dropoff_at, year, month) %>%
> >>    rename(
> >>      pickup_dttm = pickup_at,
> >>      dropoff_dttm = dropoff_at
> >>    ) %>%
> >>    write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
> >>
> >> # Starting memory usage: 420 MB (task manager - RStudio/R)
> >> # Ending memory usage: 12,100 MB (task manager - RStudio/R)
> >>
> >> # It does _work_, but a lot more memory is used. Task manager sees the
> >> # memory as used by the RStudio session, but RStudio sees the memory as
> >> # used by the system. I am assuming it is Arrow, but I'm not sure how to
> >> # control this, as e.g. there is no gc() for Arrow.
> >>
> >> # RESTART R SESSION HERE TO RECOVER MEMORY
> >>
> >> # It's possible that out-of-memory dataset operations can't use rename.
> >>
> >> # If you do not rename, and only select:
> >> ds %>%
> >>    select(vendor_id, pickup_at, dropoff_at, year, month) %>%
> >>    write_dataset("nyc-taxi-mod", partitioning = c("year", "month"))
> >>
> >> # starting memory usage: 425 MB (Task manager - for Rstudio/R)
> >> # end usage: 10,600 MB (task manager - for Rstudio/R)
> >> ```
> >>
> >>
> >>
> >> --
> >> Jameel Alsalam
> >> (510) 717-9637
> >> [email protected]
>
>

-- 
Jameel Alsalam
(510) 717-9637
[email protected]
