Whoops sorry Weston, that’s my bad! Thank you for the addition to the SO
post. I do see some improvements with deleting the reader now.

Arun

On Thu, Dec 9, 2021 at 20:21 Weston Pace <weston.p...@gmail.com> wrote:

> > Thank you Wes and David for the in-depth responses.
>
> Just as an aside, I go by Weston, as there is already a Wes on this
> mailing list and it can get confusing ;)
>
> > I also created a stack overflow post...I hope that is ok and/or useful.
> > Otherwise I can remove it.
>
> I think that's fine; SO can have greater reach than the mailing list.
>
> > As for guppy double counting, that is really strange.
>
> I agree; maybe guppy has difficulty identifying when two Python
> objects reference the same underlying chunk of C memory.  Here's a
> quick example of it getting confused (I'll add this to the SO
> post):
>
> import numpy as np
> import os
> import psutil
> import pyarrow as pa
> from guppy import hpy
>
> process = psutil.Process(os.getpid())
>
> x = np.random.rand(100000000)
> print(hpy().heap())
> print(process.memory_info().rss)
>
> # This is a zero-copy operation.  Note
> # that RSS remains consistent.  Both x
> # and arr reference the same underlying
> # array of doubles.
> arr = pa.array(x)
> print(hpy().heap())
> print(process.memory_info().rss)
>
> > By deleting the Reader, do you mean just doing a `del Reader` or `Reader
> = None`?
>
> I was thinking "del reader" but "reader = None" should achieve the
> same effect (I think?)
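>
> For concreteness, roughly what I mean (an untested sketch; 'test.arrow'
> and the memory_map call are just placeholders for however you open the
> file in your code):
>
> import pyarrow as pa
>
> source = pa.memory_map('test.arrow', 'r')
> reader = pa.ipc.RecordBatchFileReader(source)
> table = reader.read_all()
> del reader  # or reader = None; drop the reference once the batches are read
> df = table.to_pandas()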
>
> On Tue, Dec 7, 2021 at 2:44 PM Arun Joseph <ajos...@gmail.com> wrote:
> >
> > Thank you Wes and David for the in-depth responses. I also created a
> > stack overflow post on this with an example/outputs (before I saw your
> > responses, but updated after I saw them). I hope that is ok and/or
> > useful. Otherwise I can remove it.
> >
> > # I'm pretty sure guppy3 is double-counting.
> > As for guppy double counting, that is really strange. I think you're
> > looking at the cumulative size re: the dataframe size. That said, I did
> > observe what you are describing when I was generating the test data as
> > well. In the stack overflow post, I have another example that prints out
> > the RSS and the guppy heap output, and it does not seem like there is
> > double counting for the normal run. I also included a sleep at the end
> > before recording the heap and RSS.
> >
> > # I think split_blocks and self_destruct is the best answer at the
> moment.  self_destruct has remained in the code since at least 1.0.0 so
> perhaps it is time we remove the "experimental" flag and maybe replace it
> with a "caution" or "danger" flag (as it causes the table to become
> unusable afterwards). In terms of the closest immediate fix, split_blocks
> and self_destruct do seem like the best choices.
> > Yes, I agree. I'll be incorporating these changes into my actual codebase.
> > While they don't always work, there should be some improvement.
> >
> > # I did see some strange behavior when working with the
> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you can
> work around this by deleting the reader.
> > By deleting the Reader, do you mean just doing a `del Reader` or `Reader
> = None`?
> >
> > # Note that to minimize the memory usage, you should also pass
> use_threads=False
> > I will also try this out, thank you
> >
> > On Tue, Dec 7, 2021 at 6:32 PM David Li <lidav...@apache.org> wrote:
> >>
> >> Just for edification (though I have limited understanding of the
> machinery here, someone more familiar with Pandas internals may have more
> insight/this may be wrong or very outdated!):
> >>
> >> zero_copy_only does not work for two reasons (well, one reason
> >> fundamentally): the in-memory representation of a Pandas DataFrame has
> >> been a dense, 2D NumPy array per column type. In other words, all data
> >> across all columns of the same type is contiguous in memory. (At least
> >> historically; my understanding is that this has changed/become more
> >> flexible relatively recently.) This is the representation that Arrow
> >> tries to generate by default. (See
> >> https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
> >>
> >> However, the Arrow table you have is not contiguous: each column is
> >> allocated separately, and for a Table, each column is made up of a list
> >> of contiguous chunks. So there are very few cases where data can be
> >> zero-copied; it must instead be copied and "compacted".
> >>
> >> The split_blocks option *helps* work around this. It allows each column
> >> in the Pandas DataFrame to be its own allocation. However, each
> >> individual column must still be contiguous. If you try zero_copy_only
> >> with split_blocks, you'll get a different error message; this is because
> >> the columns of your Arrow Table have more than one chunk. If you create
> >> a small in-memory Table with only one column with one chunk,
> >> zero_copy_only + split_blocks will work!
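> >>
> >> For illustration, a minimal sketch of that (untested; the exact
> >> exception type/message may differ between pyarrow versions):
> >>
> >> import numpy as np
> >> import pyarrow as pa
> >>
> >> # One column, one chunk, no nulls: already contiguous, so no copy needed
> >> table = pa.table({'x': np.arange(1_000_000, dtype='float64')})
> >> assert table.column('x').num_chunks == 1
> >> df = table.to_pandas(split_blocks=True, zero_copy_only=True)  # works
> >>
> >> # A column with two chunks cannot be zero-copied into one contiguous
> >> # pandas column, so the same call raises an error
> >> chunked = pa.table({'x': pa.chunked_array([[1.0, 2.0], [3.0, 4.0]])})
> >> try:
> >>     chunked.to_pandas(split_blocks=True, zero_copy_only=True)
> >> except pa.ArrowInvalid as exc:
> >>     print(exc)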
> >>
> >> split_blocks with self_destruct still works in this case because
> >> self_destruct will still copy data; it will just also try to free the
> >> Arrow data as each column is converted. (Note that to minimize the
> >> memory usage, you should also pass use_threads=False. In that case, the
> >> maximum memory overhead should be one column's worth.)
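> >>
> >> In other words, something like this (a sketch; 'test.arrow' is just the
> >> file name from the example script earlier in the thread):
> >>
> >> from pyarrow import feather
> >>
> >> table = feather.read_table('test.arrow')
> >> df = table.to_pandas(
> >>     split_blocks=True,   # one allocation per column, not consolidated 2D blocks
> >>     self_destruct=True,  # free each Arrow column's buffers as it is converted
> >>     use_threads=False,   # convert one column at a time to cap peak overhead
> >> )
> >> del table  # the table is unusable after self_destruct anyway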
> >>
> >> -David
> >>
> >> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
> >>
> >> Thank you for the new example.
> >>
> >> # Why is it 2x?
> >>
 
> >> This is essentially the "peak RAM" usage of the operation.  Given that
> >> split_blocks helped, I think we can attribute this doubling to the
> >> pandas conversion.
> >>
> >> # Why doesn't the memory get returned?
> >>
> >> It does; it just doesn't do so immediately.  If I put a 5-second sleep
> >> before I print the memory, I see that the RSS shrinks down.  This is
> >> down to how jemalloc is configured in Arrow to release RSS after
> >> reaching peak consumption (the configured delay is actually 1 second, I
> >> think).
> >>
> >> BEFORE mem_size: 0.082276352gb
> >> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
> >> AFTER-ARROW: 3.281625024gb
> >> ---five second sleep---
> >> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
> >> AFTER-SLEEP-ARROW: 3.281625024gb
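> >>
> >> Roughly, that measurement is just your script with a sleep added (a
> >> sketch):
> >>
> >> import os
> >> import time
> >> import psutil
> >> from pyarrow import feather
> >>
> >> process = psutil.Process(os.getpid())
> >> df = feather.read_table('test.arrow').to_pandas()
> >> print(f'AFTER: {process.memory_info().rss / 1e9}gb')
> >> time.sleep(5)  # give jemalloc's decay timer a chance to return pages to the OS
> >> print(f'AFTER-SLEEP: {process.memory_info().rss / 1e9}gb')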
> >>
> >> # Why didn't switching to the system allocator help?
> >>
> >> The problem isn't "the dynamic allocator is allocating more than it
> >> needs".  There is a point in this process where ~6GB are actually
> >> needed.  The system allocator either also holds on to that RSS for a
> >> little bit or the RSS numbers themselves take a little bit of time to
> >> update.  I'm not entirely sure.
> >>
> >> # Why isn't this a zero-copy conversion to pandas?
> >>
> >> That's a good question; I don't know the details.  If I try manually
> >> doing the conversion with zero_copy_only I get the error "Cannot do
> >> zero copy conversion into multi-column DataFrame block"
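> >>
> >> (That attempt was roughly this; a sketch, reading the same test.arrow
> >> file:)
> >>
> >> from pyarrow import feather
> >>
> >> table = feather.read_table('test.arrow')
> >> # fails with: Cannot do zero copy conversion into multi-column DataFrame block
> >> df = table.to_pandas(zero_copy_only=True)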
> >>
> >> # What is up with the numpy.ndarray objects in the heap?
> >>
> >> I'm pretty sure guppy3 is double-counting.  Note that the total size
> >> is ~20GB.  I've been able to reproduce this in cases where the heap is
> >> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
> >> once even managed to generate this:
> >>
> >> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
> >> AFTER-SLEEP-ARROW: 0.0gb
> >> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
> >> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >>      0     57   0 6563250864  49  6563250864  49 pandas.core.series.Series
> >>      1    133   0 3339213718  25  9902464582  74 numpy.ndarray
> >>      2      1   0 3339197360  25 13241661942  99 pandas.core.frame.DataFrame
> >>
> >> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
> >>
> >> I did see some strange behavior when working with the
> >> RecordBatchFileReader, and I opened ARROW-15017 to resolve it, but you
> >> can work around it by deleting the reader.
> >>
> >> # Can I return the data immediately / I don't want to use 2x memory
> consumption
> >>
> >> I think split_blocks and self_destruct is the best answer at the
> >> moment.  self_destruct has remained in the code since at least 1.0.0
> >> so perhaps it is time we remove the "experimental" flag and maybe
> >> replace it with a "caution" or "danger" flag (as it causes the table
> >> to become unusable afterwards).
> >>
> >> Jemalloc has some manual facilities to purge dirty memory and we
> >> expose some of them with
> >> pyarrow.default_memory_pool().release_unused() but that doesn't seem
> >> to be helping in this situation.  Either the excess memory is in the
> >> non-jemalloc pool or the jemalloc command can't quite release this
> >> memory, or the RSS stats are just stale.  I'm not entirely sure.
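> >>
> >> (For reference, the call looks like this; a trivial sketch:)
> >>
> >> import pyarrow as pa
> >>
> >> # bytes_allocated() tracks what Arrow has allocated; release_unused()
> >> # asks the allocator to hand unused pages back to the OS right away
> >> pool = pa.default_memory_pool()
> >> print(pool.bytes_allocated())
> >> pool.release_unused()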
> >>
> >> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <ajos...@gmail.com> wrote:
> >> >
> >> > Slightly related: I have some other code that opens an arrow file
> >> > using a `pyarrow.ipc.RecordBatchFileReader` and then converts the
> >> > RecordBatches to a pandas dataframe. After this conversion is done and
> >> > I inspect the heap, I always see the following:
> >> >
> >> > hpy().heap()
> >> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
> >> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >> >      0    121   0 9939601034  49  9939601034  49 numpy.ndarray
> >> >      1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
> >> >      2      1   0  185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
> >> >
> >> > Specifically the numpy.ndarray. It only shows up after the conversion
> and it does not seem to go away. It also seems to be roughly the same size
> as the dataframe itself.
> >> >
> >> > - Arun
> >> >
> >> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <ajos...@gmail.com>
> wrote:
> >> >>
> >> >> Just to follow up on this, is there a way to manually force the
> >> >> arrow pool to de-allocate? My use case is essentially having multiple
> >> >> processes, in a Pool or via Slurm, read from an arrow file, do some
> >> >> work, and then exit. The issue is that the 2x memory consumption
> >> >> effectively halves the bandwidth of the machine.
> >> >>
> >> >> Thank You,
> >> >> Arun
> >> >>
> >> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <ajos...@gmail.com>
> wrote:
> >> >>>
> >> >>> Additionally, I tested with my actual data, and did not see memory
> savings.
> >> >>>
> >> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <ajos...@gmail.com>
> wrote:
> >> >>>>
> >> >>>> Hi Joris,
> >> >>>>
> >> >>>> Thank you for the explanation. The 2x memory consumption on
> >> >>>> conversion makes sense if there is a copy, but it does seem like it
> >> >>>> persists longer than it should. Might that be because of Python's GC
> >> >>>> policies?
> >> >>>> I tried out your recommendations, but they did not seem to work.
> >> >>>> However, I did notice an experimental option on `to_pandas`,
> >> >>>> `self_destruct`, which seems to address the issue I'm facing. Sadly,
> >> >>>> that by itself did not work either... but, combined with
> >> >>>> `split_blocks=True`, I am seeing memory savings:
> >> >>>>
> >> >>>> import pandas as pd
> >> >>>> import numpy as np
> >> >>>> import pyarrow as pa
> >> >>>> from pyarrow import feather
> >> >>>> import os
> >> >>>> import psutil
> >> >>>> pa.set_memory_pool(pa.system_memory_pool())
> >> >>>> DATA_FILE = 'test.arrow'
> >> >>>>
> >> >>>> def setup():
> >> >>>>   np.random.seed(0)
> >> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
> >> >>>>   df.to_feather(DATA_FILE)
> >> >>>>   print(f'wrote {DATA_FILE}')
> >> >>>>   import sys
> >> >>>>   sys.exit()
> >> >>>>
> >> >>>> if __name__ == "__main__":
> >> >>>>   # setup()
> >> >>>>   process = psutil.Process(os.getpid())
> >> >>>>   path = DATA_FILE
> >> >>>>
> >> >>>>   mem_size = process.memory_info().rss / 1e9
> >> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>
> >> >>>>   feather_table = feather.read_table(path)
> >> >>>>   # df = feather_table.to_pandas(split_blocks=True)
> >> >>>>   # df = feather_table.to_pandas()
> >> >>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
> >> >>>>
> >> >>>>   mem_size = process.memory_info().rss / 1e9
> >> >>>>   df_size = df.memory_usage().sum() / 1e9
> >> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >> >>>>
> >> >>>>
> >> >>>> OUTPUT(to_pandas()):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281625024gb
> >> >>>>
> >> >>>> OUTPUT (to_pandas(split_blocks=True)):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281627712gb
> >> >>>>
> >> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281627712gb
> >> >>>>
> >> >>>> I'm guessing since this feature is experimental, it might either
> go away, or might have strange behaviors. Is there anything I should look
> out for, or is there some alternative to reproduce these results?
> >> >>>>
> >> >>>> Thank You,
> >> >>>> Arun
> >> >>>>
> >> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
> >> >>>>>
> >> >>>>> Hi Arun, Weston,
> >> >>>>>
> >> >>>>> I didn't try running the script locally, but a quick note: the
> >> >>>>> `feather.read_feather` function reads the Feather file into an Arrow
> >> >>>>> table *and* directly converts it to a pandas DataFrame. Memory
> >> >>>>> consumption of 2x the size of the dataframe does not sound that
> >> >>>>> unexpected to me: most of the time, when converting an arrow table to
> >> >>>>> a pandas DataFrame, the data will be copied to accommodate pandas'
> >> >>>>> specific internal memory layout (at least numeric columns will be
> >> >>>>> combined together in 2D arrays).
> >> >>>>>
> >> >>>>> To verify if this is the cause, you might want to do either of:
> >> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which
> >> >>>>> will read the file as an Arrow table instead (and doesn't do any
> >> >>>>> conversion to pandas)
> >> >>>>> - if you want to include the conversion to pandas, also use
> >> >>>>> `read_table` and do the conversion to pandas explicitly with a
> >> >>>>> `to_pandas()` call on the result. In that case, you can specify
> >> >>>>> `split_blocks=True` to use more zero-copy conversion in the
> >> >>>>> arrow->pandas conversion
> >> >>>>>
> >> >>>>> Joris
> >> >>>>>
> >> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <ajos...@gmail.com>
> wrote:
> >> >>>>> >
> >> >>>>> > Hi Wes,
> >> >>>>> >
> >> >>>>> > Sorry for the late reply on this, but I think I got a
> reproducible test case:
> >> >>>>> >
> >> >>>>> > import pandas as pd
> >> >>>>> > import numpy as np
> >> >>>>> > import pyarrow as pa
> >> >>>>> > from pyarrow import feather
> >> >>>>> > import os
> >> >>>>> > import psutil
> >> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
> >> >>>>> > DATA_FILE = 'test.arrow'
> >> >>>>> >
> >> >>>>> > def setup():
> >> >>>>> >   np.random.seed(0)
> >> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
> >> >>>>> >   df.to_feather(DATA_FILE)
> >> >>>>> >   print(f'wrote {DATA_FILE}')
> >> >>>>> >   import sys
> >> >>>>> >   sys.exit()
> >> >>>>> >
> >> >>>>> > if __name__ == "__main__":
> >> >>>>> >   # setup()
> >> >>>>> >   process = psutil.Process(os.getpid())
> >> >>>>> >   path = DATA_FILE
> >> >>>>> >
> >> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >
> >> >>>>> >   df = feather.read_feather(path)
> >> >>>>> >
> >> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >   df_size = df.memory_usage().sum() / 1e9
> >> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >> >>>>> >
> >> >>>>> > OUTPUT:
> >> >>>>> > BEFORE mem_size: 0.091795456gb
> >> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> >> >>>>> > ARROW: 3.281625024gb
> >> >>>>> >
> >> >>>>> > Let me know if you're able to see similar results.
> >> >>>>> >
> >> >>>>> > Thanks,
> >> >>>>> > Arun
> >> >>>>> >
> >> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <
> weston.p...@gmail.com> wrote:
> >> >>>>> >>
> >> >>>>> >> I get more or less the same results as you for the provided
> setup data
> >> >>>>> >> (exact same #'s for arrow & df_size and slightly different for
> RSS
> >> >>>>> >> which is to be expected).  The fact that the arrow size is
> much lower
> >> >>>>> >> than the dataframe size is not too surprising to me.  If a
> column
> >> >>>>> >> can't be zero-copied then its memory will disappear from the arrow
> >> >>>>> >> pool (I think).  Plus, object columns will have overhead in
> pandas
> >> >>>>> >> that they do not have in Arrow.
> >> >>>>> >>
> >> >>>>> >> The df_size issue for me seems to be tied to string columns.
> I think
> >> >>>>> >> pandas is overestimating how much size is needed there (many
> of my
> >> >>>>> >> strings are similar and I wonder if some kind of object
> sharing is
> >> >>>>> >> happening).  But we can table this for another time.
> >> >>>>> >>
> >> >>>>> >> I tried writing my feather file with your parameters and it
> didn't
> >> >>>>> >> have much impact on any of the numbers.
> >> >>>>> >>
> >> >>>>> >> Since the arrow size for you is expected (nearly the same as
> the
> >> >>>>> >> df_size) I'm not sure what to investigate next.  The memory
> does not
> >> >>>>> >> seem to be retained by Arrow.  Is there any chance you could
> create a
> >> >>>>> >> reproducible test case using randomly generated numpy data
> (then you
> >> >>>>> >> could share that setup function)?
> >> >>>>> >>
> >> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <ajos...@gmail.com>
> wrote:
> >> >>>>> >> >
> >> >>>>> >> > Hi Wes,
> >> >>>>> >> >
> >> >>>>> >> > I'm not including the setup() call when I encounter the
> issue. I just kept it in there for ease of reproducibility. Memory usage is
> indeed higher when it is included, but that isn't surprising.
> >> >>>>> >> >
> >> >>>>> >> > I tried switching over to the system allocator but there is
> no change.
> >> >>>>> >> >
> >> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >> >>>>> >> >
> >> >>>>> >> > I updated my script to also include the Arrow bytes
> allocated and it gave me the following:
> >> >>>>> >> >
> >> >>>>> >> > MVE:
> >> >>>>> >> > import pandas as pd
> >> >>>>> >> > import pyarrow as pa
> >> >>>>> >> > from pyarrow import feather
> >> >>>>> >> > import os
> >> >>>>> >> > import psutil
> >> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> > def setup():
> >> >>>>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >> >>>>> >> >   df.to_feather('test.csv')
> >> >>>>> >> >
> >> >>>>> >> > if __name__ == "__main__":
> >> >>>>> >> >   # setup()
> >> >>>>> >> >   process = psutil.Process(os.getpid())
> >> >>>>> >> >   path = 'test.csv'
> >> >>>>> >> >
> >> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >> >
> >> >>>>> >> >   df = feather.read_feather(path)
> >> >>>>> >> >
> >> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >>>>> >> >   mem_size = process.memory_info().rss / 1e10
> >> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >> >>>>> >> >
> >> >>>>> >> > Output with my data:
> >> >>>>> >> > BEFORE mem_size: 0.08761344gb
> >> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >> >>>>> >> > ARROW: 3.080121792gb
> >> >>>>> >> >
> >> >>>>> >> > Output with Provided Setup Data:
> >> >>>>> >> > BEFORE mem_size: 0.09179136gb
> >> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >> >>>>> >> > ARROW: 0.00029664gb
> >> >>>>> >> >
> >> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes
> are distinct and non-overlapping, but it seems strange that the output with
> the provided data has the Arrow bytes allocated at ~0GB whereas the one
> with my data has the allocated data approximately equal to the dataframe
> size. I'm not sure if it affects anything but my file was written with the
> following:
> >> >>>>> >> >
> >> >>>>> >> > import pyarrow.lib as ext
> >> >>>>> >> > import pyarrow
> >> >>>>> >> > COMPRESSION_LEVEL = 19
> >> >>>>> >> > COMPRESSION_ALGO = 'zstd'
> >> >>>>> >> > KILOBYTE = 1 << 10
> >> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >> >>>>> >> > CHUNK_SIZE = MEGABYTE
> >> >>>>> >> >
> >> >>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> >> >>>>> >> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level, chunksize=chunk_size, version=2)
> >> >>>>> >> >
> >> >>>>> >> > As to the discrepancy around calculating dataframe size, I'm not
> >> >>>>> >> > sure why that would be so off for you. Going off the docs, it
> >> >>>>> >> > seems like it should be accurate. My dataframe in question is
> >> >>>>> >> > [7196546 rows x 56 columns], where the columns are mostly floats
> >> >>>>> >> > or integers, with a datetime index. 7196546 * 56 * 8 = 3224052608
> >> >>>>> >> > ~= 3.2GB, which roughly aligns.
> >> >>>>> >> >
> >> >>>>> >> > Thank You,
> >> >>>>> >> > Arun
> >> >>>>> >> >
> >> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <
> weston.p...@gmail.com> wrote:
> >> >>>>> >> >>
> >> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you
> including the
> >> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based
> CSV reader
> >> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes
> in memory
> >> >>>>> >> >> in case it needs to re-convert them later (because it
> realizes the
> >> >>>>> >> >> data type for the column is different).  I'm not sure if
> >> >>>>> >> >> pandas' CSV reader is similar.
> >> >>>>> >> >>
> >> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow
> to hold on
> >> >>>>> >> >> to a bit more memory and hold onto it (for a little while
> at least)
> >> >>>>> >> >> even after it is no longer used.  Even malloc will hold
> onto memory
> >> >>>>> >> >> sometimes due to fragmentation or other concerns.  You
> could try
> >> >>>>> >> >> changing to the system allocator
> >> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of
> your file)
> >> >>>>> >> >> to see if that makes a difference.
> >> >>>>> >> >>
> >> >>>>> >> >> I'm not sure your method of calculating the dataframe size
> is
> >> >>>>> >> >> reliable.  I don't actually know enough about pandas but
> when I tried
> >> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up
> reporting:
> >> >>>>> >> >>
> >> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >> >>>>> >> >>
> >> >>>>> >> >> which seems suspicious.
> >> >>>>> >> >>
> >> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1)
> didn't seem
> >> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow
> itself was
> >> >>>>> >> >> only using ~1.9GB and I will naively assume the difference
> between the
> >> >>>>> >> >> two is bloat caused by object wrappers when converting to
> pandas.
> >> >>>>> >> >>
> >> >>>>> >> >> Another thing you might try and measure is
> >> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will
> tell you how
> >> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not
> 6GB then it
> >> >>>>> >> >> is a pretty good guess that memory is being held somewhere
> else.
> >> >>>>> >> >>
> >> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <
> ajos...@gmail.com> wrote:
> >> >>>>> >> >> >
> >> >>>>> >> >> > Hi Apache Arrow Members,
> >> >>>>> >> >> >
> >> >>>>> >> >> > My question is below but I've compiled a minimum
> reproducible example with a public dataset:
> >> >>>>> >> >> >
> >> >>>>> >> >> > import pandas as pd
> >> >>>>> >> >> > from pyarrow import feather
> >> >>>>> >> >> > import os
> >> >>>>> >> >> > import psutil
> >> >>>>> >> >> >
> >> >>>>> >> >> >
> >> >>>>> >> >> > def setup():
> >> >>>>> >> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >> >>>>> >> >> >   df.to_feather('test.csv')
> >> >>>>> >> >> >
> >> >>>>> >> >> > if __name__ == "__main__":
> >> >>>>> >> >> >   # setup()
> >> >>>>> >> >> >   process = psutil.Process(os.getpid())
> >> >>>>> >> >> >   path = 'test.csv'
> >> >>>>> >> >> >
> >> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >> >> >
> >> >>>>> >> >> >   df = feather.read_feather(path)
> >> >>>>> >> >> >
> >> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>> >> >> >
> >> >>>>> >> >> > I substituted my df with a sample csv, though I had trouble
> >> >>>>> >> >> > finding a sample CSV of adequate size. My dataset is ~3GB, and
> >> >>>>> >> >> > I see memory usage of close to 6GB.
> >> >>>>> >> >> >
> >> >>>>> >> >> > Output with My Data:
> >> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
> >> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >> >>>>> >> >> >
> >> >>>>> >> >> > It seems strange that the overall memory usage of the process
> >> >>>>> >> >> > is approximately double the size of the dataframe itself. Is
> >> >>>>> >> >> > there a reason for this, and is there a way to mitigate it?
> >> >>>>> >> >> >
> >> >>>>> >> >> > $ conda list pyarrow
> >> >>>>> >> >> > #
> >> >>>>> >> >> > # Name                    Version                Build  Channel
> >> >>>>> >> >> > pyarrow                   4.0.1    py37h0f64622_13_cpu    conda-forge
> >> >>>>> >> >> >
> >> >>>>> >> >> > Thank You,
> >> >>>>> >> >> > Arun Joseph
> >> >>>>> >> >> >
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> > --
> >> >>>>> >> > Arun Joseph
> >> >>>>> >> >
> >> >>>>> >
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > --
> >> >>>>> > Arun Joseph
> >> >>>>> >
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Arun Joseph
> >> >>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Arun Joseph
> >> >>>
> >> >>
> >> >>
> >> >> --
> >> >> Arun Joseph
> >> >>
> >> >
> >> >
> >> > --
> >> > Arun Joseph
> >> >
> >>
> >>
> >
> >
> > --
> > Arun Joseph
> >
>
-- 
Arun Joseph
