Hi Wes,

Sorry for the late reply on this, but I think I've got a reproducible test
case:

import pandas as pd
import numpy as np
import pyarrow as pa
from pyarrow import feather
import os
import psutil
pa.set_memory_pool(pa.system_memory_pool())
DATA_FILE = 'test.arrow'

def setup():
  np.random.seed(0)
  df = pd.DataFrame(np.random.uniform(0, 100, size=(7196546, 57)),
                    columns=[f'i_{i}' for i in range(57)])
  df.to_feather(DATA_FILE)
  print(f'wrote {DATA_FILE}')
  import sys
  sys.exit()

if __name__ == "__main__":
  # setup()
  process = psutil.Process(os.getpid())
  path = DATA_FILE

  mem_size = process.memory_info().rss / 1e9
  print(f'BEFORE mem_size: {mem_size}gb')

  df = feather.read_feather(path)

  mem_size = process.memory_info().rss / 1e9
  df_size = df.memory_usage().sum() / 1e9
  print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
  print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')

OUTPUT:
BEFORE mem_size: 0.091795456gb
AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
ARROW: 3.281625024gb
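
For what it's worth, the printed numbers look internally consistent with the
data being held twice (once by the Arrow pool, once by pandas). A quick
sanity check on the figures above:

```python
# Sanity check on the numbers printed above (all values in GB).
before_rss = 0.091795456
after_rss = 6.762156032
df_size = 3.281625104
arrow_bytes = 3.281625024

# 7196546 rows x 57 float64 columns x 8 bytes each, plus the 128-byte
# RangeIndex that df.memory_usage() also counts, reproduces df_size.
expected_df = (7196546 * 57 * 8 + 128) / 1e9
print(expected_df)  # 3.281625104

# RSS growth is roughly df_size + Arrow pool, i.e. two full copies.
rss_growth = after_rss - before_rss
print(abs(rss_growth - (df_size + arrow_bytes)) / rss_growth < 0.05)  # True
```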

Let me know if you're able to see similar results.

Thanks,
Arun

On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <weston.p...@gmail.com> wrote:

> I get more or less the same results as you for the provided setup data
> (exact same #'s for arrow & df_size and slightly different for RSS
> which is to be expected).  The fact that the arrow size is much lower
> than the dataframe size is not too surprising to me.  If a column
> can't be zero-copied then its memory will disappear from the arrow
> pool (I think).  Plus, object columns will have overhead in pandas
> that they do not have in Arrow.
>
> The df_size issue for me seems to be tied to string columns.  I think
> pandas is overestimating how much size is needed there (many of my
> strings are similar and I wonder if some kind of object sharing is
> happening).  But we can table this for another time.
>
> I tried writing my feather file with your parameters and it didn't
> have much impact on any of the numbers.
>
> Since the arrow size for you is expected (nearly the same as the
> df_size) I'm not sure what to investigate next.  The memory does not
> seem to be retained by Arrow.  Is there any chance you could create a
> reproducible test case using randomly generated numpy data (then you
> could share that setup function)?
>
> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <ajos...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I'm not including the setup() call when I encounter the issue. I just
> kept it in there for ease of reproducibility. Memory usage is indeed higher
> when it is included, but that isn't surprising.
> >
> > I tried switching over to the system allocator but there is no change.
> >
> > I've updated to Arrow 6.0.1 as well and there is no change.
> >
> > I updated my script to also include the Arrow bytes allocated and it
> gave me the following:
> >
> > MVE:
> > import pandas as pd
> > import pyarrow as pa
> > from pyarrow import feather
> > import os
> > import psutil
> > pa.set_memory_pool(pa.system_memory_pool())
> >
> >
> > def setup():
> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >   df.to_feather('test.csv')
> >
> > if __name__ == "__main__":
> >   # setup()
> >   process = psutil.Process(os.getpid())
> >   path = 'test.csv'
> >
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'BEFORE mem_size: {mem_size}gb')
> >
> >   df = feather.read_feather(path)
> >
> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >
> > Output with my data:
> > BEFORE mem_size: 0.08761344gb
> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> > ARROW: 3.080121792gb
> >
> > Output with Provided Setup Data:
> > BEFORE mem_size: 0.09179136gb
> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> > ARROW: 0.00029664gb
> >
> > I'm assuming that the df and the arrow bytes allocated/sizes are
> distinct and non-overlapping, but it seems strange that the output with the
> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
> data has the allocated data approximately equal to the dataframe size. I'm
> not sure if it affects anything but my file was written with the following:
> >
> > import pyarrow.lib as ext
> > import pyarrow
> > COMPRESSION_LEVEL = 19
> > COMPRESSION_ALGO = 'zstd'
> > KILOBYTE = 1 << 10
> > MEGABYTE = KILOBYTE * KILOBYTE
> > CHUNK_SIZE = MEGABYTE
> >
> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> > ext.write_feather(table, dest, compression=COMPRESSION_ALGO,
> > compression_level=COMPRESSION_LEVEL, chunksize=CHUNK_SIZE, version=2)
> >
> > As to the discrepancy around calculating dataframe size: I'm not sure
> why that would be so far off for you. Going off the docs, it seems like it
> should be accurate. My dataframe in question is [7196546 rows x 56 columns],
> where each column is mostly float or integer, with a datetime index. 7196546
> * 56 * 8 = 3224052608 ~= 3.2GB, which roughly aligns.
> >
> > Thank You,
> > Arun
> >
> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <weston.p...@gmail.com>
> wrote:
> >>
> >> 2x overshoot of memory does seem a little high.  Are you including the
> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
> >> will require 2-3x memory usage because it buffers the bytes in memory
> >> in case it needs to re-convert them later (because it realizes the
> >> data type for the column is different).  I'm not sure if pandas' CSV
> >> reader is similar.
> >>
> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
> >> onto a bit more memory (for a little while at least)
> >> even after it is no longer used.  Even malloc will hold onto memory
> >> sometimes due to fragmentation or other concerns.  You could try
> >> changing to the system allocator
> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> >> to see if that makes a difference.
> >>
> >> I'm not sure your method of calculating the dataframe size is
> >> reliable.  I don't actually know enough about pandas but when I tried
> >> your experiment with my own 1.9G CSV file it ended up reporting:
> >>
> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >>
> >> which seems suspicious.
> >>
> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
> >> only using ~1.9GB and I will naively assume the difference between the
> >> two is bloat caused by object wrappers when converting to pandas.
> >>
> >> Another thing you might try and measure is
> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
> >> is a pretty good guess that memory is being held somewhere else.
> >>
> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <ajos...@gmail.com> wrote:
> >> >
> >> > Hi Apache Arrow Members,
> >> >
> >> > My question is below but I've compiled a minimum reproducible example
> with a public dataset:
> >> >
> >> > import pandas as pd
> >> > from pyarrow import feather
> >> > import os
> >> > import psutil
> >> >
> >> >
> >> > def setup():
> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >> >   df.to_feather('test.csv')
> >> >
> >> > if __name__ == "__main__":
> >> >   # setup()
> >> >   process = psutil.Process(os.getpid())
> >> >   path = 'test.csv'
> >> >
> >> >   mem_size = process.memory_info().rss / 1e9
> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >
> >> >   df = feather.read_feather(path)
> >> >
> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >   mem_size = process.memory_info().rss / 1e9
> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >
> >> > I substituted my df with a sample csv. I had trouble finding a sample
> CSV of adequate size; however, my dataset is ~3GB, and I see memory usage of
> close to 6GB.
> >> >
> >> > Output with My Data:
> >> > BEFORE mem_size: 0.088891392gb
> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >> >
> >> > It seems strange that the overall memory usage of the process is
> approx double of the size of the dataframe itself. Is there a reason for
> this, and is there a way to mitigate this?
> >> >
> >> > $ conda list pyarrow
> >> > #
> >> > # Name                    Version                   Build  Channel
> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
> conda-forge
> >> >
> >> > Thank You,
> >> > Arun Joseph
> >> >
> >
> >
> >
> > --
> > Arun Joseph
> >
>


-- 
Arun Joseph
