[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
taotao li updated ARROW-7043:
-----------------------------
Description: 
Hi, first of all, thanks for building Arrow. I found this project through Wes's post: [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]. His ambition to build Arrow to fix the problems in pandas really caught my eye.

Below is my problem.

background:
 * Our team's analytic work relies heavily on pandas; we often read large CSV files into memory and do various kinds of analysis.
 * We have run into the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
 * We are looking for techniques that can help us load our CSVs (or other formats, like msgpack, parquet, or something else) using as little memory as possible.

experiment:
 * Luckily I found Arrow, and I ran a simple test.
 * Input file: a 1.5GB CSV file, around 6 million records, 15 columns.
 * Reading with pandas as below consumes about *1GB of memory*:

{code:java}
import pandas as pd

df = pd.read_csv(filename){code}
 * Reading with pyarrow as below consumes about *3.6GB of memory*, which really confuses me:

{code:java}
import pyarrow
import pyarrow.csv

table = pyarrow.csv.read_csv(filename){code}

problems:
 * Why does pyarrow need so much memory to read just 1.5GB of CSV data? It really disappoints me.
 * While pyarrow is reading the file, all 8 cores of my CPU are fully used.

environments:
 * Ubuntu 16
 * Python 3.5, IPython 6.5
 * pandas 0.20
 * pyarrow 0.15
 * server: 8 cores, 16 GB RAM

Thanks again. If needed, I can upload my 1.5GB file later.
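In case it helps narrow things down, here is a small sketch (assuming pyarrow 0.15, as above) that separates the two observations: `ReadOptions(use_threads=False)` should pin the parse to a single core, and `pyarrow.total_allocated_bytes()` reports what Arrow's default memory pool is actually holding, as opposed to what the OS shows for the whole process. `filename` is a placeholder for my 1.5GB file, as in the snippets above.

{code:java}
import pyarrow as pa
import pyarrow.csv

# Placeholder for the 1.5GB CSV file described above.
filename = "data.csv"

# Disable multi-threaded parsing so the read stays on a single core.
read_options = pa.csv.ReadOptions(use_threads=False)
table = pa.csv.read_csv(filename, read_options=read_options)

# Bytes currently held by Arrow's default memory pool,
# which may differ from the process-level memory usage.
print(pa.total_allocated_bytes()){code}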
> pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
> --------------------------------------------------------------------------
>
>                 Key: ARROW-7043
>                 URL: https://issues.apache.org/jira/browse/ARROW-7043
>             Project: Apache Arrow
>          Issue Type: Test
>          Components: Python
>    Affects Versions: 0.15.0
>            Reporter: taotao li
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)