[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999645#comment-16999645 ] taotao li commented on ARROW-7043: -- [~apitrou] Hi Antoine, sorry for the late update on this issue; it currently works well for me. Let me close this issue and re-open it if we run into this kind of problem again in the future. Thanks a lot.
> [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv
> ---
>
> Key: ARROW-7043
> URL: https://issues.apache.org/jira/browse/ARROW-7043
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Reporter: taotao li
> Priority: Major
>
> Hi, first of all thanks a lot for building Arrow. I found this project through Wes's post:
> [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]
> His ambition to build Arrow to fix pandas' problems really caught my eye.
> Below are my problems.
>
> Background:
> * Our team's analytic work relies heavily on pandas; we often read large CSV files into memory and do all kinds of analytic work.
> * We have faced the problems mentioned in Wes's post, especially the `pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset`.
> * We are looking for techniques that can help us load our CSV (or other formats, like msgpack, Parquet, or something else) using as little memory as possible.
>
> Experiment:
> * Luckily I found Arrow, and I did a simple test.
> * Input file: a 1.5 GB CSV file, around 6 million records, 15 columns.
> * Using pandas as below consumes about *1 GB of memory*:
> {code:java}
> import pandas as pd
> df = pd.read_csv(filename){code}
> * Using pyarrow as below consumes about *3.6 GB of memory*, which really confuses me:
> {code:java}
> import pyarrow
> import pyarrow.csv
> table = pyarrow.csv.read_csv(filename){code}
>
> Problems:
> * Why does pyarrow need so much memory to read just 1.5 GB of CSV data? It really disappoints me.
> * While pyarrow is reading the file, all 8 CPU cores are fully used.
>
> Environment:
> * Ubuntu 16
> * Python 3.5, IPython 6.5
> * pandas 0.20
> * pyarrow 0.15
> * server: 8 cores, 16 GB RAM
>
> Test file:
> * [https://drive.google.com/file/d/1WmUd_NJUIES58bO8mDNd77nf2JVsEu_r/view?usp=sharing]
>
> Great thanks again.
> If needed, I can upload my 1.5 GB file later.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999054#comment-16999054 ] Antoine Pitrou commented on ARROW-7043: --- [~taotao] Could you give an update here?
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970704#comment-16970704 ] taotao li commented on ARROW-7043: -- Thanks [~wesm] and [~apitrou] for your detailed tests; let me try 0.15.1 and update this issue if any problems remain. I will close this issue if everything works as I hoped. Thanks again.
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966785#comment-16966785 ] Wes McKinney commented on ARROW-7043: - The jemalloc changes in 0.15.1 could be a factor here
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966682#comment-16966682 ] Antoine Pitrou commented on ARROW-7043: --- I've tried to reproduce using pyarrow 0.15.1, compiled from source (Ubuntu 18.04, x86-64):
{code}
>>> import pandas as pd
>>> %time df = pd.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 11.2 s, sys: 604 ms, total: 11.8 s
Wall time: 11.8 s
>>> %time df = pd.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 11.2 s, sys: 438 ms, total: 11.7 s
Wall time: 11.7 s
{code}
(RSS is around 1 GB)
{code}
>>> %time tab = csv.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 17.3 s, sys: 1.7 s, total: 19 s
Wall time: 1.27 s
>>> %time tab = csv.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 17.4 s, sys: 1.83 s, total: 19.2 s
Wall time: 1.33 s
{code}
(RSS is around 1.1 GB)
So with pyarrow 0.15.1 I'm not seeing much of a difference between Pandas and Arrow in terms of memory consumption. However, Arrow is much faster, as it uses all CPU cores to parse and convert the CSV data.
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965218#comment-16965218 ] taotao li commented on ARROW-7043: -- [~apitrou] Thanks, Antoine, I've uploaded a Google Drive link:
* [https://drive.google.com/file/d/1WmUd_NJUIES58bO8mDNd77nf2JVsEu_r/view?usp=sharing]
[ https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965089#comment-16965089 ] Antoine Pitrou commented on ARROW-7043: --- Yes, can you upload the file?