[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-12-18 Thread taotao li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999645#comment-16999645
 ] 

taotao li commented on ARROW-7043:
--

[~apitrou] Hi Antoine, sorry for the late update on this issue; it currently 
works well for me. Let me close this issue and re-open it if we run into this 
kind of problem again in the future. Thanks a lot.

> [Python] pyarrow.csv.read_csv, memory consumed much larger than raw 
> pandas.read_csv
> ---
>
> Key: ARROW-7043
> URL: https://issues.apache.org/jira/browse/ARROW-7043
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Reporter: taotao li
> Priority: Major
>
> Hi, first of all, thanks a lot for building Arrow. I found this project 
> through Wes's post: 
> [https://wesmckinney.com/blog/apache-arrow-pandas-internals/]
> His ambition to fix pandas' problems by building Arrow really caught my 
> eye.
> Below is my problem:
> background:
>  * Our team's analytic work relies heavily on pandas; we often read 
> large csv files into memory and do various kinds of analysis.
>  * We have run into the problems mentioned in Wes's post, especially the 
> `pandas rule of thumb: have 5 to 10 times as much RAM as the size of 
> your dataset`.
>  * We are looking for techniques that let us load our csv (or another 
> format, like msgpack, parquet, or something else) using as little 
> memory as possible.
>  
> experiment:
>  * Luckily I found Arrow, and I ran a simple test.
>  * input file: a 1.5GB csv file, around 6 million records, 15 columns;
>  * using pandas as below, which consumes about *1GB of memory*:
> {code:python}
> import pandas as pd
> df = pd.read_csv(filename){code}
>  * using pyarrow as below, which consumes about *3.6GB of memory*, which 
> really confuses me:
> {code:python}
> import pyarrow
> import pyarrow.csv
> table = pyarrow.csv.read_csv(filename){code}
>  
> problems:
>  * Why does pyarrow need so much memory to read just 1.5GB of csv data? 
> It is quite disappointing.
>  * Also, while pyarrow is reading the file, all 8 of my CPU cores are 
> fully used.
>  
> environments:
>  * ubuntu 16
>  * python 3.5, ipython 6.5
>  * pandas 0.20
>  * pyarrow 0.15
>  * server: 8 cores, 16 GB RAM
>  
> Test File:
>  * 
> [https://drive.google.com/file/d/1WmUd_NJUIES58bO8mDNd77nf2JVsEu_r/view?usp=sharing]
>  
> Many thanks again.
> If needed, I can upload my 1.5GB file later.
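
For reference, a minimal sketch of how such a memory measurement can be 
reproduced. It assumes the psutil package and a stand-in filename, neither of 
which appears in the original report, and RSS is only a rough proxy for 
allocator-level usage:

{code:python}
import psutil        # assumed available; not used in the original report
import pyarrow.csv

filename = "data.csv"   # stand-in path for the 1.5GB test file

proc = psutil.Process()               # handle to the current process
rss_before = proc.memory_info().rss   # resident set size before the read

table = pyarrow.csv.read_csv(filename)

rss_after = proc.memory_info().rss
print("RSS grew by about %.2f GB" % ((rss_after - rss_before) / 1e9))
{code}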





[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-12-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999054#comment-16999054
 ] 

Antoine Pitrou commented on ARROW-7043:
---

[~taotao] Could you give an update here?






[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-11-08 Thread taotao li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970704#comment-16970704
 ] 

taotao li commented on ARROW-7043:
--

Thanks [~wesm] and [~apitrou] for your detailed tests. Let me try 0.15.1 and 
report back if any problems remain. I will close this issue if everything 
works as I hoped.

 

Thanks again.
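
A quick way to confirm which release is actually in use after upgrading; the 
pip command is generic usage, not something from the thread:

{code:python}
# Upgrade first, e.g.: pip install --upgrade pyarrow==0.15.1
import pyarrow
print(pyarrow.__version__)   # expect '0.15.1' after the upgrade
{code}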






[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-11-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966785#comment-16966785
 ] 

Wes McKinney commented on ARROW-7043:
-

The jemalloc changes in 0.15.1 could be a factor here
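
Presumably this refers to how quickly the bundled jemalloc allocator returns 
freed pages to the OS. A minimal sketch of the related knobs pyarrow exposes, 
assuming a build that bundles jemalloc; the filename is a stand-in:

{code:python}
import pyarrow as pa
import pyarrow.csv

filename = "data.csv"   # stand-in path, not from the thread

# Ask jemalloc to release dirty pages back to the OS immediately
# (0 ms decay); only applies when pyarrow is built with jemalloc.
pa.jemalloc_set_decay_ms(0)

table = pa.csv.read_csv(filename)

# Bytes currently held by Arrow's default memory pool -- the allocator's
# view of usage, which can differ from the RSS the OS reports.
print("Arrow allocated:", pa.total_allocated_bytes())
{code}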






[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-11-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966682#comment-16966682
 ] 

Antoine Pitrou commented on ARROW-7043:
---

I've tried to reproduce using pyarrow 0.15.1, compiled from source (Ubuntu 
18.04, x86-64):

{code}
>>> import pandas as pd
>>> %time df = pd.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 11.2 s, sys: 604 ms, total: 11.8 s
Wall time: 11.8 s
>>> %time df = pd.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 11.2 s, sys: 438 ms, total: 11.7 s
Wall time: 11.7 s
{code}
(RSS is around 1 GB)

{code}
>>> from pyarrow import csv
>>> %time tab = csv.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 17.3 s, sys: 1.7 s, total: 19 s
Wall time: 1.27 s
>>> %time tab = csv.read_csv('../../data/2010_2019_stk.csv')
CPU times: user 17.4 s, sys: 1.83 s, total: 19.2 s
Wall time: 1.33 s
{code}
(RSS is around 1.1 GB)

So with pyarrow 0.15.1 I'm not seeing much of a difference between Pandas and 
Arrow in terms of memory consumption. However, Arrow is much faster as it uses 
all CPU cores to parse and convert the CSV data.
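
For an apples-to-apples single-core comparison, the multi-threaded parsing 
can be disabled through ReadOptions. A minimal sketch using the same file 
path as above:

{code}
from pyarrow import csv

# Parse on a single thread instead of all available cores.
opts = csv.ReadOptions(use_threads=False)
tab = csv.read_csv('../../data/2010_2019_stk.csv', read_options=opts)
{code}

pyarrow.set_cpu_count() can likewise cap the number of cores pyarrow uses 
globally.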







[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-11-01 Thread taotao li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965218#comment-16965218
 ] 

taotao li commented on ARROW-7043:
--

[~apitrou] Thanks, Antoine, I've uploaded the file to a Google Drive link:

 
 * 
[https://drive.google.com/file/d/1WmUd_NJUIES58bO8mDNd77nf2JVsEu_r/view?usp=sharing]






[jira] [Commented] (ARROW-7043) [Python] pyarrow.csv.read_csv, memory consumed much larger than raw pandas.read_csv

2019-11-01 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16965089#comment-16965089
 ] 

Antoine Pitrou commented on ARROW-7043:
---

Yes, can you upload the file?



