[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-28 Thread Casey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960896#comment-16960896
 ] 

Casey commented on ARROW-6985:
--

So it sounds like this is just a known use case that Parquet isn't well 
suited for. For my own knowledge, why exactly does the heap fragment? 
Shouldn't each allocation just grab the same memory that was used in the 
previous iteration?

 

Anyway, I'm happy to have the issue closed as not needed, and I'll restructure 
our data to work within these limitations.
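
In case it's useful to anyone hitting the same pattern, below is a minimal 
sketch of how one might watch Arrow's memory pool across reads and ask 
jemalloc to release dirty pages sooner. The MemoryPool counters are documented 
pyarrow APIs; the decay call assumes a build that ships jemalloc (hence the 
guard).

{code:python}
import pandas as pd
import pyarrow as pa

# If this build ships jemalloc, ask it to return dirty pages to the OS
# immediately (0 ms decay); this can reduce heap fragmentation between reads.
try:
    pa.jemalloc_set_decay_ms(0)
except (AttributeError, NotImplementedError):
    pass  # pyarrow built without jemalloc, or too old to expose this call

# Watch how much memory Arrow's default pool holds across repeated reads.
pool = pa.default_memory_pool()
for i in range(5):
    df = pd.read_parquet("skinny_matrix.pq")
    print(f"iter {i}: allocated={pool.bytes_allocated()} peak={pool.max_memory()}")
    del df
{code}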



[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959779#comment-16959779
 ] 

Casey commented on ARROW-6985:
--

Okay, it looks like the wide-matrix case was explained in the ticket you 
linked. As for the loop slowdown, I'm seeing it gradually increase over time, 
depending on the data's shape.

Below are my results for a wide matrix, the wide matrix transposed, and the 
matrix unraveled into a single column. Over the number of loops I tried, I see 
about a 2x slowdown in the wide and transposed cases, though the trend of the 
line suggests it will keep growing. Is this expected?

!image-2019-10-25-14-52-46-165.png|width=479,height=273!

!image-2019-10-25-14-53-37-623.png|width=483,height=253!

!image-2019-10-25-14-54-32-583.png|width=462,height=246!
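
For reference, the three cases can be reproduced in one session along these 
lines. This is a condensed sketch, not the exact script behind the plots: the 
shapes follow the repro code in the description, and the file names here are 
illustrative.

{code:python}
import os
import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One mostly-zero matrix stored in three layouts: wide (many columns),
# tall (the transpose), and flat (everything in a single column).
mat = np.zeros((6000, 26000))
mat.ravel()[::100] = np.random.randn(60 * 26000)

layouts = {
    "wide.pq": pd.DataFrame(mat),
    "tall.pq": pd.DataFrame(mat.T),
    "flat.pq": pd.DataFrame({"x": mat.ravel()}),
}

for name, df in layouts.items():
    if not os.path.isfile(name):
        pq.write_table(pa.Table.from_pandas(df), name)

# Read each file repeatedly in the same session; compare first vs last times.
for name in layouts:
    timings = []
    for _ in range(50):
        start = time.time()
        pd.read_parquet(name)
        timings.append(time.time() - start)
    print(f"{name}: first={timings[0]:.3f}s last={timings[-1]:.3f}s")
{code}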



[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-54-32-583.png



[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-53-37-623.png



[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-52-46-165.png



[jira] [Created] (ARROW-6985) Steadily increasing time to load file using read_parquet

2019-10-24 Thread Casey (Jira)
Casey created ARROW-6985:


 Summary: Steadily increasing time to load file using read_parquet
 Key: ARROW-6985
 URL: https://issues.apache.org/jira/browse/ARROW-6985
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.15.0, 0.14.0, 0.13.0
Reporter: Casey
 Fix For: 0.15.0, 0.14.0, 0.13.0


I've noticed that reading from Parquet using pandas' read_parquet function 
takes steadily longer with each invocation. I've seen the other ticket about 
memory usage, but I'm seeing no memory impact, just steadily increasing read 
time until I restart the Python session.

Below is some code to reproduce my results. I notice it's particularly bad on 
wide matrices, especially using pyarrow==0.15.0.
{code:python}
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os
import numpy as np
import time

file = "skinny_matrix.pq"

# Build a mostly-zero 26000-row x 6000-column frame and write it to
# Parquet once.
if not os.path.isfile(file):
    mat = np.zeros((6000, 26000))
    mat.ravel()[::100] = np.random.randn(60 * 26000)
    df = pd.DataFrame(mat.T)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

# Read the same file repeatedly in one session, recording each read time.
n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.time()
    new_df = pd.read_parquet(file)
    end = time.time()
    timings[i] = end - start
{code}
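
To visualize the trend, the timings array can be plotted; a minimal sketch 
continuing from the script above, assuming matplotlib is installed:

{code:python}
import matplotlib.pyplot as plt

# Plot per-iteration read time so any upward drift is visible.
plt.plot(timings, marker="o")
plt.xlabel("iteration")
plt.ylabel("read_parquet time (s)")
plt.title("skinny_matrix.pq read time per iteration")
plt.savefig("read_timings.png")
{code}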



--
This message was sent by Atlassian Jira
(v8.3.4#803005)