[jira] [Commented] (ARROW-7690) [R] Cannot write parquet to OutputStream
[ https://issues.apache.org/jira/browse/ARROW-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024470#comment-17024470 ]

Bob commented on ARROW-7690:

Yeah, I can submit a PR in just a bit here.

> [R] Cannot write parquet to OutputStream
>
> Key: ARROW-7690
> URL: https://issues.apache.org/jira/browse/ARROW-7690
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.1
> Reporter: Bob
> Priority: Major
>
> The R package does not allow for the ability to write to a FileOutputStream.
>
> Minimal testing code:
>
> library(arrow)
> tf1 <- arrow::FileOutputStream$create(path = "output.parquet")
> arrow::write_parquet(data.frame(x = 1:5), tf1)
>
> Throws error:
>
> Error in inherits(sink, OutputStream) : 'what' must be a character vector
>
> The issue appears to be in line 153 of parquet.R
>
> if (is.character(sink)) {
>   sink <- FileOutputStream$create(sink)
>   on.exit(sink$close())
> } *else if (!inherits(sink, OutputStream))* {
>   abort("sink must be a file path or an OutputStream")
> }
>
> Should be !inherits(sink,'OutputStream')

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7690) Cannot write parquet to OutputStream
Bob created ARROW-7690:

Summary: Cannot write parquet to OutputStream
Key: ARROW-7690
URL: https://issues.apache.org/jira/browse/ARROW-7690
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 0.15.1
Reporter: Bob

The R package does not allow for the ability to write to a FileOutputStream.

Minimal testing code:

library(arrow)
tf1 <- arrow::FileOutputStream$create(path = "output.parquet")
arrow::write_parquet(data.frame(x = 1:5), tf1)

Throws error:

Error in inherits(sink, OutputStream) : 'what' must be a character vector

The issue appears to be in line 153 of parquet.R

if (is.character(sink)) {
  sink <- FileOutputStream$create(sink)
  on.exit(sink$close())
} *else if (!inherits(sink, OutputStream))* {
  abort("sink must be a file path or an OutputStream")
}

Should be !inherits(sink,'OutputStream')
[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951177#comment-16951177 ]

Bob commented on ARROW-6876:

I also tried fastparquet as an engine, and it just threw an error when reading the file. It seems it cannot decode the file at all.

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: python3.7
> Reporter: Bob
> Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, image-2019-10-14-18-12-07-652.png
>
> Hi,
>
> I just noticed that reading a parquet file becomes really slow after I
> upgraded to 0.15.0 when using pandas.
>
> Example:
> *With 0.14.1*
> In [4]: %timeit df = pd.read_parquet(path)
> 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
> In [5]: %timeit df = pd.read_parquet(path)
> 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> The file is about 15MB in size. I am testing on the same machine using the
> same version of python and pandas.
>
> Have you received similar complaints? What could be the issue here?
>
> Thanks a lot.
>
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951176#comment-16951176 ]

Bob commented on ARROW-6876:

[~jorisvandenbossche] Thanks, let me know if I can help. I think our case is quite unusual. Also, I am not sure whether the multilevel columns add any complexity; it seems Parquet does not handle them very well?
[jira] [Comment Edited] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172 ]

Bob edited comment on ARROW-6876 at 10/14/19 5:18 PM:

[~jorisvandenbossche] It seems you started calling this function, which caused the issue:
[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]

was (Author: dorafmon):
[~jorisvandenbossche] It seems you added this function, which caused the issue:
[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]
[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172 ]

Bob commented on ARROW-6876:

[~jorisvandenbossche] It seems you added this function, which caused the issue:
[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]
[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951168#comment-16951168 ]

Bob commented on ARROW-6876:

[~jorisvandenbossche] Sorry, I cannot share the data with you because it contains our IP. What I can share is:

In [6]: df.shape
Out[6]: (61, 31835)

All fields are just plain floats, so I believe you can create a dataframe like this without difficulty. One thing to note is that our dataframe uses multilevel columns, but I suppose that is not an issue?
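A frame with the shape described in the comment above can be approximated with synthetic data. The following is only a sketch under stated assumptions: the two-level column layout, the level names ("group", "field"), the labels, and the random values are all invented for illustration, since the real data was not shared. The helper names make_wide_frame and time_roundtrip are hypothetical, and the round trip assumes pyarrow is installed.

```python
import numpy as np
import pandas as pd


def make_wide_frame(n_rows=61, n_cols=31835, n_groups=5, seed=0):
    """Build a float DataFrame with two-level (multilevel) columns.

    The default shape matches the (61, 31835) reported in the thread;
    the column labels are made up for illustration.
    """
    rng = np.random.default_rng(seed)
    groups = [f"g{i % n_groups}" for i in range(n_cols)]
    fields = [f"f{i}" for i in range(n_cols)]
    columns = pd.MultiIndex.from_arrays([groups, fields], names=["group", "field"])
    return pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


def time_roundtrip(df, path):
    """Write df to Parquet with pyarrow, then time one read back (seconds)."""
    import time

    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pandas(df), path)
    start = time.perf_counter()
    pq.read_table(path).to_pandas()
    return time.perf_counter() - start


if __name__ == "__main__":
    frame = make_wide_frame()
    print(frame.shape)
    print(f"read took {time_roundtrip(frame, 'wide.parquet'):.2f} s")
```

Running the script under each pyarrow version in turn would show whether a synthetic frame of this shape reproduces the slowdown; whether it does is not established here.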
[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated ARROW-6876:
---
Description:

Hi,

I just noticed that reading a parquet file becomes really slow after I upgraded to 0.15.0 when using pandas.

Example:
*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15MB in size. I am testing on the same machine using the same version of python and pandas.

Have you received similar complaints? What could be the issue here?

Thanks a lot.

Edit1:
Some profiling I did:
0.14.1:
!image-2019-10-14-18-12-07-652.png!

0.15.0:
!image-2019-10-14-18-10-42-850.png!

was:

Hi,

I just noticed that reading a parquet file becomes really slow after I upgraded to 0.15.0 when using pandas.

Example:
*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15MB in size. I am testing on the same machine using the same version of python and pandas.

Have you received similar complaints? What could be the issue here?

Thanks a lot.
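The profiling results in the description survive only as image attachments. A comparable text profile can be collected with the standard library's cProfile module. The sketch below is not the reporter's method; profile_call is a hypothetical helper, and the workload in the usage section is a stand-in that would be replaced with the actual pd.read_parquet(path) call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    # Stand-in workload; swap in e.g. profile_call(pd.read_parquet, path).
    _, report = profile_call(sorted, range(1000, 0, -1))
    print(report)
```

Sorting by cumulative time puts the outermost expensive calls first, which is usually the quickest way to see which function dominates a slow read when comparing two library versions.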
[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-12-07-652.png
[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-10-42-850.png
[jira] [Created] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0
Bob created ARROW-6876:

Summary: Reading parquet file becomes really slow for 0.15.0
Key: ARROW-6876
URL: https://issues.apache.org/jira/browse/ARROW-6876
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.0
Environment: python3.7
Reporter: Bob

Hi,

I just noticed that reading a parquet file becomes really slow after I upgraded to 0.15.0 when using pandas.

Example:
*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The file is about 15MB in size. I am testing on the same machine using the same version of python and pandas.

Have you received similar complaints? What could be the issue here?

Thanks a lot.