[jira] [Commented] (ARROW-7690) [R] Cannot write parquet to OutputStream

2020-01-27 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17024470#comment-17024470
 ] 

Bob commented on ARROW-7690:


Yeah, I can submit a PR in just a bit here.

> [R] Cannot write parquet to OutputStream
> 
>
> Key: ARROW-7690
> URL: https://issues.apache.org/jira/browse/ARROW-7690
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.1
> Reporter: Bob
> Priority: Major
>
> The R package cannot write Parquet to a FileOutputStream.
> Minimal testing code:
> library(arrow)
> tf1 <- arrow::FileOutputStream$create(path = "output.parquet")
> arrow::write_parquet(data.frame(x = 1:5), tf1)
> Throws error:
> Error in inherits(sink, OutputStream) : 'what' must be a character vector
>  
> The issue appears to be in line 153 of parquet.R (see the marked line):
> if (is.character(sink)) {
>   sink <- FileOutputStream$create(sink)
>   on.exit(sink$close())
> } else if (!inherits(sink, OutputStream)) {  # <- offending check
>   abort("sink must be a file path or an OutputStream")
> }
>  
> Should be !inherits(sink, 'OutputStream'), since inherits() expects the class
> name as a character vector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7690) Cannot write parquet to OutputStream

2020-01-27 Thread Bob (Jira)
Bob created ARROW-7690:
--

 Summary: Cannot write parquet to OutputStream
 Key: ARROW-7690
 URL: https://issues.apache.org/jira/browse/ARROW-7690
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.15.1
Reporter: Bob


The R package cannot write Parquet to a FileOutputStream.

Minimal testing code:
library(arrow)
tf1 <- arrow::FileOutputStream$create(path = "output.parquet")
arrow::write_parquet(data.frame(x = 1:5), tf1)

Throws error:

Error in inherits(sink, OutputStream) : 'what' must be a character vector

The issue appears to be in line 153 of parquet.R (see the marked line):

if (is.character(sink)) {
  sink <- FileOutputStream$create(sink)
  on.exit(sink$close())
} else if (!inherits(sink, OutputStream)) {  # <- offending check
  abort("sink must be a file path or an OutputStream")
}

Should be !inherits(sink, 'OutputStream'), since inherits() expects the class
name as a character vector.





[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951177#comment-16951177
 ] 

Bob commented on ARROW-6876:


I also tried fastparquet as the engine, and it just threw an error when 
reading the file. It seems it cannot decode the file.

> Reading parquet file becomes really slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Environment: python3.7
> Reporter: Bob
> Priority: Major
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  





[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951176#comment-16951176
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] Thanks, let me know if I can help. I think our case is 
unusual. Also, I am not sure whether the multilevel columns add any 
complexity; it seems Parquet does not handle them very well?



[jira] [Comment Edited] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172
 ] 

Bob edited comment on ARROW-6876 at 10/14/19 5:18 PM:
--

[~jorisvandenbossche] It seems you started calling this function, which caused 
the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]


was (Author: dorafmon):
[~jorisvandenbossche] It seems you added this function, which caused the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]



[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951172#comment-16951172
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] It seems you added this function, which caused the issue:

 

[https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L1118]



[jira] [Commented] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951168#comment-16951168
 ] 

Bob commented on ARROW-6876:


[~jorisvandenbossche] Sorry, I cannot share the data with you because it 
contains our IP. What I can share is:

 

In [6]: df.shape
Out[6]: (61, 31835)

 

All fields are just plain floats, so I believe you can create a dataframe like 
this without difficulty?

 

One thing to note is that in our dataframe we use multilevel columns. But I 
suppose that is not an issue?
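
For anyone trying to reproduce this without the original data, a minimal 
sketch of a comparable frame follows. It assumes NumPy and pandas are 
installed (plus pyarrow for the Parquet round trip). Only the shape 
(61, 31835) and the float dtype come from the comment above; the two-level 
column labels, the helper names make_wide_frame and time_read, and the random 
values are made up for illustration.

```python
import time

import numpy as np
import pandas as pd


def make_wide_frame(n_rows=61, n_cols=31835, seed=0):
    """Build a float DataFrame with two-level columns (hypothetical labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: the real column hierarchy is unknown, so columns are
    # simply grouped in fives under made-up "g<N>" / "f<N>" labels.
    top = [f"g{i // 5}" for i in range(n_cols)]
    bottom = [f"f{i % 5}" for i in range(n_cols)]
    columns = pd.MultiIndex.from_arrays([top, bottom], names=["group", "field"])
    return pd.DataFrame(rng.random((n_rows, n_cols)), columns=columns)


def time_read(path, repeats=3):
    """Return the best wall-clock time in seconds for pd.read_parquet(path)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        pd.read_parquet(path)
        timings.append(time.perf_counter() - start)
    return min(timings)
```

Writing the frame once with make_wide_frame().to_parquet("wide.parquet") and 
then calling time_read("wide.parquet") under 0.14.1 and under 0.15.0 may help 
check whether the slowdown shows up with synthetic data of the same shape.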

 



[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Description: 
Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
 In [4]: %timeit df = pd.read_parquet(path)
 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
 In [5]: %timeit df = pd.read_parquet(path)
 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 

Edit1:

Some profiling I did:

0.14.1:

!image-2019-10-14-18-12-07-652.png!

 

0.15.0:

!image-2019-10-14-18-10-42-850.png!

 

  was:
Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 




[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-12-07-652.png



[jira] [Updated] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bob updated ARROW-6876:
---
Attachment: image-2019-10-14-18-10-42-850.png



[jira] [Created] (ARROW-6876) Reading parquet file becomes really slow for 0.15.0

2019-10-14 Thread Bob (Jira)
Bob created ARROW-6876:
--

 Summary: Reading parquet file becomes really slow for 0.15.0
 Key: ARROW-6876
 URL: https://issues.apache.org/jira/browse/ARROW-6876
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.15.0
 Environment: python3.7
Reporter: Bob


Hi,

 

I just noticed that reading a parquet file becomes really slow after I upgraded 
to 0.15.0 when using pandas.

 

Example:

*With 0.14.1*
In [4]: %timeit df = pd.read_parquet(path)
2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*With 0.15.0*
In [5]: %timeit df = pd.read_parquet(path)
22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

 

The file is about 15MB in size. I am testing on the same machine using the same 
version of python and pandas.

 

Have you received similar complaints? What could be the issue here?

 

Thanks a lot.

 

 


