[jira] [Comment Edited] (ARROW-2709) [Python] write_to_dataset poor performance when splitting

Lee June Woo (JIRA) Wed, 26 Dec 2018 01:21:56 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728948#comment-16728948
 ]


Lee June Woo edited comment on ARROW-2709 at 12/26/18 9:20 AM:
---------------------------------------------------------------

Hello,

May I ask a simple question about this improvement? I think it would be 
more efficient to split the pandas DataFrame on the "dt" column before 
converting it to an Arrow table.

Do you have any plans to implement a group-by operation for Arrow tables or 
to improve the write_to_dataset function?


was (Author: na11an):
Hello,

May I ask a simple question about this improvement? I think it would be 
more efficient to split the pandas DataFrame on the "dt" column before 
converting it to an Arrow table.

Do you have any plans to implement a group-by operation for Arrow tables or 
to improve the write_to_dataset function? I would like to work on this issue 
and contribute to the project as much as I can, if there is no time constraint.

> [Python] write_to_dataset poor performance when splitting
> ---------------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> [https://github.com/apache/arrow/issues/2138]
>  
> {code:java}
> import pandas as pd 
> import numpy as np 
> import pyarrow.parquet as pq 
> import pyarrow as pa 
> # One row per minute over two months.
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000',
>                     freq='T')
> df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
>                    'string_col': pd.util.testing.rands_array(8, len(idx))},
>                   index=idx){code}
>  
> {code:java}
> df["dt"] = df.index
> df["dt"] = df["dt"].dt.date
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'],
>                     flavor='spark'){code}
>  
> {{This works but is inefficient memory-wise. The Arrow table is a copy of the 
> large pandas DataFrame and quickly saturates the RAM.}}
>  
> {{Thanks!}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
