[jira] [Updated] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-09-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5220:
--
Labels: pull-request-available  (was: )

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>    ('a', pa.int64()),
>    ('b', pa.float64()),
>   ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: also unknown columns in general give this error (column specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (eg noticed this from 
> the code in example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before also specifying the index in the schema "worked" in the sense that 
> it didn't raise an error, but it was also ignored, so didn't actually do what 
> you would expect)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with keep raising an error as it is now 
> (the error message could be improved then)? Or do we again want to ignore 
> them? (or, it could actually also add them as all nulls to the table)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-08-23 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5220:
-
Fix Version/s: 0.15.0

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
> Fix For: 0.15.0
>
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>    ('a', pa.int64()),
>    ('b', pa.float64()),
>   ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: also unknown columns in general give this error (column specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (eg noticed this from 
> the code in example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before also specifying the index in the schema "worked" in the sense that 
> it didn't raise an error, but it was also ignored, so didn't actually do what 
> you would expect)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with keep raising an error as it is now 
> (the error message could be improved then)? Or do we again want to ignore 
> them? (or, it could actually also add them as all nulls to the table)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5220:
-
Description: 
The {{Table.from_pandas}} method allows to specify a schema ("This can be used 
to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
   ('a', pa.int64()),
   ('b', pa.float64()),
  ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: also unknown columns in general give this error (column specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (eg noticed this from the 
code in example in ARROW-3861). So before, unknown columns in the specified 
schema were ignored, while now they raise an error. Was this a conscious 
change?  
So before also specifying the index in the schema "worked" in the sense that it 
didn't raise an error, but it was also ignored, so didn't actually do what you 
would expect)

Questions:

- I think that we should support specifying the index in the passed {{schema}} 
? So that the example above works (although this might be complicated with 
RangeIndex that is not serialized any more)
- But what to do in general with additional columns in the schema that are not 
in the DataFrame? Are we fine with keep raising an error as it is now (the 
error message could be improved then)? Or do we again want to ignore them? (or, 
it could actually also add them as all nulls to the table)

  was:
The {{Table.from_pandas}} method allows to specify a schema ("This can be used 
to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame(\{'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
   ('a', pa.int64()),
   ('b', pa.float64()),
  ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: also unknown columns in general give this error (column specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11, this did not give an error (eg noticed this from the 
code in example in ARROW-3861). So before, unknown columns in the specified 
schema were ignored, while now they raise an error. Was this a conscious 
change?  
So before also specifying the index in the schema "worked" in the sense that it 
didn't raise an error, but it was also ignored, so didn't actually do what you 
would expect)

Questions:

- I think that we should support specifying the index in the passed {{schema}} 
? So that the example above works (although this might be complicated with 
RangeIndex that is not serialized any more)
- But what to do in general with additional columns in the schema that are not 
in the DataFrame? Are we fine with keep raising an error as it is now (the 
error message could be improved then)? Or do we again want to ignore them? (or, 
it could actually also add them as all nulls to the table)


> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Minor
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>