[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-09-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929175#comment-16929175
 ] 

Joris Van den Bossche edited comment on ARROW-5220 at 9/13/19 1:15 PM:
---

I started looking into this (as it was tagged for 0.15.0), but it has quite 
a few consequences and corner cases. Assume we start expecting the index to 
also be present in the schema, if one is specified:

- Are we OK with raising an error if the index is not in the schema but would 
be written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} when the index is not a RangeIndex? 
  This will break some current usage (but we can probably do it with a 
deprecation first).
- Should we follow the order of the columns in the schema, also for the index? 
(Currently the index is always appended after the other columns.)
- What if an index is specified in the schema but {{preserve_index=False}}?
- What if there are multiple index levels (a MultiIndex), but only one of them 
is specified in the schema? (In the case of columns, a column that is not 
in the schema is ignored.)
- What if the index is specified in the schema, but is actually a RangeIndex, 
which would otherwise be serialized as metadata?
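
To make these questions concrete, here is a minimal sketch in plain Python of one possible resolution policy. It is not pyarrow's implementation: the function {{plan_conversion}} and its arguments are hypothetical, and the choices it makes (follow the schema order, skip columns and index levels not named in the schema, raise when the schema names an index level but {{preserve_index=False}}) are just one set of answers. The RangeIndex case is left out.

```python
from typing import List, Optional, Tuple


def plan_conversion(
    columns: List[str],
    index_levels: List[str],
    schema_names: List[str],
    preserve_index: Optional[bool] = None,
) -> List[Tuple[str, str]]:
    """Hypothetical model: decide, in schema order, where each field comes from.

    Returns (name, source) pairs, where source is "column" or "index".
    Columns and index levels that the schema does not name are skipped.
    """
    plan = []
    for name in schema_names:
        if name in columns:
            plan.append((name, "column"))
        elif name in index_levels:
            if preserve_index is False:
                # One possible answer to the preserve_index=False question.
                raise ValueError(
                    f"{name!r} is an index level but preserve_index=False"
                )
            plan.append((name, "index"))
        else:
            raise KeyError(f"{name!r} is neither a column nor an index level")
    return plan


# The schema order is followed, so the index is not appended last.
print(plan_conversion(["a", "b"], ["index"], ["index", "a", "b"]))
# MultiIndex with only one level named in the schema: the other is skipped.
print(plan_conversion(["a"], ["lvl0", "lvl1"], ["a", "lvl1"]))
```

Under this policy the answer to the first question is "no error": an index that the schema does not mention is silently dropped, which is exactly the behaviour change that would need a deprecation.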


was (Author: jorisvandenbossche):
I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?
- What if there are multiple index levels (a MultiIndex), but only one of them 
is specified in the schema? (in the case of columns, a column that is not 
in the schema is ignored)
- What if the index is specified in the schema, but is actually a RangeIndex 
which would otherwise be serialized as metadata?

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -
>
> Key: ARROW-5220
> URL: https://issues.apache.org/jira/browse/ARROW-5220
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Minor
> Fix For: 0.15.0
>
>
> The {{Table.from_pandas}} method allows specifying a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But if you also want to specify the type of the index, you get an error:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> # The schema names the index ('index') alongside the regular columns
> my_schema = pa.schema([('index', pa.string()),
>                        ('a', pa.int64()),
>                        ('b', pa.float64())])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: unknown columns in general also give this error (columns 
> specified in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (e.g. I noticed this 
> from the example code in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> (So before, also specifying the index in the schema "worked" in the sense 
> that it didn't raise an error, but it was also ignored, so it didn't actually 
> do what you would expect.)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}}, so that the example above works (although this might be 
> complicated with a RangeIndex, which is no longer serialized as a column).
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with continuing to raise an error as it 
> does now (the error message could then be improved)? Or do we again want to 
> ignore them? (Or it could also add them to the table as all-null columns.)
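
The three options raised in the last question (raise, ignore, or fill with nulls) can be sketched in plain Python. This is only an illustration on a dict of lists, not pyarrow code; the function {{select_columns}} and its {{on_missing}} parameter are hypothetical names.

```python
from typing import Dict, List


def select_columns(
    data: Dict[str, list],
    schema_names: List[str],
    on_missing: str = "raise",
) -> Dict[str, list]:
    """Hypothetical model of three policies for schema names absent from the data."""
    if on_missing not in ("raise", "ignore", "null"):
        raise ValueError(f"unknown policy {on_missing!r}")
    # All columns are assumed to have the same length.
    n_rows = len(next(iter(data.values()))) if data else 0
    result = {}
    for name in schema_names:
        if name in data:
            result[name] = data[name]
        elif on_missing == "raise":
            raise KeyError(f"field {name!r} from the schema is not in the data")
        elif on_missing == "null":
            # Add the unknown field as an all-null column.
            result[name] = [None] * n_rows
        # on_missing == "ignore": drop the unknown field silently.
    return result


data = {"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]}
print(select_columns(data, ["a", "c"], on_missing="ignore"))
print(select_columns(data, ["a", "c"], on_missing="null"))
```

The "ignore" policy corresponds to what the description reports for pyarrow 0.11, and "raise" to the behaviour at the time of the report; "null" is the third alternative floated at the end.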



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-09-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929175#comment-16929175
 ] 

Joris Van den Bossche edited comment on ARROW-5220 at 9/13/19 1:14 PM:
---

I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?
- What if there are multiple index levels (a MultiIndex), but only one of them 
is specified in the schema? (in the case of columns, a column that is not 
in the schema is ignored)
- What if the index is specified in the schema, but is actually a RangeIndex 
which would otherwise be serialized as metadata?


was (Author: jorisvandenbossche):
I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?
- What if there are multiple index levels (a MultiIndex), but only one of them 
is specified in the schema? (in the case of columns, a column that is not 
in the schema is ignored)




[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-09-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929175#comment-16929175
 ] 

Joris Van den Bossche edited comment on ARROW-5220 at 9/13/19 1:07 PM:
---

I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?
- What if there are multiple index levels (a MultiIndex), but only one of them 
is specified in the schema? (in the case of columns, a column that is not 
in the schema is ignored)



was (Author: jorisvandenbossche):
I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?





[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-09-13 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929175#comment-16929175
 ] 

Joris Van den Bossche edited comment on ARROW-5220 at 9/13/19 1:03 PM:
---

I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)
- What if an index is specified in the schema but {{preserve_index=False}} ?




was (Author: jorisvandenbossche):
I started looking into this (as it was tagged for 0.15.0), but it has some 
consequences. So assume we start expecting the index also to be present in the 
schema, if specified:

- are we OK with erroring if the index is not in the schema but would be 
written as a column? And only if {{preserve_index=True}}, or also with 
{{preserve_index=None}} in case the index is not a RangeIndex ? 
  This will break some current usage (but can probably do it with a deprecation 
first)
- We should follow the order of the columns in the schema, also for the index? 
(currently the index is always appended to the other columns)






[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-08-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913752#comment-16913752
 ] 

Wes McKinney edited comment on ARROW-5220 at 8/22/19 10:30 PM:
---

I'm in theory on board with that idea


was (Author: wesmckinn):
I'm in theory on board with that idae



[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-06-13 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863260#comment-16863260
 ] 

Benjamin Kietzman edited comment on ARROW-5220 at 6/13/19 4:49 PM:
---

I agree that the schema should be the single point of truth and it seems most 
reasonable to raise an error when a field in the schema does not correspond to 
a column in the DataFrame.

Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

I.e., ensure the following works as you intend (it currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}

This would disambiguate when a name should be removed from the schema because 
it refers to the dataframe's index.


was (Author: bkietz):
Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

IE, ensure the following works as you intend (currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}

This would disambiguate when a name should be removed from the schema because 
it refers to the dataframe's index




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-06-13 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863260#comment-16863260
 ] 

Benjamin Kietzman edited comment on ARROW-5220 at 6/13/19 4:46 PM:
---

Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

IE, ensure the following works as you intend (currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}

This would disambiguate when a name should be removed from the schema because 
it refers to the dataframe's index


was (Author: bkietz):
Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

IE, ensure the following works as you intend (currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}



[jira] [Comment Edited] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-06-13 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863260#comment-16863260
 ] 

Benjamin Kietzman edited comment on ARROW-5220 at 6/13/19 4:43 PM:
---

Would it be an acceptable solution to require {{preserve_index=True}} when 
specifying the type of the index with a schema?

IE, ensure the following works as you intend (currently fails):
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}


was (Author: bkietz):
Also fails:
{code}
table = pa.Table.from_pandas(df, schema=my_schema, preserve_index=True)
{code}

Would it be an acceptable solution to require {{preserve_index}} when 
specifying the type of the index with a schema?
