[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369503#comment-16369503
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

cpcloud commented on issue #1618: ARROW-2160: [C++/Python]  Fix decimal 
precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-366798968
 
 
   @wesm @pitrou I reintroduced boost regex for parsing the decimal number, to 
address ARROW-2153. We were using a fairly straightforward algorithm before 
that implemented the regular expression by hand but it didn't allow numbers 
like `1e1`. The regular expression to match that would be very complex and hard 
to read if written by hand so I decided to use boost regex. I didn't go with 
STL library because it doesn't support named capture groups, which I think make 
the code much more readable.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks arrow is inferring the highest precision for given column based on the 
> first cell and expecting the rest fits in. I understand this is by design but 
> from the point of view of pandas-arrow compatibility this is quite painful as 
> pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369311#comment-16369311
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

cpcloud commented on a change in pull request #1618: ARROW-2160: [C++/Python]  
Fix decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#discussion_r169129512
 
 

 ##
 File path: cpp/src/arrow/python/helpers.cc
 ##
 @@ -99,7 +99,8 @@ Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* precisio
   DCHECK_NE(precision, NULLPTR);
   DCHECK_NE(scale, NULLPTR);
 
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "()"));
+  // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
 
 Review comment:
   I have a couple more tests to add here


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks arrow is inferring the highest precision for given column based on the 
> first cell and expecting the rest fits in. I understand this is by design but 
> from the point of view of pandas-arrow compatibility this is quite painful as 
> pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369286#comment-16369286
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

wesm commented on a change in pull request #1618: ARROW-2160: [C++/Python]  Fix 
decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#discussion_r169121895
 
 

 ##
 File path: cpp/src/arrow/python/helpers.cc
 ##
 @@ -99,7 +99,8 @@ Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* precisio
   DCHECK_NE(precision, NULLPTR);
   DCHECK_NE(scale, NULLPTR);
 
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "()"));
+  // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
 
 Review comment:
   Do you want to wait for ARROW-2145 or merge this independently?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks arrow is inferring the highest precision for given column based on the 
> first cell and expecting the rest fits in. I understand this is by design but 
> from the point of view of pandas-arrow compatibility this is quite painful as 
> pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367876#comment-16367876
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

cpcloud commented on a change in pull request #1618: ARROW-2160: [C++/Python]  
Fix decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#discussion_r168868803
 
 

 ##
 File path: cpp/src/arrow/python/helpers.cc
 ##
 @@ -99,7 +99,8 @@ Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* precisio
   DCHECK_NE(precision, NULLPTR);
   DCHECK_NE(scale, NULLPTR);
 
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "()"));
+  // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
 
 Review comment:
   This requires ARROW-2145 to address.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks arrow is inferring the highest precision for given column based on the 
> first cell and expecting the rest fits in. I understand this is by design but 
> from the point of view of pandas-arrow compatibility this is quite painful as 
> pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367872#comment-16367872
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

cpcloud commented on a change in pull request #1618: ARROW-2160: [C++/Python]  
Fix decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#discussion_r168868803
 
 

 ##
 File path: cpp/src/arrow/python/helpers.cc
 ##
 @@ -99,7 +99,8 @@ Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* precisio
   DCHECK_NE(precision, NULLPTR);
   DCHECK_NE(scale, NULLPTR);
 
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "()"));
+  // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
 
 Review comment:
   This requires ARROW-2160 to address.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks arrow is inferring the highest precision for given column based on the 
> first cell and expecting the rest fits in. I understand this is by design but 
> from the point of view of pandas-arrow compatibility this is quite painful as 
> pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)