[jira] [Commented] (ARROW-2182) [Python] ASV benchmark setup does not account for C++ library changing

2018-02-19 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369788#comment-16369788
 ] 

Antoine Pitrou commented on ARROW-2182:
---

Ideally we would hook into ASV so that we could use commands such as "asv run" 
and collect results automatically for display in the web UI. Unfortunately, ASV 
currently hardcodes its calls to "setup.py build" and "pip install".

> [Python] ASV benchmark setup does not account for C++ library changing
> --
>
> Key: ARROW-2182
> URL: https://issues.apache.org/jira/browse/ARROW-2182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> See https://github.com/apache/arrow/blob/master/python/README-benchmarks.md
> Perhaps we could create a helper script that will run all the 
> currently-defined benchmarks for a specific commit, and ensure that we are 
> running against pristine, up-to-date release builds of Arrow (and any other 
> dependencies, like parquet-cpp) at that commit? 
> cc [~pitrou]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369757#comment-16369757
 ] 

ASF GitHub Bot commented on ARROW-2172:
---

pitrou commented on a change in pull request #1628: ARROW-2172: [C++/Python] 
Fix converting from Numpy array with non-natural stride
URL: https://github.com/apache/arrow/pull/1628#discussion_r169225590
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -554,12 +554,22 @@ Status StaticCastBuffer(const Buffer& input, const int64_t length, MemoryPool* p
   return Status::OK();
 }
 
-template <typename T, typename T2>
-void CopyStrided(T* input_data, int64_t length, int64_t stride, T2* output_data) {
+template <typename T>
+void CopyStridedBytewise(int8_t* input_data, int64_t length, int64_t stride,
+                         T* output_data) {
+  // Passing input_data as non-const is a concession to PyObject*
+  for (int64_t i = 0; i < length; ++i) {
+    memcpy(output_data + i, input_data, sizeof(T));
+    input_data += stride;
 
 Review comment:
   Well, the current code actually works as intended even with such arrays:
   ```
   >>> base = np.arange(8, dtype=np.int8).view(np.int32)
   >>> arr = np.lib.stride_tricks.as_strided(base, shape=(5,), strides=(1,))
   >>> arr
   array([ 50462976,  67305985,  84148994, 100992003, 117835012], dtype=int32)
   >>> arr.tobytes()
   b'\x00\x01\x02\x03\x01\x02\x03\x04\x02\x03\x04\x05\x03\x04\x05\x06\x04\x05\x06\x07'
   >>> pa.array(arr, type=pa.int32())
   <pyarrow.lib.Int32Array object at 0x...>
   [
 50462976,
 67305985,
 84148994,
 100992003,
 117835012
   ]
   ```
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Incorrect conversion from Numpy array when stride % itemsize != 0
> --
>
> Key: ARROW-2172
> URL: https://issues.apache.org/jira/browse/ARROW-2172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> In the example below, the input array has a stride that's not a multiple of 
> the itemsize:
> {code:python}
> >>> data = np.array([(42, True), (43, False)],
> ...:                dtype=[('x', np.int32), ('y', np.bool_)])
> >>> data['x']
> array([42, 43], dtype=int32)
> >>> pa.array(data['x'], type=pa.int32())
> <pyarrow.lib.Int32Array object at 0x...>
> [
>   42,
>   11009
> ]
> {code}





[jira] [Commented] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369753#comment-16369753
 ] 

ASF GitHub Bot commented on ARROW-2172:
---

cpcloud commented on a change in pull request #1628: ARROW-2172: [C++/Python] 
Fix converting from Numpy array with non-natural stride
URL: https://github.com/apache/arrow/pull/1628#discussion_r169224812
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -554,12 +554,22 @@ Status StaticCastBuffer(const Buffer& input, const int64_t length, MemoryPool* p
   return Status::OK();
 }
 
-template <typename T, typename T2>
-void CopyStrided(T* input_data, int64_t length, int64_t stride, T2* output_data) {
+template <typename T>
+void CopyStridedBytewise(int8_t* input_data, int64_t length, int64_t stride,
+                         T* output_data) {
+  // Passing input_data as non-const is a concession to PyObject*
+  for (int64_t i = 0; i < length; ++i) {
+    memcpy(output_data + i, input_data, sizeof(T));
+    input_data += stride;
 
 Review comment:
   No, but maybe add a DCHECK?




> [Python] Incorrect conversion from Numpy array when stride % itemsize != 0
> --
>
> Key: ARROW-2172
> URL: https://issues.apache.org/jira/browse/ARROW-2172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> In the example below, the input array has a stride that's not a multiple of 
> the itemsize:
> {code:python}
> >>> data = np.array([(42, True), (43, False)],
> ...:                dtype=[('x', np.int32), ('y', np.bool_)])
> >>> data['x']
> array([42, 43], dtype=int32)
> >>> pa.array(data['x'], type=pa.int32())
> <pyarrow.lib.Int32Array object at 0x...>
> [
>   42,
>   11009
> ]
> {code}





[jira] [Commented] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369747#comment-16369747
 ] 

ASF GitHub Bot commented on ARROW-2172:
---

pitrou commented on a change in pull request #1628: ARROW-2172: [C++/Python] 
Fix converting from Numpy array with non-natural stride
URL: https://github.com/apache/arrow/pull/1628#discussion_r169223711
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -554,12 +554,22 @@ Status StaticCastBuffer(const Buffer& input, const int64_t length, MemoryPool* p
   return Status::OK();
 }
 
-template <typename T, typename T2>
-void CopyStrided(T* input_data, int64_t length, int64_t stride, T2* output_data) {
+template <typename T>
+void CopyStridedBytewise(int8_t* input_data, int64_t length, int64_t stride,
+                         T* output_data) {
+  // Passing input_data as non-const is a concession to PyObject*
+  for (int64_t i = 0; i < length; ++i) {
+    memcpy(output_data + i, input_data, sizeof(T));
+    input_data += stride;
 
 Review comment:
   In normal use, it probably is, but you can create weird arrays using 
`stride_tricks`: 
https://docs.scipy.org/doc/numpy/reference/generated/numpy.lib.stride_tricks.as_strided.html
   
   That said, I don't think it causes a problem here, does it?




> [Python] Incorrect conversion from Numpy array when stride % itemsize != 0
> --
>
> Key: ARROW-2172
> URL: https://issues.apache.org/jira/browse/ARROW-2172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> In the example below, the input array has a stride that's not a multiple of 
> the itemsize:
> {code:python}
> >>> data = np.array([(42, True), (43, False)],
> ...:                dtype=[('x', np.int32), ('y', np.bool_)])
> >>> data['x']
> array([42, 43], dtype=int32)
> >>> pa.array(data['x'], type=pa.int32())
> <pyarrow.lib.Int32Array object at 0x...>
> [
>   42,
>   11009
> ]
> {code}





[jira] [Closed] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud closed ARROW-2188.

Resolution: Fixed

This is no longer a problem.

> [JS] Error on Travis-CI during gulp build
> -
>
> Key: ARROW-2188
> URL: https://issues.apache.org/jira/browse/ARROW-2188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Failing builds:
> https://travis-ci.org/apache/arrow/jobs/343649349
> https://travis-ci.org/apache/arrow/jobs/343649353
> Error message:
> {code}
> Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
> {code}





[jira] [Commented] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Kouhei Sutou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369743#comment-16369743
 ] 

Kouhei Sutou commented on ARROW-2189:
-

You can't use Ubuntu Trusty packages on Debian Jessie.

Can you try using Debian Stretch or Ubuntu Trusty instead? Each has packages 
built for its own platform.

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Commented] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369733#comment-16369733
 ] 

Wes McKinney commented on ARROW-2189:
-

Seems like this could be a mixed C++ standard library toolchain issue -- [~kou], 
do you know what the policy is for using Trusty packages on the Debian Jessie 
platform?

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Comment Edited] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Rares Vernica (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369721#comment-16369721
 ] 

Rares Vernica edited comment on ARROW-2189 at 2/20/18 5:02 AM:
---

Does this help?
{code:java}
> g++-4.9 -ggdb -std=c++11 -larrow foo.cpp 
> strace ./a.out 
strace: test_ptrace_setoptions_for_all: PTRACE_TRACEME doesn't work: Operation 
not permitted
strace: test_ptrace_setoptions_for_all: unexpected exit status 1
> gdb ./a.out 
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...done.
(gdb) run
Starting program: /a.out 
warning: Error disabling address space randomization: Operation not permitted
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) strace 
No default breakpoint address now.
(gdb) backtrace 
No stack.
{code}
Otherwise I can get you a Dockerfile.

I see a bunch of SELinux errors on the host, which is Fedora 27, every time 
this crashes.


was (Author: rvernica):
Does this help?
{code:java}
> g++-4.9 -ggdb -std=c++11 -larrow foo.cpp 
 > strace ./a.out 
strace: test_ptrace_setoptions_for_all: PTRACE_TRACEME doesn't work: Operation 
not permitted
strace: test_ptrace_setoptions_for_all: unexpected exit status 1
> gdb ./a.out 
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...done.
(gdb) run
Starting program: /a.out 
warning: Error disabling address space randomization: Operation not permitted
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) strace 
No default breakpoint address now.
(gdb) backtrace 
No stack.
{code}
Otherwise I can get you a Dockerfile.

I see a bunch of SELinux errors on the host, which is Fedora 27, every time 
this crashes.

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Commented] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Rares Vernica (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369721#comment-16369721
 ] 

Rares Vernica commented on ARROW-2189:
--

Does this help?
{code:java}
> g++-4.9 -ggdb -std=c++11 -larrow foo.cpp 
 > strace ./a.out 
strace: test_ptrace_setoptions_for_all: PTRACE_TRACEME doesn't work: Operation 
not permitted
strace: test_ptrace_setoptions_for_all: unexpected exit status 1
> gdb ./a.out 
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...done.
(gdb) run
Starting program: /a.out 
warning: Error disabling address space randomization: Operation not permitted
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) strace 
No default breakpoint address now.
(gdb) backtrace 
No stack.
{code}
Otherwise I can get you a Dockerfile.

I see a bunch of SELinux errors on the host, which is Fedora 27, every time 
this crashes.

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Commented] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369712#comment-16369712
 ] 

Phillip Cloud commented on ARROW-2189:
--

Or, do you have a link to the Dockerfile, or the image up in a Docker registry?

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Assigned] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2188:


Assignee: Phillip Cloud

> [JS] Error on Travis-CI during gulp build
> -
>
> Key: ARROW-2188
> URL: https://issues.apache.org/jira/browse/ARROW-2188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Failing builds:
> https://travis-ci.org/apache/arrow/jobs/343649349
> https://travis-ci.org/apache/arrow/jobs/343649353
> Error message:
> {code}
> Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
> {code}





[jira] [Commented] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369711#comment-16369711
 ] 

Phillip Cloud commented on ARROW-2189:
--

Can you run {{gdb ./a.out}} and paste the stack trace here? I can't reproduce 
this.

> [C++] Seg. fault on make_shared
> ---
>
> Key: ARROW-2189
> URL: https://issues.apache.org/jira/browse/ARROW-2189
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
> Environment: Debian jessie in a Docker container
> libarrow-dev 0.8.0-2 (Ubuntu trusty)
>Reporter: Rares Vernica
>Priority: Major
>
> When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
> {{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it 
> works fine. Here is an example:
> {code:java}
> #include <arrow/api.h>
> int main()
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> arrow::Int64Builder builder(pool);
> builder.Append(1);
> // #1
> // std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
> // #2
> // std::shared_ptr<arrow::PoolBuffer> buffer;
> // buffer.reset(new arrow::PoolBuffer(pool));
> // #3
> auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
> }
> {code}
> {code:java}
> > g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
> Segmentation fault (core dumped)
> {code}
> The example works fine with {{#1}} or {{#2}} options. It also works if the 
> builder is commented out.





[jira] [Commented] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369708#comment-16369708
 ] 

Phillip Cloud commented on ARROW-2188:
--

Looks like we might've hit this: 
https://github.com/jonschlinkert/regex-not/issues/3

right around the time the patch was released here: 
https://github.com/jonschlinkert/regex-not/commit/335ef057744980b211a048f6b287b4690a9bc29f

> [JS] Error on Travis-CI during gulp build
> -
>
> Key: ARROW-2188
> URL: https://issues.apache.org/jira/browse/ARROW-2188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Failing builds:
> https://travis-ci.org/apache/arrow/jobs/343649349
> https://travis-ci.org/apache/arrow/jobs/343649353
> Error message:
> {code}
> Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
> {code}





[jira] [Commented] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369710#comment-16369710
 ] 

Phillip Cloud commented on ARROW-2188:
--

Restarted the builds to see if they pick up the new version.

> [JS] Error on Travis-CI during gulp build
> -
>
> Key: ARROW-2188
> URL: https://issues.apache.org/jira/browse/ARROW-2188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Failing builds:
> https://travis-ci.org/apache/arrow/jobs/343649349
> https://travis-ci.org/apache/arrow/jobs/343649353
> Error message:
> {code}
> Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
> {code}





[jira] [Created] (ARROW-2189) [C++] Seg. fault on make_shared

2018-02-19 Thread Rares Vernica (JIRA)
Rares Vernica created ARROW-2189:


 Summary: [C++] Seg. fault on make_shared
 Key: ARROW-2189
 URL: https://issues.apache.org/jira/browse/ARROW-2189
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.8.0
 Environment: Debian jessie in a Docker container
libarrow-dev 0.8.0-2 (Ubuntu trusty)

Reporter: Rares Vernica


When creating a {{PoolBuffer}}, I get a {{Segmentation fault}} when I use 
{{make_shared}}. If I use the {{shared_ptr}} constructor or {{reset}}, it works 
fine. Here is an example:
{code:java}
#include <arrow/api.h>

int main()
{
arrow::MemoryPool* pool = arrow::default_memory_pool();

arrow::Int64Builder builder(pool);
builder.Append(1);

// #1
// std::shared_ptr<arrow::PoolBuffer> buffer(new arrow::PoolBuffer(pool));
// #2
// std::shared_ptr<arrow::PoolBuffer> buffer;
// buffer.reset(new arrow::PoolBuffer(pool));
// #3
auto buffer = std::make_shared<arrow::PoolBuffer>(pool);
}
{code}
{code:java}
> g++-4.9 -std=c++11 -larrow foo.cpp && ./a.out
Segmentation fault (core dumped)
{code}
The example works fine with {{#1}} or {{#2}} options. It also works if the 
builder is commented out.





[jira] [Commented] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369686#comment-16369686
 ] 

Phillip Cloud commented on ARROW-2188:
--

cc [~paul.e.taylor] as well

> [JS] Error on Travis-CI during gulp build
> -
>
> Key: ARROW-2188
> URL: https://issues.apache.org/jira/browse/ARROW-2188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Priority: Major
>
> Failing builds:
> https://travis-ci.org/apache/arrow/jobs/343649349
> https://travis-ci.org/apache/arrow/jobs/343649353
> Error message:
> {code}
> Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
> {code}





[jira] [Created] (ARROW-2188) [JS] Error on Travis-CI during gulp build

2018-02-19 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2188:


 Summary: [JS] Error on Travis-CI during gulp build
 Key: ARROW-2188
 URL: https://issues.apache.org/jira/browse/ARROW-2188
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 0.8.0
Reporter: Phillip Cloud


Failing builds:

https://travis-ci.org/apache/arrow/jobs/343649349
https://travis-ci.org/apache/arrow/jobs/343649353

Error message:

{code}
Error: potentially unsafe regular expression: ^(?:(?!(?:[\[!*+?$^"'.\\/]+)).)+
{code}





[jira] [Created] (ARROW-2187) RFC: Organize language implementations in a top-level lib/ directory

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2187:
---

 Summary: RFC: Organize language implementations in a top-level 
lib/ directory
 Key: ARROW-2187
 URL: https://issues.apache.org/jira/browse/ARROW-2187
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


As we acquire more Arrow implementations, the number of top-level directories 
may grow significantly. We might consider nesting these implementations under a 
new top-level directory, similar to Apache Thrift: 
https://github.com/apache/thrift (see the "lib/" directory)





[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369669#comment-16369669
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853886
 
 
   @xuepanchen I added a template for the 8-bit hash function to avoid 
arithmetic in the uint8 case
   
   before this change:
   
   ```
   $ ./release/compute-benchmark --benchmark_filter=UInt8
   Run on (8 X 4399.69 MHz CPU s)
   2018-02-19 21:57:13
   Benchmark                                                     Time       CPU  Iterations
   ----------------------------------------------------------------------------------------
   BM_UniqueUInt8NoNulls/16M/200/min_time:1.000/real_time     8339 us   8339 us  166   1.87372GB/s
   BM_UniqueUInt8WithNulls/16M/200/min_time:1.000/real_time  28536 us  28537 us   49     560.7MB/s
   ```
   
   after this change:
   
   ```
   $ ./release/compute-benchmark --benchmark_filter=UInt8
   Run on (8 X 4400 MHz CPU s)
   2018-02-19 21:55:51
   Benchmark                                                     Time       CPU  Iterations
   ----------------------------------------------------------------------------------------
   BM_UniqueUInt8NoNulls/16M/200/min_time:1.000/real_time     7749 us   7749 us  180   2.01641GB/s
   BM_UniqueUInt8WithNulls/16M/200/min_time:1.000/real_time  28042 us  28042 us   50   570.571MB/s
   ```




> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8, 
> since a fixed-size lookup table can be used and avoid hashing altogether





[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369668#comment-16369668
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853563
 
 
   @jreback I thought so, thanks for confirming =)




> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8,
> since a fixed-size lookup table can be used to avoid hashing altogether





[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369667#comment-16369667
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

jreback commented on issue #1551: ARROW-1942: [C++] Hash table specializations 
for small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366853421
 
 
   fyi in pandas we currently promote to int64 for smaller item size for hash 
operations 




> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8,
> since a fixed-size lookup table can be used to avoid hashing altogether





[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369666#comment-16369666
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366852383
 
 
   Top-level numbers look OK to me:
   
   ```
   In [1]: import numpy as np
   
   In [2]: arr = np.random.randint(0, 200, size=1000)
   
   In [3]: import pyarrow as pa
   
   In [4]: pa
   Out[4]: 
   
   In [5]: parr = pa.array(arr)
   
   In [9]: timeit result = parr.unique()
   33.5 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [10]: import pandas as pd
   
   In [11]: timeit result2 = pd.unique(arr)
   25.7 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [12]: timeit result2 = np.unique(arr)
   296 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [13]: parr_int8 = pa.array(arr.astype('int8'))
   
   In [14]: arr_int8 = arr.astype('int8')
   
   In [15]: timeit result = parr_int8.unique()
   10.1 ms ± 99.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   
   In [16]: timeit result = pd.unique(arr_int8)
   35.3 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   In [17]: timeit result = np.unique(arr_int8)
   282 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   
   So we're about 30% slower than pandas for int64 at the moment (for this 
limited benchmark at least), which suggests plenty of room for improvement.
   
   Everything else looks good. +1, will merge on green build. Thanks 
@xuepanchen!




> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8,
> since a fixed-size lookup table can be used to avoid hashing altogether





[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369664#comment-16369664
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

cpcloud commented on a change in pull request #1620: ARROW-1632: [Python] 
Permit categorical conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#discussion_r169206589
 
 

 ##
 File path: python/pyarrow/table.pxi
 ##
 @@ -746,17 +746,22 @@ cdef class RecordBatch:
 
 
 def table_to_blocks(PandasOptions options, Table table, int nthreads,
-MemoryPool memory_pool):
+MemoryPool memory_pool, categories):
 cdef:
 PyObject* result_obj
 shared_ptr[CTable] c_table = table.sp_table
 CMemoryPool* pool
+unordered_set[c_string] categorical_columns
+
+if categories is not None:
+categorical_columns = [tobytes(cat) for cat in categories]
 
 Review comment:
   Can you make this a `set` comprehension? It's misleading to have a Python 
`list` automatically turned into a C++ `unordered_set` IMO.
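To illustrate the reviewer's point, here is a small Python comparison (with a stand-in `tobytes`; the real helper lives in pyarrow's Cython layer, so this is an assumption about its behavior):

```python
def tobytes(s):
    # Stand-in for pyarrow's tobytes() helper (assumption: it
    # UTF-8-encodes str inputs and passes bytes through unchanged).
    return s.encode('utf-8') if isinstance(s, str) else s

categories = ['a', 'b', 'a']

# List comprehension: an ordered sequence that keeps duplicates.
as_list = [tobytes(c) for c in categories]
# Set comprehension: duplicates collapse up front, mirroring the C++
# unordered_set the values are ultimately stored in.
as_set = {tobytes(c) for c in categories}

print(as_list)         # [b'a', b'b', b'a']
print(sorted(as_set))  # [b'a', b'b']
```

Using a set comprehension makes the Python-side semantics match the C++ container, instead of relying on Cython to coerce a list into an `unordered_set` silently.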




> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing





[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369661#comment-16369661
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

cpcloud commented on a change in pull request #1620: ARROW-1632: [Python] 
Permit categorical conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#discussion_r169206390
 
 

 ##
 File path: cpp/src/arrow/python/arrow_to_pandas.cc
 ##
 @@ -986,7 +987,8 @@ class CategoricalBlock : public PandasBlock {
 // Sniff the first chunk
 const std::shared_ptr<Array> arr_first = data.chunk(0);
 const auto& dict_arr_first = static_cast<const DictionaryArray&>(*arr_first);
-const auto& indices_first = static_cast<const PrimitiveArray&>(*dict_arr_first.indices());
+const std::shared_ptr<Array> indices_first =
 
 Review comment:
   This can use `auto`.




> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing





[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369658#comment-16369658
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

cpcloud commented on a change in pull request #1620: ARROW-1632: [Python] 
Permit categorical conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#discussion_r169205931
 
 

 ##
 File path: cpp/src/arrow/python/arrow_to_pandas.cc
 ##
 @@ -1771,7 +1790,33 @@ Status ConvertColumnToPandas(PandasOptions options, 
 const std::shared_ptr<Table>& table,
 int nthreads, MemoryPool* pool, PyObject** out) {
-  DataFrameBlockCreator helper(options, table, pool);
+  return ConvertTableToPandas(options, std::unordered_set<std::string>(), table, nthreads,
+  pool, out);
+}
+
+Status ConvertTableToPandas(PandasOptions options,
+const std::unordered_set<std::string>& categorical_columns,
+const std::shared_ptr<Table>& table, int nthreads,
+MemoryPool* pool, PyObject** out) {
+  std::shared_ptr<Table> current_table = table;
+  if (categorical_columns.size() > 0) {
+FunctionContext ctx;
+for (int64_t i = 0; i < table->num_columns(); i++) {
+  const Column& col = *table->column(i);
+  if (categorical_columns.count(col.name())) {
+Datum out;
+DictionaryEncode(&ctx, Datum(col.data()), &out);
+std::shared_ptr<ChunkedArray> array = out.chunked_array();
+std::shared_ptr<Field> field = std::make_shared<Field>(
 
 Review comment:
   Do we have a top level `::arrow::field` function that does this? If not, 
then this call should use `auto` since `Field` is already mentioned in the call 
to `make_shared`.




> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing





[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369659#comment-16369659
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

cpcloud commented on a change in pull request #1620: ARROW-1632: [Python] 
Permit categorical conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#discussion_r169205957
 
 

 ##
 File path: cpp/src/arrow/python/arrow_to_pandas.cc
 ##
 @@ -1771,7 +1790,33 @@ Status ConvertColumnToPandas(PandasOptions options, 
 const std::shared_ptr<Table>& table,
 int nthreads, MemoryPool* pool, PyObject** out) {
-  DataFrameBlockCreator helper(options, table, pool);
+  return ConvertTableToPandas(options, std::unordered_set<std::string>(), table, nthreads,
+  pool, out);
+}
+
+Status ConvertTableToPandas(PandasOptions options,
+const std::unordered_set<std::string>& categorical_columns,
+const std::shared_ptr<Table>& table, int nthreads,
+MemoryPool* pool, PyObject** out) {
+  std::shared_ptr<Table> current_table = table;
+  if (categorical_columns.size() > 0) {
+FunctionContext ctx;
+for (int64_t i = 0; i < table->num_columns(); i++) {
+  const Column& col = *table->column(i);
+  if (categorical_columns.count(col.name())) {
+Datum out;
+DictionaryEncode(&ctx, Datum(col.data()), &out);
+std::shared_ptr<ChunkedArray> array = out.chunked_array();
+std::shared_ptr<Field> field = std::make_shared<Field>(
+col.name(), array->type(), col.field()->nullable(), col.field()->metadata());
+std::shared_ptr<Column> column = std::make_shared<Column>(field, array);
 
 Review comment:
   This should use `auto` as well.




> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing





[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369657#comment-16369657
 ] 

ASF GitHub Bot commented on ARROW-1632:
---

cpcloud commented on a change in pull request #1620: ARROW-1632: [Python] 
Permit categorical conversions in Table.to_pandas on a per-column basis
URL: https://github.com/apache/arrow/pull/1620#discussion_r169205821
 
 

 ##
 File path: cpp/src/arrow/python/arrow_to_pandas.cc
 ##
 @@ -1771,7 +1790,33 @@ Status ConvertColumnToPandas(PandasOptions options, 
 const std::shared_ptr<Table>& table,
 int nthreads, MemoryPool* pool, PyObject** out) {
-  DataFrameBlockCreator helper(options, table, pool);
+  return ConvertTableToPandas(options, std::unordered_set<std::string>(), table, nthreads,
+  pool, out);
+}
+
+Status ConvertTableToPandas(PandasOptions options,
+const std::unordered_set<std::string>& categorical_columns,
+const std::shared_ptr<Table>& table, int nthreads,
+MemoryPool* pool, PyObject** out) {
+  std::shared_ptr<Table> current_table = table;
+  if (categorical_columns.size() > 0) {
 
 Review comment:
   Any reason not to use `!categorical_columns.empty()`?




> [Python] Permit categorical conversions in Table.to_pandas on a per-column 
> basis
> 
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently this is all or nothing





[jira] [Created] (ARROW-2186) [C++] Clean up architecture specific compiler flags

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2186:
---

 Summary: [C++] Clean up architecture specific compiler flags
 Key: ARROW-2186
 URL: https://issues.apache.org/jira/browse/ARROW-2186
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


I noticed that {{-maltivec}} is being passed to the compiler on Linux, with an 
x86_64 processor. That seemed odd to me. It prompted me to look more generally 
at our compiler flags related to hardware optimizations. We have the ability to 
pass {{-msse3}}, but there is a {{ARROW_USE_SSE}} which is only used as a 
define in some headers. There is {{ARROW_ALTIVEC}}, but no option to pass 
{{-march}}. Nothing related to AVX/AVX2/AVX512. I think this could do with an
overhaul.





[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369651#comment-16369651
 ] 

ASF GitHub Bot commented on ARROW-1942:
---

wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for 
small integers
URL: https://github.com/apache/arrow/pull/1551#issuecomment-366849040
 
 
   I squashed the branch. Having a look at compute-benchmark




> [C++] Hash table specializations for small integers
> ---
>
> Key: ARROW-1942
> URL: https://issues.apache.org/jira/browse/ARROW-1942
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> There is no need to use a dynamically-sized hash table with uint8, int8,
> since a fixed-size lookup table can be used to avoid hashing altogether





[jira] [Commented] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369648#comment-16369648
 ] 

ASF GitHub Bot commented on ARROW-2175:
---

wesm closed pull request #1630: ARROW-2175: [Python] Install Arrow libraries in 
Travis CI builds when only Python directory is affected
URL: https://github.com/apache/arrow/pull/1630
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.travis.yml b/.travis.yml
index 73a9f4642..a4c74657e 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -61,7 +61,8 @@ matrix:
 - $TRAVIS_BUILD_DIR/ci/travis_install_linux.sh
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
 - $TRAVIS_BUILD_DIR/ci/travis_lint.sh
-- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then 
$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
+# If either C++ or Python changed, we must install the C++ libraries
+- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
 script:
 - if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then 
$TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
 - $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
@@ -81,7 +82,8 @@ matrix:
 - ARROW_BUILD_WARNING_LEVEL=CHECKIN
 before_script:
 - if [ $ARROW_CI_PYTHON_AFFECTED != "1" ]; then exit; fi
-- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then travis_wait 50 
$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
+# If either C++ or Python changed, we must install the C++ libraries
+- travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
 script:
 - if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then 
$TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
 - $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
diff --git a/ci/travis_build_parquet_cpp.sh b/ci/travis_build_parquet_cpp.sh
index fc4ae72c1..7d2e3ab73 100755
--- a/ci/travis_build_parquet_cpp.sh
+++ b/ci/travis_build_parquet_cpp.sh
@@ -22,9 +22,8 @@ source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh
 
 source $TRAVIS_BUILD_DIR/ci/travis_install_toolchain.sh
 
-export PARQUET_ARROW_VERSION=$(git rev-parse HEAD)
-
 export PARQUET_BUILD_TOOLCHAIN=$CPP_TOOLCHAIN
+export ARROW_HOME=$ARROW_CPP_INSTALL
 
 PARQUET_DIR=$TRAVIS_BUILD_DIR/parquet
 mkdir -p $PARQUET_DIR
diff --git a/python/README.md b/python/README.md
index 38c994013..e2ed9db6f 100644
--- a/python/README.md
+++ b/python/README.md
@@ -19,9 +19,9 @@
 
 ## Python library for Apache Arrow
 
-This library provides a Pythonic API wrapper for the reference Arrow C++
-implementation, along with tools for interoperability with pandas, NumPy, and
-other traditional Python scientific computing packages.
+This library provides a Python API for functionality provided by the Arrow C++
+libraries, along with tools for Arrow integration and interoperability with
+pandas, NumPy, and other software in the Python ecosystem.
 
 ## Installing
 


 




> [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI
> ---
>
> Key: ARROW-2175
> URL: https://issues.apache.org/jira/browse/ARROW-2175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
> related to upstream changes in Parquet





[jira] [Resolved] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2175.
-
Resolution: Fixed

Issue resolved by pull request 1630
[https://github.com/apache/arrow/pull/1630]

> [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI
> ---
>
> Key: ARROW-2175
> URL: https://issues.apache.org/jira/browse/ARROW-2175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
> related to upstream changes in Parquet





[jira] [Created] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2185:
---

 Summary: Remove CI directives from squashed commit messages
 Key: ARROW-2185
 URL: https://issues.apache.org/jira/browse/ARROW-2185
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney
 Fix For: 0.9.0


In our PR squash tool, we are potentially picking up CI directives like {{[skip 
appveyor]}} from intermediate commits. We should regex these away and instead 
use directives in the PR title if we wish the commit to master to behave in a
certain way.
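A possible shape for that regex cleanup, in Python (the directive list and function name here are assumptions for illustration, not the merge tool's actual behavior):

```python
import re

# Hypothetical pattern: strip CI directives such as "[skip appveyor]"
# or "[skip ci]" from intermediate commit messages before squashing.
CI_DIRECTIVE = re.compile(r'\s*\[skip\s+(?:ci|appveyor|travis)\]', re.IGNORECASE)

def clean_commit_message(msg):
    # Remove every directive occurrence and trim leftover whitespace.
    return CI_DIRECTIVE.sub('', msg).strip()

print(clean_commit_message("ARROW-2185: Fix flaky test [skip appveyor]"))
# ARROW-2185: Fix flaky test
```

Directives intended for the squashed commit would then be taken only from the PR title, which survives the cleanup untouched.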





[jira] [Commented] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369646#comment-16369646
 ] 

ASF GitHub Bot commented on ARROW-2175:
---

wesm commented on issue #1630: ARROW-2175: [Python] Install Arrow libraries in 
Travis CI builds when only Python directory is affected
URL: https://github.com/apache/arrow/pull/1630#issuecomment-366847597
 
 
   +1, verified from build logs that all is well. This does not impact Windows




> [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI
> ---
>
> Key: ARROW-2175
> URL: https://issues.apache.org/jira/browse/ARROW-2175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
> related to upstream changes in Parquet





[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369640#comment-16369640
 ] 

ASF GitHub Bot commented on ARROW-2121:
---

wesm closed pull request #1581: ARROW-2121: [Python] Handle object arrays 
directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/README-benchmarks.md b/python/README-benchmarks.md
index 3fecb35cb..60fa88f4a 100644
--- a/python/README-benchmarks.md
+++ b/python/README-benchmarks.md
@@ -41,8 +41,6 @@ First you have to install ASV's development version:
 pip install git+https://github.com/airspeed-velocity/asv.git
 ```
 
-
-
 Then you need to set up a few environment variables:
 
 ```shell
diff --git a/python/benchmarks/convert_pandas.py 
b/python/benchmarks/convert_pandas.py
index c4a7a59cb..244b3dcc8 100644
--- a/python/benchmarks/convert_pandas.py
+++ b/python/benchmarks/convert_pandas.py
@@ -48,3 +48,23 @@ def setup(self, n, dtype):
 
 def time_to_series(self, n, dtype):
 self.arrow_data.to_pandas()
+
+
+class ZeroCopyPandasRead(object):
+
+def setup(self):
+# Transpose to make column-major
+values = np.random.randn(10, 10)
+
+df = pd.DataFrame(values.T)
+ctx = pa.default_serialization_context()
+
+self.serialized = ctx.serialize(df)
+self.as_buffer = self.serialized.to_buffer()
+self.as_components = self.serialized.to_components()
+
+def time_deserialize_from_buffer(self):
+pa.deserialize(self.as_buffer)
+
+def time_deserialize_from_components(self):
+pa.deserialize_components(self.as_components)
diff --git a/python/doc/source/ipc.rst b/python/doc/source/ipc.rst
index 9bf93ffe8..bce8b1ed1 100644
--- a/python/doc/source/ipc.rst
+++ b/python/doc/source/ipc.rst
@@ -317,9 +317,8 @@ An object can be reconstructed from its component-based 
representation using
 Serializing pandas Objects
 --
 
-We provide a serialization context that has optimized handling of pandas
-objects like ``DataFrame`` and ``Series``. This can be created with
-``pyarrow.pandas_serialization_context()``. Combined with component-based
+The default serialization context has optimized handling of pandas
+objects like ``DataFrame`` and ``Series``. Combined with component-based
 serialization above, this enables zero-copy transport of pandas DataFrame
 objects not containing any Python objects:
 
@@ -327,7 +326,7 @@ objects not containing any Python objects:
 
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
-   context = pa.pandas_serialization_context()
+   context = pa.default_serialization_context()
serialized_df = context.serialize(df)
df_components = serialized_df.to_components()
original_df = context.deserialize_components(df_components)
diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py
index d95954ed3..15a37ca10 100644
--- a/python/pyarrow/__init__.py
+++ b/python/pyarrow/__init__.py
@@ -125,7 +125,6 @@
 localfs = LocalFileSystem.get_instance()
 
 from pyarrow.serialization import (default_serialization_context,
-   pandas_serialization_context,
register_default_serialization_handlers,
register_torch_serialization_handlers)
 
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index e8fa83fe7..6d4bf5e78 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -27,7 +27,7 @@
 import six
 
 import pyarrow as pa
-from pyarrow.compat import PY2, zip_longest  # noqa
+from pyarrow.compat import builtin_pickle, PY2, zip_longest  # noqa
 
 
 def infer_dtype(column):
@@ -424,11 +424,19 @@ def dataframe_to_serialized_dict(frame):
 block_data.update(dictionary=values.categories,
   ordered=values.ordered)
 values = values.codes
-
 block_data.update(
 placement=block.mgr_locs.as_array,
 block=values
 )
+
+# If we are dealing with an object array, pickle it instead. Note that
+# we do not use isinstance here because _int.CategoricalBlock is a
+# subclass of _int.ObjectBlock.
+if type(block) == _int.ObjectBlock:
+block_data['object'] = None
+block_data['block'] = builtin_pickle.dumps(
+values, protocol=builtin_pickle.HIGHEST_PROTOCOL)
+
 blocks.append(block_data)
 
 return {
@@ -463,6 +471,9 @@ def _reconstruct_block(item):
 block = _int.make_block(block_arr, placement=placement,

[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369638#comment-16369638
 ] 

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays 
directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366845422
 
 
   Merging this, since the last Appveyor build had passed




> Consider special casing object arrays in pandas serializers.
> 
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Resolved] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2121.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1581
[https://github.com/apache/arrow/pull/1581]

> Consider special casing object arrays in pandas serializers.
> 
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369616#comment-16369616
 ] 

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on issue #1581: ARROW-2121: [Python] Handle object 
arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366835235
 
 
   Thanks @wesm I *think* I've enabled it now.






[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-19 Thread Atul Dambalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369612#comment-16369612
 ] 

Atul Dambalkar commented on ARROW-1780:
---

Based on the above comments, I will update the API with the necessary parameters. It 
will be more or less like pagination.

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Priority: Major
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with usual 
> performance benefits. The utility will be very much similar to C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" as 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from RDBMS and convert the data into Arrow 
> objects/structures, so from that perspective it reads data from RDBMS. 
> Whether the utility can also push Arrow objects to RDBMS is something that 
> needs to be discussed and is out of scope for this utility for now. 
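The row-wise-to-columnar conversion the description points to can be sketched independently of JDBC; `rows_to_columns` and its arguments are illustrative names, not part of the planned adapter:

```python
# Minimal sketch of row-wise -> columnar transposition, the core step of the
# proposed adapter. In the real utility the rows would come from a JDBC
# ResultSet and the columns would be Arrow vectors; plain lists stand in here.

def rows_to_columns(rows, column_names):
    """Transpose an iterable of row tuples into per-column lists."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns
```

For example, `rows_to_columns([(1, 'a'), (2, 'b')], ['id', 'name'])` yields `{'id': [1, 2], 'name': ['a', 'b']}`.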





[jira] [Created] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2184:
---

 Summary: [C++] Add static ctor for FileOutputStream returning 
shared_ptr to base OutputStream
 Key: ARROW-2184
 URL: https://issues.apache.org/jira/browse/ARROW-2184
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


It would be useful for most IO ctors to return pointers to the base interface 
that they implement rather than the subclass. Whether we deprecate the current 
ones will be decided on a case-by-case basis.





[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369606#comment-16369606
 ] 

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays 
directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366831288
 
 
   @robertnishihara would you mind enabling appveyor on your fork when you have 
a chance? 




[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369605#comment-16369605
 ] 

Wes McKinney commented on ARROW-2142:
-

It sounds like we will need to write a function that combines a sequence of 
chunked arrays into a struct, where each of the arrays possibly has a different 
chunked layout. So something like

{{NestChunkedArrays(fields, chunked_arrays, )}} 

(or some other such name; this operation is actually kind of hard to name). The 
result would be another {{ChunkedArray}}. The implementation will be similar to 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/table.h#L223, where 
we convert possibly chunked columns into a sequence of record batches, each of 
whose fields is non-chunked.
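The chunk-alignment step behind such an operation can be sketched in plain Python (the function names and the list-of-lengths representation are assumptions for illustration, not Arrow's API): every offset at which any input column has a chunk break becomes a boundary, and slicing all columns at those boundaries yields identically laid-out, non-chunked pieces that can then be nested.

```python
# Sketch of aligning differently-chunked columns to common boundaries,
# assuming each column's chunk layout is given as a list of chunk lengths.

def common_boundaries(chunk_layouts):
    """Collect every offset at which any column has a chunk break."""
    boundaries = set()
    for lengths in chunk_layouts:
        offset = 0
        for n in lengths:
            offset += n
            boundaries.add(offset)
    return sorted(boundaries)

def aligned_slices(chunk_layouts):
    """Yield (start, stop) ranges; slicing every column to these ranges
    produces chunks with identical layouts, ready to be nested."""
    start = 0
    for stop in common_boundaries(chunk_layouts):
        yield (start, stop)
        start = stop
```

For layouts `[4, 4]` and `[2, 6]` this yields the slices `(0, 2)`, `(2, 4)`, `(4, 8)`.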

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement  conversion.
> {code}





[jira] [Commented] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369597#comment-16369597
 ] 

ASF GitHub Bot commented on ARROW-2179:
---

wesm commented on issue #1631: ARROW-2179: [C++] Install omitted headers in 
arrow/util
URL: https://github.com/apache/arrow/pull/1631#issuecomment-366824413
 
 
Done. This solution won't work for other directories that have some headers 
which should not be installed. See 
https://issues.apache.org/jira/browse/ARROW-2183




> [C++] arrow/util/io-util.h missing from libarrow-dev
> 
>
> Key: ARROW-2179
> URL: https://issues.apache.org/jira/browse/ARROW-2179
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Rares Vernica
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{arrow/util/io-util.h}} is missing from the {{libarrow-dev}} package 
> (ubuntu/trusty): 
> {code:java}
> > ls -1 /usr/include/arrow/util/
> bit-stream-utils.h
> bit-util.h
> bpacking.h
> compiler-util.h
> compression.h
> compression_brotli.h
> compression_lz4.h
> compression_snappy.h
> compression_zlib.h
> compression_zstd.h
> cpu-info.h
> decimal.h
> hash-util.h
> hash.h
> key_value_metadata.h
> logging.h
> macros.h
> parallel.h
> rle-encoding.h
> sse-util.h
> stl.h
> type_traits.h
> variant
> variant.h
> visibility.h
> {code}
> {code:java}
> > apt-cache show libarrow-dev
> Package: libarrow-dev
> Architecture: amd64
> Version: 0.8.0-2
> Multi-Arch: same
> Priority: optional
> Section: libdevel
> Source: apache-arrow
> Maintainer: Kouhei Sutou 
> Installed-Size: 5696
> Depends: libarrow0 (= 0.8.0-2)
> Filename: pool/trusty/universe/a/apache-arrow/libarrow-dev_0.8.0-2_amd64.deb
> Size: 602716
> MD5sum: de5f2bfafd90ff29e4b192f4e5d26605
> SHA1: e3d9146b30f07c07b62f8bdf9f779d0ee5d05a75
> SHA256: 30a89b2ac6845998f22434e660b1a7c9d91dc8b2ba947e1f4333b3cf74c69982
> SHA512: 
> 99f511bee6645a68708848a58b4eba669a2ec8c45fb411c56ed2c920d3ff34552c77821eff7e428c886d16e450bdd25cc4e67597972f77a4255f302a56d1eac8
> Homepage: https://arrow.apache.org/
> Description: Apache Arrow is a data processing library for analysis
>  .
>  This package provides header files.
> Description-md5: e4855d5dbadacb872bf8c4ca67f624e3
> {code}
>  





[jira] [Created] (ARROW-2183) [C++] Add helper CMake function for globbing the right header files

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2183:
---

 Summary: [C++] Add helper CMake function for globbing the right 
header files 
 Key: ARROW-2183
 URL: https://issues.apache.org/jira/browse/ARROW-2183
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


Brought up by discussion in https://github.com/apache/arrow/pull/1631 on 
ARROW-2179. We should collect header files but not install ones matching 
particular patterns that mark non-public headers, like {{-internal}}.
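The actual helper would be a CMake function; the filtering rule itself can be sketched in Python (the pattern list here is an assumption for illustration, not Arrow's decided convention):

```python
import fnmatch

# Hypothetical patterns marking non-public headers; the real list would be
# settled in the CMake helper.
NON_PUBLIC_PATTERNS = ["*-internal.h", "*_internal.h"]

def public_headers(headers):
    """Keep only headers that match none of the non-public patterns."""
    return [h for h in headers
            if not any(fnmatch.fnmatch(h, p) for p in NON_PUBLIC_PATTERNS)]
```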





[jira] [Commented] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369595#comment-16369595
 ] 

ASF GitHub Bot commented on ARROW-2179:
---

wesm commented on a change in pull request #1631: ARROW-2179: [C++] Install 
omitted headers in arrow/util
URL: https://github.com/apache/arrow/pull/1631#discussion_r169187408
 
 

 ##
 File path: cpp/src/arrow/util/CMakeLists.txt
 ##
 @@ -33,15 +33,17 @@ install(FILES
   compression_zstd.h
   cpu-info.h
   decimal.h
-  hash-util.h
   hash.h
+  hash-util.h
+  io-util.h
 
 Review comment:
   Yeah, that would be better, taking a look




[jira] [Created] (ARROW-2182) [Python] ASV benchmark setup does not account for C++ library changing

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2182:
---

 Summary: [Python] ASV benchmark setup does not account for C++ 
library changing
 Key: ARROW-2182
 URL: https://issues.apache.org/jira/browse/ARROW-2182
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See https://github.com/apache/arrow/blob/master/python/README-benchmarks.md

Perhaps we could create a helper script that will run all the currently-defined 
benchmarks for a specific commit, and ensure that we are running against 
pristine, up-to-date release builds of Arrow (and any other dependencies, like 
parquet-cpp) at that commit? 

cc [~pitrou]
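A helper script along those lines might look like the following sketch; the build command, paths, and the {{asv run}} range are assumptions for illustration, not an existing Arrow script:

```python
import subprocess

def benchmark_commands(commit):
    """Return, in order, the commands a per-commit benchmark run would use:
    check out the commit, rebuild/install the C++ library so the Python
    bindings link against a fresh build, then benchmark just that commit."""
    return [
        ["git", "checkout", commit],
        ["cmake", "--build", "cpp/build", "--target", "install"],
        ["asv", "run", commit + "^!"],
    ]

def run_benchmarks(commit):
    """Execute each step, failing fast on the first error."""
    for cmd in benchmark_commands(commit):
        subprocess.check_call(cmd)
```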





[jira] [Commented] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369591#comment-16369591
 ] 

ASF GitHub Bot commented on ARROW-2179:
---

cpcloud commented on a change in pull request #1631: ARROW-2179: [C++] Install 
omitted headers in arrow/util
URL: https://github.com/apache/arrow/pull/1631#discussion_r169186014
 
 

 ##
 File path: cpp/src/arrow/util/CMakeLists.txt
 ##
 @@ -33,15 +33,17 @@ install(FILES
   compression_zstd.h
   cpu-info.h
   decimal.h
-  hash-util.h
   hash.h
+  hash-util.h
+  io-util.h
 
 Review comment:
   Looks like `memory.h` is also missing




[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369589#comment-16369589
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on issue #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#issuecomment-366821392
 
 
   @pitrou I'm now allowing truncation if there's no loss of data (i.e., 
division of the underlying integer by the change in scale has no remainder, if 
the change in scale is negative) and added a test for overflow.
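The lossless-truncation rule described above can be sketched with plain integers standing in for the 128-bit decimal storage (function and argument names are illustrative, not the C++ API):

```python
def rescale(unscaled, delta_scale):
    """Rescale a decimal's underlying integer by delta_scale digits.
    A positive delta appends zeros; a negative delta is only allowed
    when the dropped digits are all zero (no loss of data)."""
    if delta_scale >= 0:
        return unscaled * 10 ** delta_scale
    quotient, remainder = divmod(unscaled, 10 ** -delta_scale)
    if remainder:
        raise ValueError("rescale would lose data")
    return quotient
```

So `rescale(1230, -1)` returns `123` (exact), while `rescale(1234, -1)` raises.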




> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> {code}
> I searched for open github and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 using python 2.7.14 
> installed via Homebrew





[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369590#comment-16369590
 ] 

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays 
directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366821860
 
 
   Sorry for the delay, looking now, and may as well add a benchmark for 
zero-copy DataFrame while I'm at it




[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369588#comment-16369588
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on issue #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#issuecomment-366821392
 
 
   @pitrou I'm now allowing truncation if there's no loss of data (i.e., 
division of the underlying integer by the change in scale has no remainder) and 
added a test for overflow.




[jira] [Created] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2181:
---

 Summary: [Python] Add concat_tables to API reference, add 
documentation on use
 Key: ARROW-2181
 URL: https://issues.apache.org/jira/browse/ARROW-2181
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This omission of documentation was mentioned on the mailing list on February 
13. The documentation should illustrate the contrast between 
{{Table.from_batches}} and {{concat_tables}}.





[jira] [Updated] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2179:
--
Labels: pull-request-available  (was: )



[jira] [Commented] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369582#comment-16369582
 ] 

ASF GitHub Bot commented on ARROW-2179:
---

wesm opened a new pull request #1631: ARROW-2179: [C++] Install omitted headers 
in arrow/util
URL: https://github.com/apache/arrow/pull/1631
 
 
   




[jira] [Assigned] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2179:
---

Assignee: Wes McKinney



[jira] [Updated] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2175:
--
Labels: pull-request-available  (was: )

> [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI
> ---
>
> Key: ARROW-2175
> URL: https://issues.apache.org/jira/browse/ARROW-2175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
> related to upstream changes in Parquet





[jira] [Commented] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369577#comment-16369577
 ] 

ASF GitHub Bot commented on ARROW-2175:
---

wesm opened a new pull request #1630: ARROW-2175: [Python] Install Arrow 
libraries in Travis CI builds when only Python directory is affected
URL: https://github.com/apache/arrow/pull/1630
 
 
   




[jira] [Assigned] (ARROW-2175) [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2175:
---

Assignee: Wes McKinney

> [Python] arrow_ep build is triggering during parquet-cpp build in Travis CI
> ---
>
> Key: ARROW-2175
> URL: https://issues.apache.org/jira/browse/ARROW-2175
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see e.g. https://travis-ci.org/apache/arrow/jobs/342781531#L5546. This may be 
> related to upstream changes in Parquet





[jira] [Created] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2180:
---

 Summary: [C++] Remove APIs deprecated in 0.8.0 release
 Key: ARROW-2180
 URL: https://issues.apache.org/jira/browse/ARROW-2180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0








[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369560#comment-16369560
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169178775
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -854,22 +854,56 @@ static const Decimal128 ScaleMultipliers[] = {
 Decimal128("10"),
 Decimal128("100")};
 
+static bool RescaleWouldCauseDataLoss(const Decimal128& value, int32_t 
delta_scale,
+  int32_t abs_delta_scale) {
+  if (delta_scale < 0) {
+Decimal128 remainder;
+Decimal128 result;
+Decimal128 multiplier(ScaleMultipliers[abs_delta_scale]);
+DCHECK_NE(multiplier, 0);
+Status status = value.Divide(multiplier, &result, &remainder);
+DCHECK(status.ok()) << status.message();
+return remainder != 0;
+  }
+  return false;
+}
+
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
   DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(original_scale, new_scale) << "original_scale == new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
   const int32_t abs_delta_scale = std::abs(delta_scale);
+
   DCHECK_GE(abs_delta_scale, 1);
   DCHECK_LE(abs_delta_scale, 38);
 
+  // Fail if rescaling would lose information
+  const bool rescale_would_cause_data_loss =
+  RescaleWouldCauseDataLoss(*this, delta_scale, abs_delta_scale);
+  if (delta_scale < 0 && rescale_would_cause_data_loss) {
+std::stringstream buf;
+buf << "Rescaling decimal value from original scale of " << original_scale
+<< " to new scale of " << new_scale << " would truncate the value";
+return Status::Invalid(buf.str());
+  }
+
   const Decimal128 scale_multiplier = ScaleMultipliers[abs_delta_scale];
-  const Decimal128 result = *this * scale_multiplier;
 
-  if (ARROW_PREDICT_FALSE(result < *this)) {
+  Decimal128 result(*this);
+
+  if (delta_scale < 0) {
+result /= scale_multiplier;
+  } else {
+result *= scale_multiplier;
+  }
+
+  // Fail if we overflow
+  if (ARROW_PREDICT_FALSE(result < *this && rescale_would_cause_data_loss)) {
 
 Review comment:
   This isn't handling overflow properly. I'll add a test.




> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> {code}
> I searched for open github and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 using python 2.7.14 
> installed via Homebrew



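The rescale logic under review above can be sketched in plain Python (illustrative names, not the Arrow API; a Decimal128 with scale s stores value * 10**s as a 128-bit integer):

```python
INT128_MIN, INT128_MAX = -(1 << 127), (1 << 127) - 1

def rescale(unscaled: int, original_scale: int, new_scale: int) -> int:
    """Rescale a scaled-integer decimal, rejecting truncation and overflow.

    Moving to a larger scale multiplies by a power of ten; moving to a
    smaller scale divides by one and must not lose digits.
    """
    delta = new_scale - original_scale
    multiplier = 10 ** abs(delta)
    if delta < 0:
        quotient, remainder = divmod(unscaled, multiplier)
        if remainder != 0:
            # Mirrors the RescaleWouldCauseDataLoss check in the diff above.
            raise ValueError(
                f"rescaling from scale {original_scale} to {new_scale} "
                "would truncate the value")
        return quotient
    result = unscaled * multiplier
    if not INT128_MIN <= result <= INT128_MAX:
        # An explicit range check, instead of the `result < *this`
        # heuristic questioned in the review.
        raise OverflowError("rescaled value does not fit in 128 bits")
    return result
```

For example, `rescale(123, 2, 4)` returns 12300, while `rescale(1201, 2, 1)` raises because 12.01 cannot be represented exactly at scale 1.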


[jira] [Commented] (ARROW-2172) [Python] Incorrect conversion from Numpy array when stride % itemsize != 0

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369554#comment-16369554
 ] 

ASF GitHub Bot commented on ARROW-2172:
---

cpcloud commented on a change in pull request #1628: ARROW-2172: [C++/Python] 
Fix converting from Numpy array with non-natural stride
URL: https://github.com/apache/arrow/pull/1628#discussion_r169177655
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -554,12 +554,22 @@ Status StaticCastBuffer(const Buffer& input, const 
int64_t length, MemoryPool* p
   return Status::OK();
 }
 
-template <typename T, typename T2>
-void CopyStrided(T* input_data, int64_t length, int64_t stride, T2* 
output_data) {
+template <typename T>
+void CopyStridedBytewise(int8_t* input_data, int64_t length, int64_t stride,
+ T* output_data) {
+  // Passing input_data as non-const is a concession to PyObject*
+  for (int64_t i = 0; i < length; ++i) {
+memcpy(output_data + i, input_data, sizeof(T));
+input_data += stride;
 
 Review comment:
   Is `stride >= sizeof(T)` guaranteed?




> [Python] Incorrect conversion from Numpy array when stride % itemsize != 0
> --
>
> Key: ARROW-2172
> URL: https://issues.apache.org/jira/browse/ARROW-2172
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> In the example below, the input array has a stride that's not a multiple of 
> the itemsize:
> {code:python}
> >>> data = np.array([(42, True), (43, False)],
> ...:dtype=[('x', np.int32), ('y', np.bool_)])
> ...:
> ...:
> >>> data['x']
> array([42, 43], dtype=int32)
> >>> pa.array(data['x'], type=pa.int32())
> 
> [
>   42,
>   11009
> ]
> {code}



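The issue above can be reproduced and checked in a few lines of NumPy; the bytewise copy below mirrors the approach of the fix under review, not Arrow's actual code:

```python
import struct
import numpy as np

# A packed structured array: each record is int32 (4 bytes) + bool (1 byte),
# so the 'x' column has a 5-byte stride -- not a multiple of its 4-byte itemsize.
data = np.array([(42, True), (43, False)],
                dtype=[('x', np.int32), ('y', np.bool_)])
col = data['x']
assert col.strides[0] == 5 and col.itemsize == 4
assert col.strides[0] % col.itemsize != 0

def copy_strided_bytewise(buf, length, stride, fmt='=i'):
    # Read itemsize bytes at each stride offset, like memcpy in a loop;
    # '=i' is a native-order 4-byte int with no alignment padding.
    return [struct.unpack_from(fmt, buf, i * stride)[0] for i in range(length)]

assert copy_strided_bytewise(data.tobytes(), 2, 5) == [42, 43]
```

An element-typed pointer walk cannot express a 5-byte step over int32 values, which is why the fix drops to a byte pointer.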


[jira] [Commented] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369546#comment-16369546
 ] 

ASF GitHub Bot commented on ARROW-2153:
---

cpcloud commented on issue #1618: ARROW-2153/ARROW-2160: [C++/Python]  Fix 
decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-366808695
 
 
   In gcc 4.8, that is. In 4.9 it may be implemented, but I don't think it supports 
named capture groups.




> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases clearly we can write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}} but during arithmetical operations inside an 
> application the exponential notation can be produced out of control (it is 
> actually the _normalized_ form of the decimal number) plus for some values 
> the exponential notation is the only form expressing the significance so this 
> should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}



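The normalized exponential form the reporter describes, and the workaround quoted from the Python docs, can be seen directly with the stdlib decimal module (plain Python, no pyarrow):

```python
from decimal import Decimal

d = Decimal('2E+1')
assert str(d) == '2E+1'        # normalized form keeps the exponent...
assert d == Decimal('20')      # ...but the value is the same
sign, digits, exponent = d.as_tuple()
assert digits == (2,) and exponent == 1   # one significant digit

def remove_exponent(d):
    # Workaround from the Python decimal docs: drops the exponent when the
    # value is integral, at the cost of the significance information.
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()

assert str(remove_exponent(Decimal('2E+1'))) == '20'
assert str(remove_exponent(Decimal('1.100'))) == '1.1'
```

A parser that accepts only digits and a decimal point rejects `'2E+1'`, which is the failure shown in the traceback below.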


[jira] [Commented] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369542#comment-16369542
 ] 

ASF GitHub Bot commented on ARROW-2153:
---

cpcloud commented on issue #1618: ARROW-2153/ARROW-2160: [C++/Python]  Fix 
decimal precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-366808528
 
 
   Yep, I believe `std::regex_match` is implemented as `return false;` :(




> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases clearly we can write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}} but during arithmetical operations inside an 
> application the exponential notation can be produced out of control (it is 
> actually the _normalized_ form of the decimal number) plus for some values 
> the exponential notation is the only form expressing the significance so this 
> should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Commented] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369540#comment-16369540
 ] 

ASF GitHub Bot commented on ARROW-2153:
---

wesm commented on issue #1618: ARROW-2153/ARROW-2160: [C++/Python]  Fix decimal 
precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-366808204
 
 
   `std::regex` is totally broken in gcc 4.8.x (what we're using for conda/pip 
releases) AFAIK, so using `<regex>` isn't even an option right now. When we get 
past gcc 4.8 it might be nice to use the STL regexen




> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases clearly we can write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}} but during arithmetical operations inside an 
> application the exponential notation can be produced out of control (it is 
> actually the _normalized_ form of the decimal number) plus for some values 
> the exponential notation is the only form expressing the significance so this 
> should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Updated] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2153:
--
Labels: pull-request-available  (was: )

> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases clearly we can write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}} but during arithmetical operations inside an 
> application the exponential notation can be produced out of control (it is 
> actually the _normalized_ form of the decimal number) plus for some values 
> the exponential notation is the only form expressing the significance so this 
> should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}





[jira] [Updated] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2179:

Fix Version/s: 0.9.0

> [C++] arrow/util/io-util.h missing from libarrow-dev
> 
>
> Key: ARROW-2179
> URL: https://issues.apache.org/jira/browse/ARROW-2179
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Rares Vernica
>Priority: Minor
> Fix For: 0.9.0
>
>
> {{arrow/util/io-util.h}} is missing from the {{libarrow-dev}} package 
> (ubuntu/trusty): 
> {code:java}
> > ls -1 /usr/include/arrow/util/
> bit-stream-utils.h
> bit-util.h
> bpacking.h
> compiler-util.h
> compression.h
> compression_brotli.h
> compression_lz4.h
> compression_snappy.h
> compression_zlib.h
> compression_zstd.h
> cpu-info.h
> decimal.h
> hash-util.h
> hash.h
> key_value_metadata.h
> logging.h
> macros.h
> parallel.h
> rle-encoding.h
> sse-util.h
> stl.h
> type_traits.h
> variant
> variant.h
> visibility.h
> {code}
> {code:java}
> > apt-cache show libarrow-dev
> Package: libarrow-dev
> Architecture: amd64
> Version: 0.8.0-2
> Multi-Arch: same
> Priority: optional
> Section: libdevel
> Source: apache-arrow
> Maintainer: Kouhei Sutou 
> Installed-Size: 5696
> Depends: libarrow0 (= 0.8.0-2)
> Filename: pool/trusty/universe/a/apache-arrow/libarrow-dev_0.8.0-2_amd64.deb
> Size: 602716
> MD5sum: de5f2bfafd90ff29e4b192f4e5d26605
> SHA1: e3d9146b30f07c07b62f8bdf9f779d0ee5d05a75
> SHA256: 30a89b2ac6845998f22434e660b1a7c9d91dc8b2ba947e1f4333b3cf74c69982
> SHA512: 
> 99f511bee6645a68708848a58b4eba669a2ec8c45fb411c56ed2c920d3ff34552c77821eff7e428c886d16e450bdd25cc4e67597972f77a4255f302a56d1eac8
> Homepage: https://arrow.apache.org/
> Description: Apache Arrow is a data processing library for analysis
>  .
>  This package provides header files.
> Description-md5: e4855d5dbadacb872bf8c4ca67f624e3
> {code}
>  





[jira] [Commented] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369535#comment-16369535
 ] 

Wes McKinney commented on ARROW-2179:
-

You're right, it's not being installed in {{util/CMakeLists.txt}}

> [C++] arrow/util/io-util.h missing from libarrow-dev
> 
>
> Key: ARROW-2179
> URL: https://issues.apache.org/jira/browse/ARROW-2179
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Rares Vernica
>Priority: Minor
> Fix For: 0.9.0
>
>
> {{arrow/util/io-util.h}} is missing from the {{libarrow-dev}} package 
> (ubuntu/trusty): 
> {code:java}
> > ls -1 /usr/include/arrow/util/
> bit-stream-utils.h
> bit-util.h
> bpacking.h
> compiler-util.h
> compression.h
> compression_brotli.h
> compression_lz4.h
> compression_snappy.h
> compression_zlib.h
> compression_zstd.h
> cpu-info.h
> decimal.h
> hash-util.h
> hash.h
> key_value_metadata.h
> logging.h
> macros.h
> parallel.h
> rle-encoding.h
> sse-util.h
> stl.h
> type_traits.h
> variant
> variant.h
> visibility.h
> {code}
> {code:java}
> > apt-cache show libarrow-dev
> Package: libarrow-dev
> Architecture: amd64
> Version: 0.8.0-2
> Multi-Arch: same
> Priority: optional
> Section: libdevel
> Source: apache-arrow
> Maintainer: Kouhei Sutou 
> Installed-Size: 5696
> Depends: libarrow0 (= 0.8.0-2)
> Filename: pool/trusty/universe/a/apache-arrow/libarrow-dev_0.8.0-2_amd64.deb
> Size: 602716
> MD5sum: de5f2bfafd90ff29e4b192f4e5d26605
> SHA1: e3d9146b30f07c07b62f8bdf9f779d0ee5d05a75
> SHA256: 30a89b2ac6845998f22434e660b1a7c9d91dc8b2ba947e1f4333b3cf74c69982
> SHA512: 
> 99f511bee6645a68708848a58b4eba669a2ec8c45fb411c56ed2c920d3ff34552c77821eff7e428c886d16e450bdd25cc4e67597972f77a4255f302a56d1eac8
> Homepage: https://arrow.apache.org/
> Description: Apache Arrow is a data processing library for analysis
>  .
>  This package provides header files.
> Description-md5: e4855d5dbadacb872bf8c4ca67f624e3
> {code}
>  





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369524#comment-16369524
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169171377
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -167,5 +167,14 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, ));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescale) {
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.0134"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
 
 Review comment:
   I see :) Sure




> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> {code}
> I searched for open github and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 using python 2.7.14 
> installed via Homebrew



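The factor-of-100 symptom is consistent with the sign flip fixed in the diff earlier in this thread: with delta_scale computed as original_scale - new_scale, a down-scale multiplies instead of divides. A plain-Python illustration (not the Arrow code itself):

```python
def rescale(unscaled, original_scale, new_scale, buggy=False):
    # buggy=True reproduces the old sign of delta_scale
    # (original_scale - new_scale instead of new_scale - original_scale).
    delta = (original_scale - new_scale) if buggy else (new_scale - original_scale)
    return unscaled * 10 ** delta if delta >= 0 else unscaled // 10 ** -delta

# Decimal('1.234') has unscaled value 1234 at scale 3; the target scale is 2.
assert rescale(1234, 3, 2) == 123               # correct: truncates to 1.23
assert rescale(1234, 3, 2, buggy=True) == 12340  # reads back at scale 2 as 123.40
# The buggy result is 10**(2 * |delta|) == 100 times the correct magnitude.
```

With |delta| = 1, multiplying instead of dividing shifts the value by two decimal places in total, matching the constant factor of 100 the reporter observed regardless of scale.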


[jira] [Created] (ARROW-2179) [C++] arrow/util/io-util.h missing from libarrow-dev

2018-02-19 Thread Rares Vernica (JIRA)
Rares Vernica created ARROW-2179:


 Summary: [C++] arrow/util/io-util.h missing from libarrow-dev
 Key: ARROW-2179
 URL: https://issues.apache.org/jira/browse/ARROW-2179
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Rares Vernica


{{arrow/util/io-util.h}} is missing from the {{libarrow-dev}} package 
(ubuntu/trusty): 
{code:java}
> ls -1 /usr/include/arrow/util/
bit-stream-utils.h
bit-util.h
bpacking.h
compiler-util.h
compression.h
compression_brotli.h
compression_lz4.h
compression_snappy.h
compression_zlib.h
compression_zstd.h
cpu-info.h
decimal.h
hash-util.h
hash.h
key_value_metadata.h
logging.h
macros.h
parallel.h
rle-encoding.h
sse-util.h
stl.h
type_traits.h
variant
variant.h
visibility.h
{code}

{code:java}
> apt-cache show libarrow-dev
Package: libarrow-dev
Architecture: amd64
Version: 0.8.0-2
Multi-Arch: same
Priority: optional
Section: libdevel
Source: apache-arrow
Maintainer: Kouhei Sutou 
Installed-Size: 5696
Depends: libarrow0 (= 0.8.0-2)
Filename: pool/trusty/universe/a/apache-arrow/libarrow-dev_0.8.0-2_amd64.deb
Size: 602716
MD5sum: de5f2bfafd90ff29e4b192f4e5d26605
SHA1: e3d9146b30f07c07b62f8bdf9f779d0ee5d05a75
SHA256: 30a89b2ac6845998f22434e660b1a7c9d91dc8b2ba947e1f4333b3cf74c69982
SHA512: 
99f511bee6645a68708848a58b4eba669a2ec8c45fb411c56ed2c920d3ff34552c77821eff7e428c886d16e450bdd25cc4e67597972f77a4255f302a56d1eac8
Homepage: https://arrow.apache.org/
Description: Apache Arrow is a data processing library for analysis
 .
 This package provides header files.
Description-md5: e4855d5dbadacb872bf8c4ca67f624e3
{code}
 





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369516#comment-16369516
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

pitrou commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169170397
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -167,5 +167,14 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, ));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescale) {
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.0134"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
 
 Review comment:
   Yes, I meant it would fail because it refuses to truncate. I'm just 
suggesting to add a comment to make it clearer to the reader.




> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> I searched for open github and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 using python 2.7.14 
> installed via Homebrew
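The failure mode above can be checked without pyarrow at all. Below is a hedged sketch (standard library only; `fits` is a hypothetical helper, not pyarrow API) of the validation the report asks for: reject a value whose digits do not fit the declared decimal128(precision, scale).

```python
from decimal import Decimal

def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Sketch: True if `value` can be stored as decimal128(precision, scale)
    without dropping any non-zero digit. Not pyarrow's actual check."""
    _, digits, exponent = value.as_tuple()
    value_scale = max(-exponent, 0)  # fractional digits the value carries
    if value_scale > scale:
        # Rescaling drops digits; only safe when all dropped digits are zero.
        dropped = value_scale - scale
        if int(value.scaleb(value_scale)) % 10 ** dropped != 0:
            return False
    int_digits = max(len(digits) + exponent, 0)  # digits left of the point
    return int_digits + scale <= precision

print(fits(Decimal("1.23"), 10, 2))   # fits exactly
print(fits(Decimal("1.234"), 10, 2))  # would silently truncate -> reject
```

With such a check in front of the conversion, the `1.234` case would raise instead of producing `123.40`.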





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369515#comment-16369515
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

pitrou commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169170314
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -857,19 +857,31 @@ static const Decimal128 ScaleMultipliers[] = {
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
   DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(original_scale, new_scale) << "original_scale == new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
+
+  // Fail if rescaling would truncate
 
 Review comment:
   Yes.






[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369513#comment-16369513
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169170225
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -857,19 +857,31 @@ static const Decimal128 ScaleMultipliers[] = {
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
   DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(original_scale, new_scale) << "original_scale == new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
+
+  // Fail if rescaling would truncate
 
 Review comment:
   Meaning something like `decimal128(10, 2)` is the requested type but the 
input is something like `1.000` (scale of 3, but all zeros).






[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369514#comment-16369514
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169170225
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -857,19 +857,31 @@ static const Decimal128 ScaleMultipliers[] = {
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
   DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(original_scale, new_scale) << "original_scale == new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
+
+  // Fail if rescaling would truncate
 
 Review comment:
   Meaning something like `decimal128(10, 2)` is the requested type but the 
input is something like `1.000` (scale of 3, but all zeros)?






[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369511#comment-16369511
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169170046
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -167,5 +167,14 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, ));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescale) {
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.0134"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
 
 Review comment:
   No, this would fail. I'm asserting that it returns an `Invalid` status code.






[jira] [Updated] (ARROW-2153) [C++] Decimal conversion not working for exponential notation

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2153:
-
Summary: [C++] Decimal conversion not working for exponential notation  
(was: decimal conversion not working for exponential notation)

> [C++] Decimal conversion not working for exponential notation
> -
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases we can clearly write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}}, but during arithmetic operations inside an 
> application the exponential notation can be produced outside our control (it is 
> actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form that expresses the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using following transformation but that's only possible when the significance 
> information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
>     return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}
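Checking the suggested helper against the values from this report (hypothetical usage, standard library only):

```python
from decimal import Decimal

def remove_exponent(d: Decimal) -> Decimal:
    # The helper from the Python `decimal` docs quoted above.
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()

print(remove_exponent(Decimal("2E+1")))  # Decimal('20'): exponent removed
print(remove_exponent(Decimal("1.1")))   # unchanged: Decimal('1.1')
```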





[jira] [Commented] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369503#comment-16369503
 ] 

ASF GitHub Bot commented on ARROW-2160:
---

cpcloud commented on issue #1618: ARROW-2160: [C++/Python]  Fix decimal 
precision inference
URL: https://github.com/apache/arrow/pull/1618#issuecomment-366798968
 
 
   @wesm @pitrou I reintroduced boost regex for parsing the decimal number, to 
address ARROW-2153. We were previously using a fairly straightforward algorithm 
that implemented the regular expression by hand, but it didn't allow numbers 
like `1e1`. The code to match those would be very complex and hard to read if 
written by hand, so I decided to use boost regex. I didn't go with the STL 
regex library because it doesn't support named capture groups, which I think 
makes the code much more readable.
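For illustration only, here is roughly what a decimal-literal grammar with named capture groups looks like in Python's `re`; the actual pattern in the PR is boost::regex in C++ and may differ:

```python
import re

DECIMAL_LITERAL = re.compile(
    r"^(?P<sign>[+-]?)"             # optional sign
    r"(?P<whole>\d*)"               # digits before the point
    r"(?:\.(?P<frac>\d*))?"         # optional fractional digits
    r"(?:[eE](?P<exp>[+-]?\d+))?$"  # optional exponent, e.g. the `1e1` case
)

def parse_decimal(text: str) -> dict:
    match = DECIMAL_LITERAL.match(text)
    if match is None or not (match.group("whole") or match.group("frac")):
        raise ValueError(f"not a decimal literal: {text!r}")
    return match.groupdict()  # named groups keep the parsing code readable

print(parse_decimal("1e1"))
print(parse_decimal("-1.234"))
```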




> [C++/Python]  Fix decimal precision inference
> -
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks like Arrow is inferring the highest precision for a given column based on 
> the first cell and expecting the rest to fit in. I understand this is by design, 
> but from the point of view of pandas-arrow compatibility this is quite painful, 
> as pandas is more flexible (as demonstrated).
> What this means is that a user trying to pass a pandas {{DataFrame}} with 
> {{Decimal}} column(s) to an arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?
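The pre-scan the reporter describes fits in a few lines with the standard decimal module. This is a sketch of inferring one (precision, scale) wide enough for a whole column, not pyarrow's actual inference code:

```python
from decimal import Decimal

def infer_decimal_type(column) -> tuple:
    """Return (precision, scale) that covers every Decimal in `column`."""
    max_scale = 0
    max_int_digits = 0
    for value in column:
        _, digits, exponent = value.as_tuple()
        max_scale = max(max_scale, -exponent, 0)
        max_int_digits = max(max_int_digits, len(digits) + exponent)
    # Precision covers the widest integer part plus the widest fraction.
    return max(max_int_digits, 0) + max_scale, max_scale

print(infer_decimal_type([Decimal("0.1"), Decimal("0.01")]))  # (2, 2)
```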





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369494#comment-16369494
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

pitrou commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169168022
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -857,19 +857,31 @@ static const Decimal128 ScaleMultipliers[] = {
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
   DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(original_scale, new_scale) << "original_scale == new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
+
+  // Fail if rescaling would truncate
 
 Review comment:
   Is it possible that all truncated digits be zero?
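The rule under discussion: rescaling to a smaller scale is only an error when a dropped digit is non-zero, so `1.000` at scale 3 rescales cleanly to scale 2 while `1.234` does not. A rough Python model of that rule (assumed behavior, not the actual C++ `Decimal128::Rescale`):

```python
def rescale(unscaled: int, original_scale: int, new_scale: int) -> int:
    """Model on a plain non-negative unscaled integer: the stored value is
    unscaled * 10**-scale. Raises if truncation would lose non-zero digits."""
    delta_scale = new_scale - original_scale
    if delta_scale >= 0:
        # Growing the scale never loses information: multiply through.
        return unscaled * 10 ** delta_scale
    quotient, remainder = divmod(unscaled, 10 ** -delta_scale)
    if remainder != 0:
        raise ValueError("rescaling would truncate non-zero digits")
    return quotient

print(rescale(1000, 3, 2))  # 1.000 -> 1.00: all dropped digits are zero
# rescale(1234, 3, 2) would raise ValueError (the digit 4 would be lost)
```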






[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369492#comment-16369492
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

pitrou commented on a change in pull request #1619: ARROW-2162: [Python/C++] 
Decimal Values with too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#discussion_r169167969
 
 

 ##
 File path: cpp/src/arrow/python/python-test.cc
 ##
 @@ -167,5 +167,14 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, ));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescale) {
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.0134"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast(*type);
+  ASSERT_RAISES(Invalid, 
internal::DecimalFromPythonDecimal(python_decimal.obj(),
 
 Review comment:
   I guess this would truncate? Perhaps add a comment? :) 






[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369489#comment-16369489
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud commented on issue #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619#issuecomment-366796399
 
 
   @wesm @pitrou any comments here? otherwise, this is ready to go on my end






[jira] [Updated] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-19 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2162:
--
Labels: pull-request-available  (was: )



[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369463#comment-16369463
 ] 

ASF GitHub Bot commented on ARROW-1579:
---

BryanCutler commented on issue #1319: ARROW-1579: [Java] Adding containerized 
Spark Integration tests
URL: https://github.com/apache/arrow/pull/1319#issuecomment-366789122
 
 
   Thanks @wesm @xhochy and @felixcheung! Since it can sometimes take a while 
to get Spark updated, if we get to the point where this is ready to be put in 
the nightly builds, maybe I could submit a PR to patch Spark and we could 
configure the docker build to point to that.




> [Java] Add dockerized test setup to validate Spark integration
> --
>
> Key: ARROW-1579
> URL: https://issues.apache.org/jira/browse/ARROW-1579
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> cc [~bryanc] -- the goal of this will be to validate master-to-master to 
> catch any regressions in the Spark integration



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-19 Thread Atul Dambalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369459#comment-16369459
 ] 

Atul Dambalkar commented on ARROW-1780:
---

Comment from Jacques Nadeau on Slack channel -

Nice start @atul_dambalkar. It would be good to add the ability to set the 
amount of data to return per call, as opposed to trying to deplete the whole 
dataset in one call.

 

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Priority: Major
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with the usual 
> performance benefits. The utility will be very similar to the C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from the RDBMS and convert the data into Arrow 
> objects/structures, so from that perspective this will read data from the 
> RDBMS. Whether the utility can push Arrow objects to the RDBMS is something 
> that needs to be discussed, and is out of scope for this utility for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-19 Thread Atul Dambalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369457#comment-16369457
 ] 

Atul Dambalkar edited comment on ARROW-1780 at 2/19/18 7:51 PM:


Comments from Uwe Korn on Slack channel - 

My main plan was to make JDBC drivers accessible very fast from Python / Pandas 
programs. Currently, for most DBs you have the option to either use 
ODBC/Python-native drivers, which are quite often awful, or use JDBC ones, 
which carry a high cost of serialization between the JVM and the Python 
objects. By using Arrow, we should be able to use the good JDBC drivers from 
Python without the normal serialization overhead.

We’re looking at SQL engines that work on distributed filesystems at the 
moment (Apache Drill and Presto are the two best candidates), and the common 
pattern is that they have good JDBC drivers, but the other connectors are not 
as well maintained or are really slow. Currently, Presto is the one of biggest 
interest to me.

To me it seems that having a JDBC<->Arrow adapter already yields a significant 
performance improvement over the current situation. And it will also give the 
speedup independent of the underlying DB.

 


was (Author: atul_dambalkar):
Comments from Uwe Korn - 

My main plan was to make JDBC drivers accessible very fast from Python / Pandas 
programs. Currently, for most DBs you have the option to either use 
ODBC/Python-native drivers, which are quite often awful, or use JDBC ones, 
which carry a high cost of serialization between the JVM and the Python 
objects. By using Arrow, we should be able to use the good JDBC drivers from 
Python without the normal serialization overhead.

We’re looking at SQL engines that work on distributed filesystems at the 
moment (Apache Drill and Presto are the two best candidates), and the common 
pattern is that they have good JDBC drivers, but the other connectors are not 
as well maintained or are really slow. Currently, Presto is the one of biggest 
interest to me.

To me it seems that having a JDBC<->Arrow adapter already yields a significant 
performance improvement over the current situation. And it will also give the 
speedup independent of the underlying DB.

 

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Priority: Major
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with the usual 
> performance benefits. The utility will be very similar to the C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from the RDBMS and convert the data into Arrow 
> objects/structures, so from that perspective this will read data from the 
> RDBMS. Whether the utility can push Arrow objects to the RDBMS is something 
> that needs to be discussed, and is out of scope for this utility for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-19 Thread Atul Dambalkar (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369458#comment-16369458
 ] 

Atul Dambalkar commented on ARROW-1780:
---

I have put together a very basic interface for the JDBC Adapter - so far by 
forking Arrow 
(https://github.com/atuldambalkar/arrow/tree/master/java/adapter/jdbc). I had a 
brief discussion with Uwe earlier on this on Slack, so I wanted to get some 
more views on this and also not to redo work or overstep. At this time, I have 
one API in the adapter which can return Arrow Vector objects after executing a 
SQL query on the given JDBC connection object - VectorSchemaRoot 
sqlToArrow(Connection connection, String query).

One more possible interface could be to fetch a certain number of records from 
all the tables in the SQL database and build Arrow objects for them. The API 
can of course be implemented lazily, fetching only when the data for a 
particular table is requested.
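The proposed adapter itself is Java (`VectorSchemaRoot sqlToArrow(Connection, String)`), but the row-wise-to-columnar conversion it performs can be sketched in Python with stdlib sqlite3 standing in for a JDBC connection. The `sql_to_columns` name and table are illustrative, not part of any Arrow API:

```python
import sqlite3

def sql_to_columns(connection, query):
    """Run a query and pivot the row-wise result set into named
    columns (the shape Arrow vectors would take)."""
    cursor = connection.execute(query)
    names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    columns = list(zip(*rows)) if rows else [() for _ in names]
    return dict(zip(names, columns))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
print(sql_to_columns(conn, "SELECT id, name FROM t"))
# {'id': (1, 2), 'name': ('a', 'b')}
```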

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Priority: Major
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with usual 
> performance benefits. The utility will be very much similar to C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" as 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from RDBMS and covert the data into Arrow 
> objects/structures. So from that perspective this will Read data from RDBMS, 
> If the utility can push Arrow objects to RDBMS is something need to be 
> discussed and will be out of scope for this utility for now. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-19 Thread Atul Dambalkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Atul Dambalkar updated ARROW-1780:
--

Comments from Uwe Korn - 

My main plan was to make JDBC drivers accessible very fast from Python / Pandas 
programs. Currently, for most DBs you have the option to either use 
ODBC/Python-native drivers, which are quite often awful, or use JDBC ones, 
which carry a high cost of serialization between the JVM and the Python 
objects. By using Arrow, we should be able to use the good JDBC drivers from 
Python without the normal serialization overhead.

We’re looking at SQL engines that work on distributed filesystems at the 
moment (Apache Drill and Presto are the two best candidates), and the common 
pattern is that they have good JDBC drivers, but the other connectors are not 
as well maintained or are really slow. Currently, Presto is the one of biggest 
interest to me.

To me it seems that having a JDBC<->Arrow adapter already yields a significant 
performance improvement over the current situation. And it will also give the 
speedup independent of the underlying DB.

 

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Priority: Major
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with the usual 
> performance benefits. The utility will be very similar to the C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from the RDBMS and convert the data into Arrow 
> objects/structures, so from that perspective this will read data from the 
> RDBMS. Whether the utility can push Arrow objects to the RDBMS is something 
> that needs to be discussed, and is out of scope for this utility for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2159) [JS] Support custom predicates

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369436#comment-16369436
 ] 

ASF GitHub Bot commented on ARROW-2159:
---

TheNeuralBit closed pull request #1616: ARROW-2159: [JS] Support custom 
predicates
URL: https://github.com/apache/arrow/pull/1616
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/js/src/Arrow.externs.js b/js/src/Arrow.externs.js
index 21dca8be8..de1e65392 100644
--- a/js/src/Arrow.externs.js
+++ b/js/src/Arrow.externs.js
@@ -70,6 +70,7 @@ CountByResult.prototype.asJSON;
 
 var col = function () {};
 var lit = function () {};
+var custom = function () {};
 
 var Value = function() {};
 /** @type {?} */
@@ -738,4 +739,4 @@ VectorVisitor.prototype.visitInterval;
 /** @type {?} */
 VectorVisitor.prototype.visitFixedSizeList;
 /** @type {?} */
-VectorVisitor.prototype.visitMap;
\ No newline at end of file
+VectorVisitor.prototype.visitMap;
diff --git a/js/src/Arrow.ts b/js/src/Arrow.ts
index df37a8fb0..4a0a2ac6d 100644
--- a/js/src/Arrow.ts
+++ b/js/src/Arrow.ts
@@ -168,6 +168,7 @@ export namespace view {
 export namespace predicate {
 export import col = predicate_.col;
 export import lit = predicate_.lit;
+export import custom = predicate_.custom;
 
 export import Or = predicate_.Or;
 export import Col = predicate_.Col;
diff --git a/js/src/predicate.ts b/js/src/predicate.ts
index 981ffb166..b177b4fa7 100644
--- a/js/src/predicate.ts
+++ b/js/src/predicate.ts
@@ -222,5 +222,19 @@ export class GTeq extends ComparisonPredicate {
 }
 }
 
+export class CustomPredicate extends Predicate {
+constructor(private next: PredicateFunc, private bind_: (batch: 
RecordBatch) => void) {
+super();
+}
+
+bind(batch: RecordBatch) {
+this.bind_(batch);
+return this.next;
+}
+}
+
 export function lit(v: any): Value { return new Literal(v); }
 export function col(n: string): Col { return new Col(n); }
+export function custom(next: PredicateFunc, bind: (batch: RecordBatch) => 
void) {
+return new CustomPredicate(next, bind);
+}
diff --git a/js/test/unit/table-tests.ts b/js/test/unit/table-tests.ts
index ffcc8f477..8a433815d 100644
--- a/js/test/unit/table-tests.ts
+++ b/js/test/unit/table-tests.ts
@@ -15,11 +15,11 @@
 // specific language governing permissions and limitations
 // under the License.
 
-import Arrow from '../Arrow';
+import Arrow, { RecordBatch } from '../Arrow';
 
 const { predicate, Table } = Arrow;
 
-const { col, lit } = predicate;
+const { col, lit, custom } = predicate;
 
 const F32 = 0, I32 = 1, DICT = 2;
 const test_data = [
@@ -323,6 +323,7 @@ describe(`Table`, () => {
 expect(table.getColumnIndex('f32')).toEqual(F32);
 expect(table.getColumnIndex('dictionary')).toEqual(DICT);
 });
+let get_i32: (idx: number) => number, get_f32: (idx: number) => 
number;
 const filter_tests = [
 {
 name: `filter on f32 >= 0`,
@@ -364,6 +365,15 @@ describe(`Table`, () => {
 name: `filter on f32 <= i32`,
 filtered: table.filter(col('f32').lteq(col('i32'))),
 expected: values.filter((row) => row[F32] <= row[I32])
+}, {
+name: `filter on f32*i32 > 0 (custom predicate)`,
+filtered: table.filter(custom(
+(idx: number) => (get_f32(idx) * get_i32(idx) > 0),
+(batch: RecordBatch) => {
+get_f32 = col('f32').bind(batch);
+get_i32 = col('i32').bind(batch);
+})),
+expected: values.filter((row) => (row[F32] as number) * 
(row[I32] as number) > 0)
 }
 ];
 for (let this_test of filter_tests) {


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] Support custom predicates
> --
>
> Key: ARROW-2159
> URL: https://issues.apache.org/jira/browse/ARROW-2159
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.3.0
>
>
> Right now the 

[jira] [Resolved] (ARROW-2159) [JS] Support custom predicates

2018-02-19 Thread Brian Hulette (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette resolved ARROW-2159.
--
   Resolution: Fixed
Fix Version/s: JS-0.3.0

Issue resolved by pull request 1616
[https://github.com/apache/arrow/pull/1616]

> [JS] Support custom predicates
> --
>
> Key: ARROW-2159
> URL: https://issues.apache.org/jira/browse/ARROW-2159
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.3.0
>
>
> Right now the `DataFrame` interface only supports a pretty basic set of 
> operations, which could be limiting to users. We should add the ability for 
> the user to define their own predicates using callback functions.
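The merged TypeScript adds `custom(next, bind)`, where `bind(batch)` lets the callback capture per-batch column accessors and then returns the per-row predicate function. A minimal Python sketch of that same bind-then-evaluate pattern (the class, the toy batch, and the accessor names are illustrative, not Arrow API):

```python
class CustomPredicate:
    """Predicate whose bind step captures per-batch accessors,
    mirroring the TypeScript CustomPredicate in the merged PR."""
    def __init__(self, next_func, bind_func):
        self._next = next_func
        self._bind = bind_func

    def bind(self, batch):
        self._bind(batch)   # let the callback grab column accessors
        return self._next   # per-row evaluation function

# A toy "record batch": two columns of equal length.
batch = {"f32": [0.5, -1.0, 2.0], "i32": [1, 3, -2]}
state = {}

# Equivalent of the f32*i32 > 0 custom predicate in the test diff.
pred = CustomPredicate(
    lambda idx: state["f32"](idx) * state["i32"](idx) > 0,
    lambda b: state.update(f32=b["f32"].__getitem__, i32=b["i32"].__getitem__),
)
keep = pred.bind(batch)
print([i for i in range(3) if keep(i)])  # [0]
```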



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2168) [C++] Build toolchain builds with jemalloc

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369408#comment-16369408
 ] 

ASF GitHub Bot commented on ARROW-2168:
---

wesm commented on issue #1621: ARROW-2168: [C++] Build toolchain on CI with 
jemalloc
URL: https://github.com/apache/arrow/pull/1621#issuecomment-366774573
 
 
   Maybe I am misremembering, but I thought our idea was to exclusively use 
the vendored jemalloc instead of a toolchain version, to avoid symbol 
conflicts or bugs from a version we may not trust?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Build toolchain builds with jemalloc
> --
>
> Key: ARROW-2168
> URL: https://issues.apache.org/jira/browse/ARROW-2168
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have fixed all known problems in the jemalloc 4.x branch and should be 
> able to gradually reactivate it in our builds to get its performance boost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2168) [C++] Build toolchain builds with jemalloc

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369407#comment-16369407
 ] 

ASF GitHub Bot commented on ARROW-2168:
---

wesm closed pull request #1621: ARROW-2168: [C++] Build toolchain on CI with 
jemalloc
URL: https://github.com/apache/arrow/pull/1621
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh
index 4ffe97f67..17b5deb36 100755
--- a/ci/travis_before_script_cpp.sh
+++ b/ci/travis_before_script_cpp.sh
@@ -29,14 +29,6 @@ else
   source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh
 fi
 
-if [ "$ARROW_TRAVIS_USE_TOOLCHAIN" == "1" ]; then
-  # Set up C++ toolchain from conda-forge packages for faster builds
-  source $TRAVIS_BUILD_DIR/ci/travis_install_toolchain.sh
-fi
-
-mkdir -p $ARROW_CPP_BUILD_DIR
-pushd $ARROW_CPP_BUILD_DIR
-
 CMAKE_COMMON_FLAGS="\
 -DARROW_BUILD_BENCHMARKS=ON \
 -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL \
@@ -45,6 +37,15 @@ CMAKE_COMMON_FLAGS="\
 CMAKE_LINUX_FLAGS=""
 CMAKE_OSX_FLAGS=""
 
+if [ "$ARROW_TRAVIS_USE_TOOLCHAIN" == "1" ]; then
+  # Set up C++ toolchain from conda-forge packages for faster builds
+  source $TRAVIS_BUILD_DIR/ci/travis_install_toolchain.sh
+  CMAKE_COMMON_FLAGS="${CMAKE_COMMON_FLAGS} -DARROW_JEMALLOC=ON"
+fi
+
+mkdir -p $ARROW_CPP_BUILD_DIR
+pushd $ARROW_CPP_BUILD_DIR
+
 if [ $only_library_mode == "yes" ]; then
   CMAKE_COMMON_FLAGS="\
 $CMAKE_COMMON_FLAGS \
diff --git a/ci/travis_install_toolchain.sh b/ci/travis_install_toolchain.sh
index e01a084da..60cdc36a2 100755
--- a/ci/travis_install_toolchain.sh
+++ b/ci/travis_install_toolchain.sh
@@ -24,7 +24,7 @@ source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh
 if [ ! -e $CPP_TOOLCHAIN ]; then
 # Set up C++ toolchain from conda-forge packages for faster builds
 conda create -y -q -p $CPP_TOOLCHAIN python=2.7 \
-jemalloc=4.4.0 \
+jemalloc=4.5.0.post \
 nomkl \
 boost-cpp \
 rapidjson \


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Build toolchain builds with jemalloc
> --
>
> Key: ARROW-2168
> URL: https://issues.apache.org/jira/browse/ARROW-2168
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have fixed all known problems in the jemalloc 4.x branch and should be 
> able to gradually reactivate it in our builds to get its performance boost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2168) [C++] Build toolchain builds with jemalloc

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2168.
-
Resolution: Fixed

Issue resolved by pull request 1621
[https://github.com/apache/arrow/pull/1621]

> [C++] Build toolchain builds with jemalloc
> --
>
> Key: ARROW-2168
> URL: https://issues.apache.org/jira/browse/ARROW-2168
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have fixed all known problems in the jemalloc 4.x branch and should be 
> able to gradually reactivate it in our builds to get its performance boost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2178) [JS] Fix JS html FileReader example

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2178.
-
   Resolution: Fixed
Fix Version/s: (was: JS-0.3.0)
   0.9.0

Issue resolved by pull request 1614
[https://github.com/apache/arrow/pull/1614]

>  [JS] Fix JS html FileReader example
> 
>
> Key: ARROW-2178
> URL: https://issues.apache.org/jira/browse/ARROW-2178
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Paul Taylor
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2178) [JS] Fix JS html FileReader example

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369404#comment-16369404
 ] 

ASF GitHub Bot commented on ARROW-2178:
---

wesm closed pull request #1614: ARROW-2178: [JS] Fix JS html FileReader example
URL: https://github.com/apache/arrow/pull/1614
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/js/examples/read_file.html b/js/examples/read_file.html
index 3093622fc..3e082d9dc 100644
--- a/js/examples/read_file.html
+++ b/js/examples/read_file.html
@@ -29,6 +29,7 @@
 }
 table, th, td {
   border: 1px solid black;
+  white-space: nowrap;
 }
 
 

[jira] [Created] (ARROW-2178) [JS] Fix JS html FileReader example

2018-02-19 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-2178:


 Summary:  [JS] Fix JS html FileReader example
 Key: ARROW-2178
 URL: https://issues.apache.org/jira/browse/ARROW-2178
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Brian Hulette
Assignee: Paul Taylor
 Fix For: JS-0.3.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2177:
-
Description: 
Allowing both negative and positive scale makes it ambiguous what the scale of 
a number should be when using exponential notation, e.g., {{0.01E3}}. Should 
that have a precision of 4 and a scale of 2, since it's specified with 2 points 
to the right of the decimal and it evaluates to 10? Or a precision of 1 and a 
scale of -1?

Currently it's the latter, but I think it should be the former.

  was:Allowing both negative and positive scale makes it ambiguous what the 
scale of a number should be when using exponential notation, e.g., 
{{0.01E3}}. Should that have a precision of 4 and a scale of 2, since it's 
specified with 2 points to the right of the decimal and it evaluates to 10? Or 
a precision of 1 and a scale of -1?


> [C++] Remove support for specifying negative scale values in DecimalType
> 
>
> Key: ARROW-2177
> URL: https://issues.apache.org/jira/browse/ARROW-2177
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Allowing both negative and positive scale makes it ambiguous what the scale 
> of a number should be when using exponential notation, e.g., {{0.01E3}}. 
> Should that have a precision of 4 and a scale of 2, since it's specified 
> with 2 points to the right of the decimal and it evaluates to 10? Or a 
> precision of 1 and a scale of -1?
> Currently it's the latter, but I think it should be the former.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
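The "precision of 1 and a scale of -1" reading described above can be observed directly with Python's stdlib decimal module, which normalizes {{0.01E3}} to coefficient 1 with exponent +1 (shown as a sketch of the semantics, independent of Arrow's DecimalType):

```python
from decimal import Decimal

d = Decimal("0.01E3")       # evaluates to 1E+1, i.e. 10
sign, digits, exponent = d.as_tuple()

precision = len(digits)     # number of significant digits
scale = -exponent           # digits right of the decimal point
print(d, precision, scale)  # 1E+1 1 -1
```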


[jira] [Updated] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2177:
-
Description: Allowing both negative and positive scale makes it ambiguous 
what the scale of a number should be when using exponential notation, e.g., 
{{0.01E3}}. Should that have a precision of 4 and a scale of 2, since it's 
specified with 2 points to the right of the decimal and it evaluates to 10? Or 
a precision of 1 and a scale of -1?  (was: Allowing both negative and positive 
scale makes it ambiguous what the scale of a number should be when using 
exponential notation, e.g., {{0.01E3}}. Should that have a precision of 2 and a 
scale of 2, since it's specified with 2 points to the right of the decimal and 
it evaluates to 10? Or a precision of 1 and a scale of -1?)

> [C++] Remove support for specifying negative scale values in DecimalType
> 
>
> Key: ARROW-2177
> URL: https://issues.apache.org/jira/browse/ARROW-2177
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Allowing both negative and positive scale makes it ambiguous what the scale 
> of a number should be when using exponential notation, e.g., {{0.01E3}}. 
> Should that have a precision of 4 and a scale of 2, since it's specified 
> with 2 points to the right of the decimal and it evaluates to 10? Or a 
> precision of 1 and a scale of -1?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2177:
-
Description: Allowing both negative and positive scale makes it ambiguous 
what the scale of a number should be when using exponential notation, e.g., 
{{0.01E3}}. Should that have a precision of 2 and a scale of 2, since it's 
specified with 2 points to the right of the decimal and it evaluates to 10? Or 
a precision of 1 and a scale of -1?

> [C++] Remove support for specifying negative scale values in DecimalType
> 
>
> Key: ARROW-2177
> URL: https://issues.apache.org/jira/browse/ARROW-2177
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Allowing both negative and positive scale makes it ambiguous what the scale 
> of a number should be when using exponential notation, e.g., {{0.01E3}}. 
> Should that have a precision of 2 and a scale of 2, since it's specified 
> with 2 points to the right of the decimal and it evaluates to 10? Or a 
> precision of 1 and a scale of -1?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2143) [Python] Provide a manylinux1 wheel for cp27m

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369379#comment-16369379
 ] 

ASF GitHub Bot commented on ARROW-2143:
---

wesm closed pull request #1603: ARROW-2143: [Python] Provide a manylinux1 wheel 
for cp27m
URL: https://github.com/apache/arrow/pull/1603
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/manylinux1/Dockerfile-x86_64 
b/python/manylinux1/Dockerfile-x86_64
index 919a32be7..ec520338f 100644
--- a/python/manylinux1/Dockerfile-x86_64
+++ b/python/manylinux1/Dockerfile-x86_64
@@ -14,7 +14,7 @@
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
-FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
+FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-2143
 
 ADD arrow /arrow
 WORKDIR /arrow/cpp
diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh
index 6bed451d2..5fd27c8d0 100755
--- a/python/manylinux1/build_arrow.sh
+++ b/python/manylinux1/build_arrow.sh
@@ -25,10 +25,8 @@
 # Build upon the scripts in https://github.com/matthew-brett/manylinux-builds
 # * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause)
 
-PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7 3.4 3.5 3.6}"
-
-# Package index with only manylinux1 builds
-MANYLINUX_URL=https://nipy.bic.berkeley.edu/manylinux
+# Build different python versions with various unicode widths
+PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7,16 2.7,32 3.4,16 3.5,16 3.6,16}"
 
 source /multibuild/manylinux_utils.sh
 
@@ -48,40 +46,44 @@ export PYARROW_CMAKE_OPTIONS='-DTHRIFT_HOME=/usr'
 # Ensure the target directory exists
 mkdir -p /io/dist
 
-for PYTHON in ${PYTHON_VERSIONS}; do
-PYTHON_INTERPRETER="$(cpython_path $PYTHON)/bin/python"
-PIP="$(cpython_path $PYTHON)/bin/pip"
-PIPI_IO="$PIP install -f $MANYLINUX_URL"
-PATH="$PATH:$(cpython_path $PYTHON)"
+for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
+IFS=","
+set -- $PYTHON_TUPLE;
+PYTHON=$1
+U_WIDTH=$2
+CPYTHON_PATH="$(cpython_path $PYTHON ${U_WIDTH})"
+PYTHON_INTERPRETER="${CPYTHON_PATH}/bin/python"
+PIP="${CPYTHON_PATH}/bin/pip"
+PATH="$PATH:${CPYTHON_PATH}"
 
 echo "=== (${PYTHON}) Building Arrow C++ libraries ==="
-ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON}
+ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON}-${U_WIDTH}
 mkdir -p "${ARROW_BUILD_DIR}"
 pushd "${ARROW_BUILD_DIR}"
-PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=off 
-DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON 
-DPythonInterp_FIND_VERSION=${PYTHON} -DARROW_PLASMA=ON -DARROW_ORC=ON ..
+PATH="${CPYTHON_PATH}/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release 
-DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF 
-DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=off 
-DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON 
-DPythonInterp_FIND_VERSION=${PYTHON} -DARROW_PLASMA=ON -DARROW_ORC=ON ..
 make -j5 install
 popd
 
 # Clear output directory
 rm -rf dist/
 echo "=== (${PYTHON}) Building wheel ==="
-PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py 
build_ext --inplace --with-parquet --with-static-parquet --bundle-arrow-cpp
-PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py 
bdist_wheel
+PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py build_ext 
--inplace --with-parquet --with-static-parquet --bundle-arrow-cpp
+PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py bdist_wheel
 
 echo "=== (${PYTHON}) Test the existence of optional modules ==="
-$PIPI_IO -r requirements.txt
-PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
-PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.plasma"
+$PIP install -r requirements.txt
+PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
+PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.plasma"
 
 echo "=== (${PYTHON}) Tag the wheel with manylinux1 ==="
 mkdir -p repaired_wheels/
 auditwheel -v repair -L . dist/pyarrow-*.whl -w repaired_wheels/
 
 echo "=== (${PYTHON}) Testing manylinux1 wheel ==="
-source /venv-test-${PYTHON}/bin/activate
+source /venv-test-${PYTHON}-${U_WIDTH}/bin/activate
 pip install repaired_wheels/*.whl
 
-py.test -v -r sxX --durations=15 
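The tuple-splitting idiom at the top of this hunk (setting `IFS` to a comma so that `set --` assigns the fields of each `version,width` pair to the positional parameters) can be sketched in isolation. The `PYTHON_VERSIONS` values below are illustrative placeholders, not the real list from the build script, and the `unset IFS` hygiene step is an addition not present in the quoted diff:

```shell
#!/bin/sh
# Hypothetical "python-version,unicode-width" tuples, standing in for the
# script's real PYTHON_VERSIONS variable.
PYTHON_VERSIONS="2.7,16 2.7,32 3.6,16"

# The for-list is word-split once, with the default IFS, before the body runs,
# so changing IFS inside the loop does not affect iteration over the tuples.
for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
    IFS=","            # split the tuple on commas...
    set -- $PYTHON_TUPLE   # ...so $1 and $2 receive the two fields
    unset IFS          # restore default word splitting for the loop body
    PYTHON=$1
    U_WIDTH=$2
    echo "python=${PYTHON} ucs-width=${U_WIDTH}"
done
```

With the illustrative list above, this prints one `python=… ucs-width=…` line per tuple, which is exactly how the build script derives `$PYTHON` and `$U_WIDTH` per wheel.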

[jira] [Resolved] (ARROW-2143) [Python] Provide a manylinux1 wheel for cp27m

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2143.
-
Resolution: Fixed

Issue resolved by pull request 1603
[https://github.com/apache/arrow/pull/1603]

> [Python] Provide a manylinux1 wheel for cp27m
> -
>
> Key: ARROW-2143
> URL: https://issues.apache.org/jira/browse/ARROW-2143
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently we only provide it for cp27mu, we should also build them for cp27m



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2171) [Python] OwnedRef is fragile

2018-02-19 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2171.
-
Resolution: Fixed

Issue resolved by pull request 1626
[https://github.com/apache/arrow/pull/1626]

> [Python] OwnedRef is fragile
> 
>
> Key: ARROW-2171
> URL: https://issues.apache.org/jira/browse/ARROW-2171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Some uses of OwnedRef can implicitly invoke its (default) copy constructor, 
> which will lead to extraneous decrefs.





[jira] [Commented] (ARROW-2171) [Python] OwnedRef is fragile

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369376#comment-16369376
 ] 

ASF GitHub Bot commented on ARROW-2171:
---

wesm commented on a change in pull request #1626: ARROW-2171: [C++/Python] Make 
OwnedRef safer
URL: https://github.com/apache/arrow/pull/1626#discussion_r169141523
 
 

 ##
 File path: cpp/src/arrow/python/common.h
 ##
 @@ -98,6 +101,10 @@ class ARROW_EXPORT OwnedRef {
 // (e.g. if it is released in the middle of a function for performance reasons)
 class ARROW_EXPORT OwnedRefNoGIL : public OwnedRef {
  public:
+  OwnedRefNoGIL() : OwnedRef() {}
+  OwnedRefNoGIL(OwnedRefNoGIL&& other) : OwnedRef(other.detach()) {}
+  explicit OwnedRefNoGIL(PyObject* obj) : OwnedRef(obj) {}
 
 Review comment:
   Got it, thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] OwnedRef is fragile
> 
>
> Key: ARROW-2171
> URL: https://issues.apache.org/jira/browse/ARROW-2171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Some uses of OwnedRef can implicitly invoke its (default) copy constructor, 
> which will lead to extraneous decrefs.





[jira] [Commented] (ARROW-2171) [Python] OwnedRef is fragile

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369378#comment-16369378
 ] 

ASF GitHub Bot commented on ARROW-2171:
---

wesm closed pull request #1626: ARROW-2171: [C++/Python] Make OwnedRef safer
URL: https://github.com/apache/arrow/pull/1626
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h
index 269385c1a..b2844b18c 100644
--- a/cpp/src/arrow/python/common.h
+++ b/cpp/src/arrow/python/common.h
@@ -67,7 +67,7 @@ class ARROW_EXPORT PyAcquireGIL {
 class ARROW_EXPORT OwnedRef {
  public:
   OwnedRef() : obj_(NULLPTR) {}
-
+  OwnedRef(OwnedRef&& other) : OwnedRef(other.detach()) {}
   explicit OwnedRef(PyObject* obj) : obj_(obj) {}
 
   ~OwnedRef() { reset(); }
@@ -90,6 +90,8 @@ class ARROW_EXPORT OwnedRef {
  PyObject** ref() { return &obj_; }
 
  private:
+  ARROW_DISALLOW_COPY_AND_ASSIGN(OwnedRef);
+
   PyObject* obj_;
 };
 
@@ -98,6 +100,10 @@ class ARROW_EXPORT OwnedRef {
 // (e.g. if it is released in the middle of a function for performance reasons)
 class ARROW_EXPORT OwnedRefNoGIL : public OwnedRef {
  public:
+  OwnedRefNoGIL() : OwnedRef() {}
+  OwnedRefNoGIL(OwnedRefNoGIL&& other) : OwnedRef(other.detach()) {}
+  explicit OwnedRefNoGIL(PyObject* obj) : OwnedRef(obj) {}
+
   ~OwnedRefNoGIL() {
 PyAcquireGIL lock;
 reset();
diff --git a/cpp/src/arrow/python/python-test.cc 
b/cpp/src/arrow/python/python-test.cc
index bcf89a4f6..a2b832bdb 100644
--- a/cpp/src/arrow/python/python-test.cc
+++ b/cpp/src/arrow/python/python-test.cc
@@ -42,6 +42,40 @@ TEST(PyBuffer, InvalidInputObject) {
   ASSERT_EQ(old_refcnt, Py_REFCNT(input));
 }
 
+TEST(OwnedRef, TestMoves) {
+  PyAcquireGIL lock;
+  std::vector<OwnedRef> vec;
+  PyObject *u, *v;
+  u = PyList_New(0);
+  v = PyList_New(0);
+  {
+    OwnedRef ref(u);
+    vec.push_back(std::move(ref));
+    ASSERT_EQ(ref.obj(), nullptr);
+  }
+  vec.emplace_back(v);
+  ASSERT_EQ(Py_REFCNT(u), 1);
+  ASSERT_EQ(Py_REFCNT(v), 1);
+}
+
+TEST(OwnedRefNoGIL, TestMoves) {
+  std::vector<OwnedRefNoGIL> vec;
+  PyObject *u, *v;
+  {
+    PyAcquireGIL lock;
+    u = PyList_New(0);
+    v = PyList_New(0);
+  }
+  {
+    OwnedRefNoGIL ref(u);
+    vec.push_back(std::move(ref));
+    ASSERT_EQ(ref.obj(), nullptr);
+  }
+  vec.emplace_back(v);
+  ASSERT_EQ(Py_REFCNT(u), 1);
+  ASSERT_EQ(Py_REFCNT(v), 1);
+}
+
 class DecimalTest : public ::testing::Test {
  public:
   DecimalTest() : lock_(), decimal_module_(), decimal_constructor_() {


 




> [Python] OwnedRef is fragile
> 
>
> Key: ARROW-2171
> URL: https://issues.apache.org/jira/browse/ARROW-2171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Some uses of OwnedRef can implicitly invoke its (default) copy constructor, 
> which will lead to extraneous decrefs.





[jira] [Assigned] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-02-19 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-2177:


Assignee: Phillip Cloud

> [C++] Remove support for specifying negative scale values in DecimalType
> 
>
> Key: ARROW-2177
> URL: https://issues.apache.org/jira/browse/ARROW-2177
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>





