[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433402#comment-16433402
 ] 

ASF GitHub Bot commented on ARROW-2445:
---

xhochy commented on a change in pull request #1881: ARROW-2445: [Rust] Add 
documentation and make some fields private
URL: https://github.com/apache/arrow/pull/1881#discussion_r180635435
 
 

 ##
 File path: rust/src/array.rs
 ##
 @@ -86,13 +90,25 @@ impl Array {
 }
 }
 
+/// Get a reference to the array data
 pub fn data(&self) -> &ArrayData {
 &self.data
 }
 
+/// number of elements in the array
 pub fn len(&self) -> usize {
 self.len as usize
 }
+
+/// number of null elements in the array
+pub fn null_count(&self) -> usize {
+self.null_count as usize
+}
+
+/// If null_count is greater than zero then the validity_bitmap will be 
Some(Bitmap)
+pub fn validity_bitmap(&self) -> &Option {
 
 Review comment:
   We have seen some cases in C++ that it is beneficial in a lot of cases to 
work directly on the bitmap as the compiler can drastically speedup access to 
it using SIMD instructions.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433351#comment-16433351
 ] 

ASF GitHub Bot commented on ARROW-2407:
---

kou closed pull request #1845: ARROW-2407: [GLib] Add 
garrow_string_array_builder_append_values()
URL: https://github.com/apache/arrow/pull/1845
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c_glib/arrow-glib/array-builder.cpp 
b/c_glib/arrow-glib/array-builder.cpp
index 7625bcd51..f12999881 100644
--- a/c_glib/arrow-glib/array-builder.cpp
+++ b/c_glib/arrow-glib/array-builder.cpp
@@ -443,7 +443,7 @@ 
garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder,
  * @is_valids_length: The length of `is_valids`.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple values at once. It's efficient than multiple
+ * Append multiple values at once. It's more efficient than multiple
  * `append()` and `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -495,7 +495,7 @@ 
garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder,
  * @n: The number of null values to be appended.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple nulls at once. It's efficient than multiple
+ * Append multiple nulls at once. It's more efficient than multiple
  * `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -580,7 +580,7 @@ garrow_int_array_builder_append(GArrowIntArrayBuilder 
*builder,
  * @is_valids_length: The length of `is_valids`.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple values at once. It's efficient than multiple
+ * Append multiple values at once. It's more efficient than multiple
  * `append()` and `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -630,7 +630,7 @@ garrow_int_array_builder_append_null(GArrowIntArrayBuilder 
*builder,
  * @n: The number of null values to be appended.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple nulls at once. It's efficient than multiple
+ * Append multiple nulls at once. It's more efficient than multiple
  * `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -715,7 +715,7 @@ garrow_uint_array_builder_append(GArrowUIntArrayBuilder 
*builder,
  * @is_valids_length: The length of `is_valids`.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple values at once. It's efficient than multiple
+ * Append multiple values at once. It's more efficient than multiple
  * `append()` and `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -765,7 +765,7 @@ 
garrow_uint_array_builder_append_null(GArrowUIntArrayBuilder *builder,
  * @n: The number of null values to be appended.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple nulls at once. It's efficient than multiple
+ * Append multiple nulls at once. It's more efficient than multiple
  * `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -845,7 +845,7 @@ garrow_int8_array_builder_append(GArrowInt8ArrayBuilder 
*builder,
  * @is_valids_length: The length of `is_valids`.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple values at once. It's efficient than multiple
+ * Append multiple values at once. It's more efficient than multiple
  * `append()` and `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -893,7 +893,7 @@ 
garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder,
  * @n: The number of null values to be appended.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple nulls at once. It's efficient than multiple
+ * Append multiple nulls at once. It's more efficient than multiple
  * `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -973,7 +973,7 @@ garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder 
*builder,
  * @is_valids_length: The length of `is_valids`.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
- * Append multiple values at once. It's efficient than multiple
+ * Append multiple values at once. It's more efficient than multiple
  * `append()` and `append_null()` calls.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
@@ -1021,7 +1021,7 @@ 
garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder,
  * @n: The number of null values

[jira] [Resolved] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()

2018-04-10 Thread Kouhei Sutou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-2407.
-
Resolution: Fixed

Issue resolved by pull request 1845
[https://github.com/apache/arrow/pull/1845]

> [GLib] Add garrow_string_array_builder_append_values()
> --
>
> Key: ARROW-2407
> URL: https://issues.apache.org/jira/browse/ARROW-2407
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Affects Versions: 0.9.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1642#comment-1642
 ] 

ASF GitHub Bot commented on ARROW-2445:
---

liurenjie1024 commented on a change in pull request #1881: ARROW-2445: [Rust] 
Add documentation and make some fields private
URL: https://github.com/apache/arrow/pull/1881#discussion_r180619565
 
 

 ##
 File path: rust/src/array.rs
 ##
 @@ -69,10 +69,14 @@ arraydata_from_primitive!(u32, UInt32);
 arraydata_from_primitive!(u64, UInt64);
 
 pub struct Array {
-pub len: i32,
-pub null_count: i32,
-pub validity_bitmap: Option,
-pub data: ArrayData,
+/// number of elements in the array
+len: i32,
+/// number of null elements in the array
+null_count: i32,
+/// If null_count is greater than zero then the validity_bitmap will be 
Some(Bitmap)
+validity_bitmap: Option,
+/// The array of elements
+data: ArrayData,
 
 Review comment:
   Add a field for data type?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1641#comment-1641
 ] 

ASF GitHub Bot commented on ARROW-2445:
---

liurenjie1024 commented on a change in pull request #1881: ARROW-2445: [Rust] 
Add documentation and make some fields private
URL: https://github.com/apache/arrow/pull/1881#discussion_r180619803
 
 

 ##
 File path: rust/src/array.rs
 ##
 @@ -86,13 +90,25 @@ impl Array {
 }
 }
 
+/// Get a reference to the array data
 pub fn data(&self) -> &ArrayData {
 &self.data
 }
 
+/// number of elements in the array
 pub fn len(&self) -> usize {
 self.len as usize
 }
+
+/// number of null elements in the array
+pub fn null_count(&self) -> usize {
+self.null_count as usize
+}
+
+/// If null_count is greater than zero then the validity_bitmap will be 
Some(Bitmap)
+pub fn validity_bitmap(&self) -> &Option {
 
 Review comment:
   I don't think it's a good idea to expose validity bitmap directly to user. I 
think we should just add methods to test whether a position is set.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread Andy Grove (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-2445:
--
Summary: [Rust] Add documentation and make some fields private  (was: Add 
documentation and make some fields private)

> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433305#comment-16433305
 ] 

ASF GitHub Bot commented on ARROW-2445:
---

andygrove opened a new pull request #1881: ARROW-2445: [Rust] Add documentation 
and make some fields private
URL: https://github.com/apache/arrow/pull/1881
 
 
   No functional changes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread Andy Grove (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433306#comment-16433306
 ] 

Andy Grove commented on ARROW-2445:
---

PR: https://github.com/apache/arrow/pull/1881

> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2445) [Rust] Add documentation and make some fields private

2018-04-10 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2445:
--
Labels: pull-request-available  (was: )

> [Rust] Add documentation and make some fields private
> -
>
> Key: ARROW-2445
> URL: https://issues.apache.org/jira/browse/ARROW-2445
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> A first pass at adding rustdoc comments and made some struct fields private 
> and added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2445) Add documentation and make some fields private

2018-04-10 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2445:
-

 Summary: Add documentation and make some fields private
 Key: ARROW-2445
 URL: https://issues.apache.org/jira/browse/ARROW-2445
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 0.10.0


A first pass at adding rustdoc comments and made some struct fields private and 
added accessor methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433294#comment-16433294
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

liurenjie1024 commented on a change in pull request #1875: ARROW-2435: [Rust] 
Add memory pool abstraction.
URL: https://github.com/apache/arrow/pull/1875#discussion_r180616961
 
 

 ##
 File path: rust/src/memory_pool.rs
 ##
 @@ -0,0 +1,93 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use libc;
+use std::mem;
+use std::cmp;
+
+use super::error::ArrowError;
+use super::error::Result;
+
+const ALIGNMENT: usize = 64;
+
+/// Memory pool for allocating memory. It's also responsible for tracking 
memory usage.
+pub trait MemoryPool {
+/// Allocate memory.
+/// The implementation should ensures that allocated memory is aligned.
+fn allocate(&self, size: usize) -> Result<*mut u8>;
+
+/// Reallocate memory.
+/// If the implementation doesn't support reallocating aligned memory, it 
allocates new memory
+/// and copied old memory to it.
+fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) 
-> Result<*mut u8>;
+
+/// Free memory.
+fn free(&self, ptr: *mut u8);
+}
+
+/// Implementation of memory pool using lib api.
+#[allow(dead_code)]
+struct LibcMemoryPool;
+
+impl MemoryPool for LibcMemoryPool {
+fn allocate(&self, size: usize) -> Result<*mut u8> {
+unsafe {
+let mut page: *mut libc::c_void = mem::uninitialized();
+let result = libc::posix_memalign(&mut page, ALIGNMENT, size);
+match result {
+0 => Ok(mem::transmute::<*mut libc::c_void, *mut u8>(page)),
+_ => Err(ArrowError::MemoryError(
+"Failed to allocate memory".to_string(),
+)),
+}
+}
+}
+
+fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) 
-> Result<*mut u8> {
+unsafe {
+let result = self.allocate(new_size)?;
+let dst = mem::transmute::<*mut u8, *mut libc::c_void>(result);
+libc::memcpy(
+dst,
+mem::transmute::<*mut u8, *const libc::c_void>(pointer),
+cmp::min(old_size, new_size),
+);
 
 Review comment:
   Thanks for the review, and I've fixed it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add memory pool abstraction as the c++ api.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433217#comment-16433217
 ] 

ASF GitHub Bot commented on ARROW-2407:
---

kou commented on issue #1845: ARROW-2407: [GLib] Add 
garrow_string_array_builder_append_values()
URL: https://github.com/apache/arrow/pull/1845#issuecomment-380286802
 
 
   OK.
   I've rebased and changed to use the new `Append()` method.
   
   If CI is passed, I'll merge this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [GLib] Add garrow_string_array_builder_append_values()
> --
>
> Key: ARROW-2407
> URL: https://issues.apache.org/jira/browse/ARROW-2407
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Affects Versions: 0.9.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433202#comment-16433202
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180601792
 
 

 ##
 File path: cpp/src/arrow/python/decimal.cc
 ##
 @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t 
suggested_precision, int32_t suggested_sc
 }
 
 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
 
 Review comment:
   So you mean remove `PyDecimal_Check` all together?  This is only called when 
the type is not specified by the user and then yes, it will end doing 2 passes 
over the objects and checks both times if they are decimal.  It might be 
possible to do less checks on the second pass if we keep a list of which ones 
are decimal objects, but I'm not sure that would be worth it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433206#comment-16433206
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180602058
 
 

 ##
 File path: cpp/src/arrow/python/decimal.cc
 ##
 @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t 
suggested_precision, int32_t suggested_sc
 }
 
 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
 
 Review comment:
   Fair enough, we can optimize later if we find it too slow. The conversion 
itself is very slow anyway :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433182#comment-16433182
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380279472
 
 
   > I would think it's okay to remove them if they weren't there before the PR 
that introduced boost-regex.
   
   It seems the boost-regex references in the Python build files were 
introduced specifically for parquet: see cb5da9c1 and 8b1c8118.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433180#comment-16433180
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278833
 
 
   Oh, you're right, I had misread the regex (it has `[-+]?` for the exponent 
part). The behaviour is unchanged then.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433178#comment-16433178
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278645
 
 
   > By the way, I don't know if it's ok to remove boost-regex references from 
the Python build files, since parquet still requires boost-regex AFAIR.
   
   @xhochy Can you comment on this?
   
   I would think it's okay to remove them if they weren't there before the PR 
that introduced boost-regex.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433175#comment-16433175
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278461
 
 
   I think the current version supports that, does it not? I remember adding 
that.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433176#comment-16433176
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278524
 
 
   I may be misunderstand the usage of "should" here :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433174#comment-16433174
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278313
 
 
   Another note: the new implementation should support exponents without a 
sign, e.g. "1.23e45". The Python decimal module also does this. I need to add 
tests, though.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433170#comment-16433170
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex 
dependency
URL: https://github.com/apache/arrow/pull/1880#issuecomment-380277704
 
 
   By the way, I don't know if it's ok to remove boost-regex references from 
the Python build files, since parquet still requires boost-regex AFAIR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433160#comment-16433160
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180594466
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   By no means should you feel obliged to do that extra work :) If this works, 
then that's great. I thought maybe you had chosen not to use iterators for a 
particular reason.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433158#comment-16433158
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180594140
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, &out->whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, &out->fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, &out->exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433155#comment-16433155
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180593829
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   I'll try an iterator version tomorrow and see if I like it better. It's a 
bit late tonight ;-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433154#comment-16433154
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180593708
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   Oh, I just mean that iterators abstract over pointer arithmetic and are 
built into the STL. It feels like C as opposed to C++, that's all.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433138#comment-16433138
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592157
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
 Review comment:
   don't need parens here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433150#comment-16433150
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180593109
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, &out->whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, &out->fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, &out->exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents(

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433147#comment-16433147
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592851
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
 Review comment:
   I was too lazy to look it up :-) Will do.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433146#comment-16433146
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592739
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   I find it much readable like that. I don't know what "subverting the 
language" means.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433145#comment-16433145
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592574
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
+  size_t pos = 0;
 
-if (scale != nullptr) {
-  *scale = 0;
+  if (size == 0) {
+return false;
+  }
+  // Sign of the number
+  if (IsSign(s[pos])) {
+out->sign = std::string(s + pos, 1);
+++pos;
+  }
+  // First run of digits
+  pos = ParseDigitsRun(s, pos, size, &out->whole_digits);
+  if (pos == size) {
+return !out->whole_digits.empty();
+  }
+  // Optional dot (if given in fractional form)
+  bool has_dot = IsDot(s[pos]);
+  if (has_dot) {
+// Second run of digits
+++pos;
+pos = ParseDigitsRun(s, pos, size, &out->fractional_digits);
+  }
+  if (out->whole_digits.empty() && out->fractional_digits.empty()) {
+// Need at least some digits (whole or fractional)
+return false;
+  }
+  if (pos == size) {
+return true;
+  }
+  // Optional exponent
+  if (StartsExponent(s[pos])) {
+++pos;
+if (pos == size) {
+  return false;
+}
+// Optional exponent sign
+if (IsSign(s[pos])) {
+  out->exponent_sign = std::string(s + pos, 1);
+  ++pos;
+}
+pos = ParseDigitsRun(s, pos, size, &out->exponent_digits);
+if (out->exponent_digits.empty()) {
+  // Need some exponent digits
+  return false;
 }
-
-*out = 0;
-return Status::OK();
   }
+  return pos == size;
+}
 
-  boost::smatch results;
-  const bool matches = boost::regex_match(s, results, DECIMAL_REGEX);
+}  // namespace
 
-  if (!matches) {
-std::stringstream ss;
-ss << "The string " << s << " is not a valid decimal number";
-return Status::Invalid(ss.str());
+Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
+  int32_t* scale) {
+  if (s.empty()) {
+return Status::Invalid("Empty string cannot be converted to decimal");
   }
 
-  const std::string sign = results["SIGN"];
-  const std::string integer = results["INTEGER"];
-
-  const std::string left_digits = results["LEFT_DIGITS"];
-  const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"];
-
-  const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"];
-
-  const std::string first_exp_value = results["FIRST_EXP_VALUE"];
-  const std::string second_exp_value = results["SECOND_EXP_VALUE"];
-
-  std::string whole_part;
-  std::string fractional_part;
-  std::string exponent_value;
-
-  if (!integer.empty()) {
-whole_part = integer;
-  } else if (!left_digits.empty()) {
-DCHECK(second_right_digits.empty()) << s << " " << second_right_digits;
-whole_part = left_digits;
-fractional_part = first_right_digits;
-  } else {
-DCHECK(first_right_digits.empty()) << s << " " << first_right_digits;
-fractional_part = second_right_digits;
+  DecimalComponents dec;
+  if (!ParseDecimalComponents

[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433139#comment-16433139
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180592291
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
 Review comment:
   Why not use `std::isdigit` for this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433135#comment-16433135
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool 
abstraction.
URL: https://github.com/apache/arrow/pull/1875#issuecomment-380273948
 
 
   As CI is not currently setup for Windows I think it’s fine to push ahead 
with this change without windows support, then reopen the windows support jira 
and I’ll address it in a separate PR.
   
   I seem to be the only person on Windows and momentum is more important at 
this stage of the rust implementation.
   
   Once we have windows CI setup I think we try to keep windows support working 
on all PRs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add memory pool abstraction as the c++ api.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433136#comment-16433136
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] 
Remove boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880#discussion_r180591991
 
 

 ##
 File path: cpp/src/arrow/util/decimal.cc
 ##
 @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, 
Decimal128* out) {
   }
 }
 
-static const boost::regex DECIMAL_REGEX(
-// sign of the number
-"(?[-+]?)"
-
-// digits around the decimal point
-
"(((?\\d+)\\.(?\\d*)|\\.(?\\d+)"
-")"
+namespace {
 
-// optional exponent
-"([eE](?[-+]?\\d+))?"
+struct DecimalComponents {
+  std::string sign;
+  std::string whole_digits;
+  std::string fractional_digits;
+  std::string exponent_sign;
+  std::string exponent_digits;
+};
 
-// otherwise
-"|"
+inline bool IsSign(char c) { return (c == '-' || c == '+'); }
 
-// we're just an integer
-"(?\\d+)"
+inline bool IsDot(char c) { return c == '.'; }
 
-// or an integer with an exponent
-"(?:[eE](?[-+]?\\d+))?)");
+inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); }
 
-static inline bool is_zero_character(char c) { return c == '0'; }
+inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); }
 
-Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* 
precision,
-  int32_t* scale) {
-  if (s.empty()) {
-return Status::Invalid("Empty string cannot be converted to decimal");
+inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, 
std::string* out) {
+  size_t pos;
+  for (pos = start; pos < size; ++pos) {
+if (!IsDigit(s[pos])) {
+  break;
+}
   }
+  *out = std::string(s + start, pos - start);
+  return pos;
+}
 
-  // case of all zeros
-  if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) {
-if (precision != nullptr) {
-  *precision = 0;
-}
+bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* 
out) {
 
 Review comment:
   Why not use iterators instead of a pointer plus size? It seems like we're 
subverting the language.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2224:
--
Labels: pull-request-available  (was: )

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433117#comment-16433117
 ] 

ASF GitHub Bot commented on ARROW-2224:
---

pitrou opened a new pull request #1880: [WIP] ARROW-2224: [C++] Remove 
boost-regex dependency
URL: https://github.com/apache/arrow/pull/1880
 
 
   Use a hand-written decimal parser.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432996#comment-16432996
 ] 

Antoine Pitrou commented on ARROW-2224:
---

Note that the day where we really care about decimal performance, we should 
probably use Stefan Krah's libmpdec, which is also used by the Python _decimal 
module. It's a pure C library, is fast and would probably allow us to have 
tighter interoperation with Python decimals.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432995#comment-16432995
 ] 

Phillip Cloud edited comment on ARROW-2224 at 4/10/18 9:11 PM:
---

The original implementation used C++ iterators to implement a handwritten 
parser, but it quickly got complex once we needed to support integers with 
exponential notation. Maybe the complexity of dealing with boost is greater 
than the complexity of debugging a handwritten parser, in which case it would 
be worth it to reimplement it without a regular expression.


was (Author: cpcloud):
The original implementation used C++ to implement a handwritten parser, but it 
quickly got complex once we needed to support integers with exponential 
notation. Maybe the complexity of dealing with boost is greater than the 
complexity of debugging a handwritten parser, in which case it would be worth 
it to reimplement it without a regular expression.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432995#comment-16432995
 ] 

Phillip Cloud commented on ARROW-2224:
--

The original implementation used C++ to implement a handwritten parser, but it 
quickly got complex once we needed to support integers with exponential 
notation. Maybe the complexity of dealing with boost is greater than the 
complexity of debugging a handwritten parser, in which case it would be worth 
it to reimplement it without a regular expression.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432991#comment-16432991
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180568102
 
 

 ##
 File path: cpp/src/arrow/python/decimal.cc
 ##
 @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t 
suggested_precision, int32_t suggested_sc
 }
 
 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
 
 Review comment:
   Right now we are doing the check twice in optimized builds, which is not 
nice IMHO. `DecimalMetadata::Update` is a private API so it's up to the caller 
to provide appropriate input.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432987#comment-16432987
 ] 

Antoine Pitrou commented on ARROW-2224:
---

How important is it to have fast parsing of decimal types? If we reimplement 
our own parsing, we can tweak performance *and* remove the troublesome 
dependency.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432965#comment-16432965
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180564291
 
 

 ##
 File path: cpp/src/arrow/python/decimal.cc
 ##
 @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t 
suggested_precision, int32_t suggested_sc
 }
 
 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
 
 Review comment:
   This isn't necessary because I added a check before calling `Update` but it 
does prevent a seg fault if for some reason it's called with non Decimal 
objects - which is not nice to get.  If it's hurts an optimization though, I 
can remove it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432952#comment-16432952
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180562954
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
+RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  }
 }
 
 type_ =
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432946#comment-16432946
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180562328
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
 
 Review comment:
   Seems like it could be NaN also:
   ```
   In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])
   
   In [6]: s2 = pd.Series([Decimal('2.0'), None])
   
   In [7]: s1 / s2
   Out[7]: 
   00.5
   1NaN
   dtype: object


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432939#comment-16432939
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

cpcloud commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180561137
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
 
 Review comment:
   Python decimal objects can be nan, unfortunately:
   
   ```python
   >>> import decimal
   >>> decimal.Decimal('nan')
   Decimal('NaN')
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432931#comment-16432931
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180559958
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
 
 Review comment:
   I'm not sure, is it possible to get NaNs from operations on Decimals?  Or is 
that something the user might mix in somehow? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432928#comment-16432928
 ] 

Phillip Cloud commented on ARROW-2224:
--

The "should" here isn't related to linking issues, rather, it's related to 
performance. See here for some benchmarks: 
http://lh3lh3.users.sourceforge.net/reb.shtml

Those benchmarks are now coming up on 8 years old, so we should revisit that.

Some of the linking issues might be mitigated by using {{}} on Windows 
and {{regcomp}} on UNIX platforms. I'm unaware of the perils of that solution 
(if any).

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432925#comment-16432925
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180559142
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1326,6 +1326,22 @@ def test_decimal_with_different_precisions(self):
 expected = [decimal.Decimal('0.01000'), decimal.Decimal('0.00100')]
 assert array.to_pylist() == expected
 
+def test_decimal_with_None_explicit_type(self):
+data = [
+decimal.Decimal('3.14'),
+None,
+]
+series = pd.Series(data)
+_check_series_roundtrip(series, type_=pa.decimal128(12, 5))
+
+def test_decimal_with_None_infer_type(self):
 
 Review comment:
   sure


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432926#comment-16432926
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] 
Fix Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180559192
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
+RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  }
 }
 
 type_ =
 
 Review comment:
   I'll add that


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432923#comment-16432923
 ] 

Bryan Cutler commented on ARROW-2432:
-

BTW, it looks like {{pa.array(data, type=pa.decimal128(12, 4), 
from_pandas=True)}} has the same code path that causes the error, as long as 
{{data}} is a pandas.Series

{noformat}
In [5]: pa.array(pd.Series([Decimal('3.14'), None]), from_pandas=True, 
type=pa.decimal128(12, 4))
---
ArrowInvalid  Traceback (most recent call last)
 in ()
> 1 pa.array(pd.Series([Decimal('3.14'), None]), from_pandas=True, 
type=pa.decimal128(12, 4))

ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
object of type NoneType but can only handle these types: decimal.Decimal
{noformat}

> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432890#comment-16432890
 ] 

Antoine Pitrou commented on ARROW-2224:
---

Can you explain the "should"? AFAICS libre2 isn't header-only, so it may 
present the same linking and ABI issues as boost-regex.

> [C++] Replace boost regex usage with libre2
> ---
>
> Key: ARROW-2224
> URL: https://issues.apache.org/jira/browse/ARROW-2224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> We're using {{boost::regex}} to parse decimal strings for {{decimal128}} 
> types. We should use {{libre2}} instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2440) [Rust] Implement ListBuilder

2018-04-10 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-2440.
---
Resolution: Fixed

Issue resolved by pull request 1876
[https://github.com/apache/arrow/pull/1876]

> [Rust] Implement ListBuilder
> ---
>
> Key: ARROW-2440
> URL: https://issues.apache.org/jira/browse/ARROW-2440
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Implement ListBuilder



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2441:
-

Assignee: Andy Grove

> [Rust] Builder::slice_mut assertions are too strict
> --
>
> Key: ARROW-2441
> URL: https://issues.apache.org/jira/browse/ARROW-2441
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The assertions only allow slice up to builder length, rather than up to 
> builder capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432846#comment-16432846
 ] 

ASF GitHub Bot commented on ARROW-2441:
---

pitrou closed pull request #1877: ARROW-2441: [Rust] Builder::slice_mut 
assertions are too strict
URL: https://github.com/apache/arrow/pull/1877
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/rust/src/builder.rs b/rust/src/builder.rs
index 15f492d6d..421b1a9a1 100644
--- a/rust/src/builder.rs
+++ b/rust/src/builder.rs
@@ -65,8 +65,8 @@ impl Builder {
 /// Get the internal byte-aligned memory buffer as a mutable slice
 pub fn slice_mut(&self, start: usize, end: usize) -> &mut [T] {
 assert!(start <= end);
-assert!(start < self.len as usize);
-assert!(end <= self.len as usize);
+assert!(start < self.capacity as usize);
+assert!(end <= self.capacity as usize);
 unsafe {
 slice::from_raw_parts_mut(self.data.offset(start as isize), (end - 
start) as usize)
 }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Builder::slice_mut assertions are too strict
> --
>
> Key: ARROW-2441
> URL: https://issues.apache.org/jira/browse/ARROW-2441
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The assertions only allow slice up to builder length, rather than up to 
> builder capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432848#comment-16432848
 ] 

ASF GitHub Bot commented on ARROW-2440:
---

pitrou closed pull request #1876: ARROW-2440: [Rust] Implement ListBuilder
URL: https://github.com/apache/arrow/pull/1876
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/rust/src/lib.rs b/rust/src/lib.rs
index 6ab3daabb..e0c14db87 100644
--- a/rust/src/lib.rs
+++ b/rust/src/lib.rs
@@ -28,4 +28,5 @@ pub mod builder;
 pub mod datatypes;
 pub mod error;
 pub mod list;
+pub mod list_builder;
 pub mod memory;
diff --git a/rust/src/list.rs b/rust/src/list.rs
index a7b11a628..fad0ed372 100644
--- a/rust/src/list.rs
+++ b/rust/src/list.rs
@@ -17,9 +17,8 @@
 
 use std::str;
 
-use bytes::{BufMut, BytesMut};
-
 use super::buffer::Buffer;
+use super::list_builder::ListBuilder;
 
 pub struct List {
 pub data: Buffer,
@@ -40,19 +39,11 @@ impl List {
 
 impl From> for List {
 fn from(v: Vec) -> Self {
-let mut offsets: Vec = Vec::with_capacity(v.len() + 1);
-let mut buf = BytesMut::with_capacity(v.len() * 32);
-offsets.push(0_i32);
+let mut b: ListBuilder = ListBuilder::with_capacity(v.len());
 v.iter().for_each(|s| {
-let slice = s.as_bytes();
-buf.reserve(slice.len());
-buf.put(slice);
-offsets.push(buf.len() as i32);
+b.push(s.as_bytes());
 });
-List {
-data: Buffer::from(buf.freeze()),
-offsets: Buffer::from(offsets),
-}
+b.finish()
 }
 }
 
diff --git a/rust/src/list_builder.rs b/rust/src/list_builder.rs
new file mode 100644
index 0..0122e680e
--- /dev/null
+++ b/rust/src/list_builder.rs
@@ -0,0 +1,66 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use super::builder::*;
+use super::list::List;
+
+pub struct ListBuilder {
+data: Builder,
+offsets: Builder,
+}
+
+impl ListBuilder {
+pub fn new() -> Self {
+ListBuilder::with_capacity(64)
+}
+
+pub fn with_capacity(n: usize) -> Self {
+let data = Builder::with_capacity(n);
+let mut offsets = Builder::with_capacity(n);
+offsets.push(0_i32);
+ListBuilder { data, offsets }
+}
+
+pub fn push(&mut self, slice: &[T]) {
+self.data.push_slice(slice);
+self.offsets.push(self.data.len() as i32);
+}
+
+pub fn finish(&mut self) -> List {
+List {
+data: self.data.finish(),
+offsets: self.offsets.finish(),
+}
+}
+}
+
+#[cfg(test)]
+mod tests {
+use super::*;
+
+#[test]
+fn test_list_u8() {
+let mut b: ListBuilder = ListBuilder::new();
+b.push("Hello, ".as_bytes());
+b.push("World!".as_bytes());
+let buffer = b.finish();
+
+assert_eq!(2, buffer.len());
+assert_eq!("Hello, ".as_bytes(), buffer.slice(0));
+assert_eq!("World!".as_bytes(), buffer.slice(1));
+}
+}


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Implement ListBuilder
> ---
>
> Key: ARROW-2440
> URL: https://issues.apache.org/jira/browse/ARROW-2440
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Implement ListBuilder



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-2441.
---
Resolution: Fixed

Issue resolved by pull request 1877
[https://github.com/apache/arrow/pull/1877]

> [Rust] Builder::slice_mut assertions are too strict
> --
>
> Key: ARROW-2441
> URL: https://issues.apache.org/jira/browse/ARROW-2441
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The assertions only allow slice up to builder length, rather than up to 
> builder capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432831#comment-16432831
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

andygrove commented on issue #1875: ARROW-2435: [Rust] Add memory pool 
abstraction.
URL: https://github.com/apache/arrow/pull/1875#issuecomment-380221529
 
 
   We should definitely isolate all memory alloc/free in the one file, which I 
think this PR does nicely. We just need to merge the windows changes and fix 
the oustanding issue where realloc is failing to free the old memory and I 
think we're good.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add memory pool abstraction as the c++ api.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432820#comment-16432820
 ] 

ASF GitHub Bot commented on ARROW-2441:
---

andygrove commented on issue #1877: ARROW-2441: [Rust] Builder::slice_mut 
assertions are too strict
URL: https://github.com/apache/arrow/pull/1877#issuecomment-380220388
 
 
   @pitrou yes please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Builder::slice_mut assertions are too strict
> --
>
> Key: ARROW-2441
> URL: https://issues.apache.org/jira/browse/ARROW-2441
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The assertions only allow slice up to builder length, rather than up to 
> builder capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432817#comment-16432817
 ] 

ASF GitHub Bot commented on ARROW-2440:
---

andygrove commented on issue #1876: ARROW-2440: [Rust] Implement ListBuilder
URL: https://github.com/apache/arrow/pull/1876#issuecomment-380220309
 
 
   @pitrou yes please


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Implement ListBuilder
> ---
>
> Key: ARROW-2440
> URL: https://issues.apache.org/jira/browse/ARROW-2440
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Implement ListBuilder



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432799#comment-16432799
 ] 

ASF GitHub Bot commented on ARROW-2440:
---

pitrou commented on issue #1876: ARROW-2440: [Rust] Implement ListBuilder
URL: https://github.com/apache/arrow/pull/1876#issuecomment-380217336
 
 
   Is this good for merging?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Implement ListBuilder
> ---
>
> Key: ARROW-2440
> URL: https://issues.apache.org/jira/browse/ARROW-2440
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Implement ListBuilder



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432797#comment-16432797
 ] 

ASF GitHub Bot commented on ARROW-2441:
---

pitrou commented on issue #1877: ARROW-2441: [Rust] Builder::slice_mut 
assertions are too strict
URL: https://github.com/apache/arrow/pull/1877#issuecomment-380217238
 
 
   Is this good for merging?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Builder::slice_mut assertions are too strict
> --
>
> Key: ARROW-2441
> URL: https://issues.apache.org/jira/browse/ARROW-2441
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The assertions only allow slice up to builder length, rather than up to 
> builder capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432777#comment-16432777
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180532204
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
+RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  }
 }
 
 type_ =
 
 Review comment:
   By the way, what happens here if all items are None? Do we have a test for 
that?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread Joshua Storck (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Storck updated ARROW-1938:
-
Comment: was deleted

(was: Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453)

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432774#comment-16432774
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180531835
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {
 
   if (type_ == NULLPTR) {
 for (PyObject* object : objects) {
-  RETURN_NOT_OK(max_decimal_metadata.Update(object));
+  if (!internal::PandasObjectIsNull(object)) {
 
 Review comment:
   Do we care about accepting other NULL-like objects such as `float('nan')`? 
Otherwise `object != Py_None` is a much faster check.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432778#comment-16432778
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180532566
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -758,22 +760,19 @@ Status NumPyConverter::ConvertDecimals() {
   for (PyObject* object : objects) {
 const int is_decimal = PyObject_IsInstance(object, decimal_type_.obj());
 
-if (ARROW_PREDICT_FALSE(is_decimal == 0)) {
+if (is_decimal == 1) {
+  Decimal128 value;
+  RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, 
&value));
+  RETURN_NOT_OK(builder.Append(value));
+} else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {
 
 Review comment:
   Same question as above: do we care about other NULL-like values than simply 
None?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432776#comment-16432776
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180532831
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1326,6 +1326,22 @@ def test_decimal_with_different_precisions(self):
 expected = [decimal.Decimal('0.01000'), decimal.Decimal('0.00100')]
 assert array.to_pylist() == expected
 
+def test_decimal_with_None_explicit_type(self):
+data = [
+decimal.Decimal('3.14'),
+None,
+]
+series = pd.Series(data)
+_check_series_roundtrip(series, type_=pa.decimal128(12, 5))
+
+def test_decimal_with_None_infer_type(self):
 
 Review comment:
   Can you also check that the expected type is inferred?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432775#comment-16432775
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix 
Pandas decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#discussion_r180531517
 
 

 ##
 File path: cpp/src/arrow/python/decimal.cc
 ##
 @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t 
suggested_precision, int32_t suggested_sc
 }
 
 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
 
 Review comment:
   I don't think it's ok to do this in optimized build. `DecimalMetadata` 
expects you to pass a decimal object. @cpcloud may confirm.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread Joshua Storck (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432773#comment-16432773
 ] 

Joshua Storck commented on ARROW-1938:
--

Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432768#comment-16432768
 ] 

ASF GitHub Bot commented on ARROW-1938:
---

joshuastorck opened a new pull request #453: Bug fix for ARROW-1938
URL: https://github.com/apache/parquet-cpp/pull/453
 
 
   The error was reported here: 
https://issues.apache.org/jira/browse/ARROW-1938.
   
   Because dictionary types are not supported in writing yet, the code converts 
the dictionary column to the actual values first before writing. However, the 
existing code was accidentally using zero as the offset and the length of the 
column as the size. This resulted in writing all of the column values for each 
chunk of the column that was supposed to be written.
   
   The fix is to pass the offset and size when recursively calling through to 
WriteColumnChunk with the "flattened" data. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

2018-04-10 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1938:
--
Labels: pull-request-available  (was: )

> [Python] Error writing to partitioned Parquet dataset
> -
>
> Key: ARROW-1938
> URL: https://issues.apache.org/jira/browse/ARROW-1938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux (Ubuntu 16.04)
>Reporter: Robert Dailey
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
> Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 1
> The command was:
> write_table_values = {'row_group_size': 1}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), 
> '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 
> 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 1} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem but wanted to submit a ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432756#comment-16432756
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler commented on issue #1878: ARROW-2432: [Python] Fix Pandas decimal 
type conversion with None values
URL: https://github.com/apache/arrow/pull/1878#issuecomment-380206853
 
 
   Ping @cpcloud and @pitrou, please take a look when you can, thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432755#comment-16432755
 ] 

ASF GitHub Bot commented on ARROW-2432:
---

BryanCutler opened a new pull request #1878: ARROW-2432: [Python] Fix Pandas 
decimal type conversion with None values
URL: https://github.com/apache/arrow/pull/1878
 
 
   This fixes conversion of Pandas decimal types to Arrow with None values. 
Previously, if the type was specified, an error would occur when checking if 
the object was Decimal. If the type was not specified, a segmentation fault 
would occur when attempting to find the max precision and scale.
   
   Added new tests which include None values for both the above cases.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values

2018-04-10 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2432:
--
Labels: pull-request-available  (was: )

> [Python] from_pandas fails when converting decimals if have None values
> ---
>
> Key: ARROW-2432
> URL: https://issues.apache.org/jira/browse/ARROW-2432
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using from_pandas to convert decimals fails if encounters a value of 
> {{None}}. For example:
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: from decimal import Decimal
> ...:
> In [2]: s_dec = pd.Series([Decimal('3.14'), None])
> In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> ---
> ArrowInvalid Traceback (most recent call last)
>  in ()
> > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2))
> array.pxi in pyarrow.lib.Array.from_pandas()
> array.pxi in pyarrow.lib.array()
> error.pxi in pyarrow.lib.check_status()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error converting from Python objects to Decimal: Got Python 
> object of type NoneType but can only handle these types: decimal.Decimal
> {code}
> The above error is raised when specifying decimal type. When no type is 
> specified, a seg fault happens.
> This previously worked in 0.8.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Bryant Menn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432692#comment-16432692
 ] 

Bryant Menn commented on ARROW-2367:


So would the ideal implementation then create {{pa.chunked_array()}} and have 
{{pa.Table.from_pandas()}} call that instead or would there be some criteria to 
decide between {{pa.array()}} (I believe this is the current flow for 
{{pa.Tablefrom_pandas()}}) and {{pa.chunked_array()}}? I was trying to follow 
the process implemented for {{BinaryArray}} conversion to {{ChunkedArray}} but 
not finished working through and understanding it..

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432609#comment-16432609
 ] 

Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 5:20 PM:
-

Got it now Uwe, thanks! Just saw the assertion line, totally missed it.


was (Author: kszucs):
Got it now Uwe, thanks!

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432609#comment-16432609
 ] 

Krisztian Szucs commented on ARROW-2367:


Got it now Uwe, thanks!

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432608#comment-16432608
 ] 

ASF GitHub Bot commented on ARROW-2411:
---

pitrou commented on issue #1852: ARROW-2411: [C++] Add 
StringBuilder::Append(const char **values)
URL: https://github.com/apache/arrow/pull/1852#issuecomment-380179127
 
 
   Thanks @kou !


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Add method to append batches of null-terminated strings to StringBuilder
> --
>
> Key: ARROW-2411
> URL: https://issues.apache.org/jira/browse/ARROW-2411
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GLib
>Reporter: Uwe L. Korn
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> We should add a method {{StringBuilder::AppendCStrings(const char** values, 
> const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to 
> have fast inserts for these strings. See 
> https://github.com/apache/arrow/pull/1845/files for a use case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder

2018-04-10 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2411:
-

Assignee: Kouhei Sutou

> [C++] Add method to append batches of null-terminated strings to StringBuilder
> --
>
> Key: ARROW-2411
> URL: https://issues.apache.org/jira/browse/ARROW-2411
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GLib
>Reporter: Uwe L. Korn
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> We should add a method {{StringBuilder::AppendCStrings(const char** values, 
> const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to 
> have fast inserts for these strings. See 
> https://github.com/apache/arrow/pull/1845/files for a use case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder

2018-04-10 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-2411.
---
Resolution: Fixed

Issue resolved by pull request 1852
[https://github.com/apache/arrow/pull/1852]

> [C++] Add method to append batches of null-terminated strings to StringBuilder
> --
>
> Key: ARROW-2411
> URL: https://issues.apache.org/jira/browse/ARROW-2411
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, GLib
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> We should add a method {{StringBuilder::AppendCStrings(const char** values, 
> const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to 
> have fast inserts for these strings. See 
> https://github.com/apache/arrow/pull/1845/files for a use case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432605#comment-16432605
 ] 

ASF GitHub Bot commented on ARROW-2411:
---

pitrou closed pull request #1852: ARROW-2411: [C++] Add 
StringBuilder::Append(const char **values)
URL: https://github.com/apache/arrow/pull/1852
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index fb1bebfca..2ae9f48db 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -1022,6 +1022,72 @@ TEST_F(TestStringBuilder, TestAppendVector) {
   }
 }
 
+TEST_F(TestStringBuilder, TestAppendCStringsWithValidBytes) {
+  const char* strings[] = {nullptr, "aaa", nullptr, "ignored", ""};
+  vector valid_bytes = {1, 1, 1, 0, 1};
+
+  int N = static_cast(sizeof(strings) / sizeof(strings[0]));
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+ASSERT_OK(builder_->Append(strings, N, valid_bytes.data()));
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps * 3, result_->null_count());
+  ASSERT_EQ(reps * 3, result_->value_data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+auto string = strings[i % N];
+if (string && valid_bytes[i % N]) {
+  ASSERT_FALSE(result_->IsNull(i));
+  result_->GetValue(i, &length);
+  ASSERT_EQ(pos, result_->value_offset(i));
+  ASSERT_EQ(static_cast(strlen(string)), length);
+  ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+  pos += length;
+} else {
+  ASSERT_TRUE(result_->IsNull(i));
+}
+  }
+}
+
+TEST_F(TestStringBuilder, TestAppendCStringsWithoutValidBytes) {
+  const char* strings[] = {"", "bb", "a", nullptr, "ccc"};
+
+  int N = static_cast(sizeof(strings) / sizeof(strings[0]));
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+ASSERT_OK(builder_->Append(strings, N));
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps, result_->null_count());
+  ASSERT_EQ(reps * 6, result_->value_data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+if (strings[i % N]) {
+  ASSERT_FALSE(result_->IsNull(i));
+  result_->GetValue(i, &length);
+  ASSERT_EQ(pos, result_->value_offset(i));
+  ASSERT_EQ(static_cast(strlen(strings[i % N])), length);
+  ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+  pos += length;
+} else {
+  ASSERT_TRUE(result_->IsNull(i));
+}
+  }
+}
+
 TEST_F(TestStringBuilder, TestZeroLength) {
   // All buffers are null
   Done();
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index c97253e64..78c42f4fc 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -1413,6 +1413,64 @@ Status StringBuilder::Append(const 
std::vector& values,
   return Status::OK();
 }
 
+Status StringBuilder::Append(const char** values, int64_t length,
+ const uint8_t* valid_bytes) {
+  std::size_t total_length = 0;
+  std::vector value_lengths(length);
+  bool have_null_value = false;
+  for (int64_t i = 0; i < length; ++i) {
+if (values[i]) {
+  auto value_length = strlen(values[i]);
+  value_lengths[i] = value_length;
+  total_length += value_length;
+} else {
+  have_null_value = true;
+}
+  }
+  RETURN_NOT_OK(Reserve(length));
+  RETURN_NOT_OK(value_data_builder_.Reserve(total_length));
+  RETURN_NOT_OK(offsets_builder_.Reserve(length));
+
+  if (valid_bytes) {
+int64_t valid_bytes_offset = 0;
+for (int64_t i = 0; i < length; ++i) {
+  RETURN_NOT_OK(AppendNextOffset());
+  if (valid_bytes[i]) {
+if (values[i]) {
+  RETURN_NOT_OK(value_data_builder_.Append(
+  reinterpret_cast(values[i]), value_lengths[i]));
+} else {
+  UnsafeAppendToBitmap(valid_bytes + valid_bytes_offset, i - 
valid_bytes_offset);
+  UnsafeAppendToBitmap(false);
+  valid_bytes_offset = i + 1;
+}
+  }
+}
+UnsafeAppendToBitmap(valid_bytes + valid_bytes_offset, length - 
valid_bytes_offset);
+  } else {
+if (have_null_value) {
+  std::vector valid_vector(length, 0);
+  for (int64_t i = 0; i < length; ++i) {
+RETURN_NOT_OK(AppendNextOffset());
+if (values[i]) {
+  RETURN_NOT_OK(value_data_builder_.Append(
+  reinterpret_cast(values[i]), value_lengths[i]));
+  valid_vector[i] = 1;
+}
+  }
+  UnsafeAppendToBitmap(valid_vector.data(), length);
+} else {
+  for (int64_t i = 0; i < length; ++i) {
+RETURN_NOT_OK(AppendNextOffset());
+RETUR

[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432606#comment-16432606
 ] 

Uwe L. Korn commented on ARROW-2367:


The expected output would be that the resulting {{pyarrow.Table}} has a 
{{ChunkedArray}} instance for the respective column that consists of multiple 
chunks. Currently we only generate a single {{pa.Array}} and wrap that simply 
in a {{pa.ChunkedArray}}. Thus we are limited to what a single {{pa.Array}} can 
hold.

This should only happen on {{pa.Table.from_pandas()}} though. When a user calls 
{{pa.array()}} the current error is expected behaviour. We could add a second 
method {{pa.chunked_array()}} that would return a {{ChunkedArray}} with more 
than one chunk if necessary.

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-04-10 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432580#comment-16432580
 ] 

Uwe L. Korn commented on ARROW-1171:


> I'm assuming the issue is that the libstdc++ Arrow was (statically or not) 
> linked with also gets used at runtime to resolve symbols in turbodbc (or the 
> reverse)? Am I right?

Yes. There are some symbols that are in the C++ STL that are always statically 
linked. We provide a version script and don't export our symbols but sadly 
other libraries export theirs and this then leads to a crash in Arrow.

Using {{-Bsymbolic}} (and friends!) would be the path that would ensure that 
Arrow actually uses the symbols that were packaged with it. Sadly this has some 
nasty side effects. When I have some more time, I try to dive into this issue 
again and read also the linked articles. Currently this problem here is no a 
high priority for me.

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432556#comment-16432556
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380165985
 
 
   Hi Uwe,
   
   I see what you mean. Now that I've looked more carefully into 
`BufferSlices`, I see how much danger they can bear. Something like 
[this](https://gist.github.com/alendit/f759bc12e03d9dd9d72cac90a6334cc5) would 
cause a read-after-free. 
   
   The problem, as I see it, is, that the SliceBuffer user has no way to ensure 
its validity. So even if we skip this PR, the problems with slices might happen 
in the future. I think some lean solution, like adding a `shared_ptr 
parent` to the slice and referencing its data instead, will increase the memory 
safety and might benefit this PR, too.
   
   Do you think we should discuss it on the mailing list?
   
   Cheers,
   Dimitri.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432485#comment-16432485
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool 
abstraction.
URL: https://github.com/apache/arrow/pull/1875#issuecomment-380141516
 
 
   This is similar to what we had in iron-arrow before the contribution to the 
main arrow repo but it seemed that @andygrove went a different direction 
initially.
   
   I just added windows support last night to memory.rs and friends.  Do we 
plan to isolate libc calls to this new memory pool (replacing the memory.rs 
approach)?
   
   Just wondering what the direction is as the above memory pool approach does 
not work for windows.
   
   I can add windows support for memory pool also once it's merged.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add memory pool abstraction as the c++ api.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432484#comment-16432484
 ] 

Antoine Pitrou commented on ARROW-1171:
---

I'm assuming the issue is that the libstdc++ Arrow was (statically or not) 
linked with also gets used at runtime to resolve symbols in turbodbc (or the 
reverse)? Am I right?

Apparently another solution would be to use something called a "version script" 
to control which symbols are exported from the Arrow library, so that the 
static libstdc++ doesn't participate in dynamic symbol resolution:

[https://www.technovelty.org/c/what-exactly-does-bsymblic-do.html]

[http://anadoxin.org/blog/control-over-symbol-exports-in-gcc.html]

(I may be speaking cluelessly here)

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2434) [Rust] Add windows support

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432481#comment-16432481
 ] 

ASF GitHub Bot commented on ARROW-2434:
---

paddyhoran commented on issue #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873#issuecomment-380146676
 
 
   Ah I see.  Thanks @xhochy, sorry for the silly question...


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add windows support
> --
>
> Key: ARROW-2434
> URL: https://issues.apache.org/jira/browse/ARROW-2434
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Currently `cargo test` fails on windows OS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2444) Better handle reading empty parquet files

2018-04-10 Thread Jim Crist (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Crist updated ARROW-2444:
-
Description: 
>From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]

 

Currently pyarrow reads empty parts as float64, even if the underlying columns 
have other dtypes. This can cause problems for pandas downstream, as certain 
operations are only valid on certain dtypes, even if the columns are empty.

 

Copying the comment Uwe over:

 
bq. {quote}This is the expected behaviour as an empty string column in Pandas 
is simply an empty column of type object. Sadly object does not tell us much 
about the type of the column at all. We return numpy.float64 in this case as 
it's the most efficient type to store nulls in Pandas.{quote}

{quote}This seems unintuitive at best to me. An empty object column in pandas 
is treated differently in many operations than an empty float64 column (str 
accessor is available, excluded from numeric operations, etc..). Having an 
empty file read in as a different dtype than was written could lead to errors 
in processing code downstream. Would arrow be willing to change this 
behavior?{quote}

We should probably use another method than `field.type.to_pandas_dtype()` in 
this case. The column saved in Parquet should be saved with `NA` as type which 
sadly does not provide enough information. 

We also store the original dtype in the Pandas metadata that is used for the 
actual DataFrame reconstruction later on. If we would also pick up the metadata 
when it was written, we should be able to correctly reconstruct the dtype.

  was:
>From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]

 

Currently pyarrow reads empty parts as float64, even if the underlying columns 
have other dtypes. This can cause problems for pandas downstream, as certain 
operations are only valid on certain dtypes, even if the columns are empty.

 

Copying the comment Uwe over:

 
bq. {quote}This is the expected behaviour as an empty string column in Pandas 
is simply an empty column of type object. Sadly object does not tell us much 
about the type of the column at all. We return numpy.float64 in this case as 
it's the most efficient type to store nulls in Pandas.{quote}

{quote}This seems unintuitive at best to me. An empty object column in pandas 
is treated differently in many operations than an empty float64 column (str 
accessor is available, excluded from numeric operations, etc..). Having an 
empty file read in as a different dtype than was written could lead to errors 
in processing code downstream. Would arrow be willing to change this 
behavior?{quote}

We should probably use another method than `field.type.to_pandas_dtype()` in 
this case. The column saved in Parquet should be saved with `NA` as type which 
sadly does not provide enough information. 

We also store the original dtype in the Pandas metadata that is used for the 
actual DataFrame reconstruction later on. If we would also pick up the metadata 
when it was written, we should be able to correctly reconstruct the dtype.
{quote}


> Better handle reading empty parquet files
> -
>
> Key: ARROW-2444
> URL: https://issues.apache.org/jira/browse/ARROW-2444
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jim Crist
>Priority: Major
>
> From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]
>  
> Currently pyarrow reads empty parts as float64, even if the underlying 
> columns have other dtypes. This can cause problems for pandas downstream, as 
> certain operations are only valid on certain dtypes, even if the columns are 
> empty.
>  
> Copying the comment Uwe over:
>  
> bq. {quote}This is the expected behaviour as an empty string column in Pandas 
> is simply an empty column of type object. Sadly object does not tell us much 
> about the type of the column at all. We return numpy.float64 in this case as 
> it's the most efficient type to store nulls in Pandas.{quote}
> {quote}This seems unintuitive at best to me. An empty object column in pandas 
> is treated differently in many operations than an empty float64 column (str 
> accessor is available, excluded from numeric operations, etc..). Having an 
> empty file read in as a different dtype than was written could lead to errors 
> in processing code downstream. Would arrow be willing to change this 
> behavior?{quote}
> We should probably use another method than `field.type.to_pandas_dtype()` in 
> this case. The column saved in Parquet should be saved with `NA` as type 
> which sadly does not provide enough information. 
> We also store the original dtype in the Pandas metadata that is used for the 
> actual DataFrame reconstruction later on. If we would also pick up the 
> metadata when it was written, we should be able t

[jira] [Created] (ARROW-2444) Better handle reading empty parquet files

2018-04-10 Thread Jim Crist (JIRA)
Jim Crist created ARROW-2444:


 Summary: Better handle reading empty parquet files
 Key: ARROW-2444
 URL: https://issues.apache.org/jira/browse/ARROW-2444
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jim Crist


>From [https://github.com/dask/dask/pull/3387#issuecomment-380140003]

 

Currently pyarrow reads empty parts as float64, even if the underlying columns 
have other dtypes. This can cause problems for pandas downstream, as certain 
operations are only valid on certain dtypes, even if the columns are empty.

 

Copying the comment Uwe over:

 
bq. {quote}This is the expected behaviour as an empty string column in Pandas 
is simply an empty column of type object. Sadly object does not tell us much 
about the type of the column at all. We return numpy.float64 in this case as 
it's the most efficient type to store nulls in Pandas.{quote}

{quote}This seems unintuitive at best to me. An empty object column in pandas 
is treated differently in many operations than an empty float64 column (str 
accessor is available, excluded from numeric operations, etc..). Having an 
empty file read in as a different dtype than was written could lead to errors 
in processing code downstream. Would arrow be willing to change this 
behavior?{quote}

We should probably use another method than `field.type.to_pandas_dtype()` in 
this case. The column saved in Parquet should be saved with `NA` as type which 
sadly does not provide enough information. 

We also store the original dtype in the Pandas metadata that is used for the 
actual DataFrame reconstruction later on. If we would also pick up the metadata 
when it was written, we should be able to correctly reconstruct the dtype.
{quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461
 ] 

Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:29 PM:
-

What's the expected output? It should be able to hold `2^31 - 1` bytes?

According to my investigation 
https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44
 limits to `((2^31) - 1) - 1`


was (Author: kszucs):
What's the expected output? It should be able to hold `2^31 - 1` bytes?

According to my investigation 
https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44
 limits to `(2^31) - 1`

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461
 ] 

Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:29 PM:
-

What's the expected output? It should be able to hold `2^31 - 1` bytes?

According to my investigation the limit is set to `((2^31) - 1) - 1`.

https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44


was (Author: kszucs):
What's the expected output? It should be able to hold `2^31 - 1` bytes?

According to my investigation 
https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44
 limits to `((2^31) - 1) - 1`

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461
 ] 

Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:28 PM:
-

What's the expected output? It should be able to hold `2^31 - 1` bytes?

According to my investigation 
https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44
 limits to `(2^31) - 1`


was (Author: kszucs):
What's the expected output?

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity

2018-04-10 Thread Krisztian Szucs (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461
 ] 

Krisztian Szucs commented on ARROW-2367:


What's the expected output?

> ListArray has trouble with sizes greater than kMaximumCapacity
> --
>
> Key: ARROW-2367
> URL: https://issues.apache.org/jira/browse/ARROW-2367
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Bryant Menn
>Priority: Major
>
> When creating a Pandas dataframe with lists as elements as a column the 
> following error occurs when converting to a {{pyarrow.Table}} object.
> {code}
> Traceback (most recent call last):
> File "arrow-2227.py", line 16, in 
> arr = pa.array(df['strings'], from_pandas=True)
> File "array.pxi", line 177, in pyarrow.lib.array
> File "error.pxi", line 77, in pyarrow.lib.check_status
> File "error.pxi", line 77, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 
> bytes, have 2147483647
> {code}
> The following code was used to generate the error (adapted from ARROW-2227):
> {code}
> import pandas as pd
> import pyarrow as pa
> # Commented lines were used to test non-binary data types, both cause the 
> same error
> v1 = b'x' * 1
> v2 = b'x' * 147483646
> # v1 = 'x' * 1
> # v2 = 'x' * 147483646
> df = pd.DataFrame({
>  'strings': [[v1]] * 20 + [[v2]] + [[b'x']]
>  # 'strings': [[v1]] * 20 + [[v2]] + [['x']]
> })
> arr = pa.array(df['strings'], from_pandas=True)
> assert isinstance(arr, pa.ChunkedArray), type(arr)
> {code}
> Code was run using Python 3.6 with PyArrow installed from conda-forge on 
> macOS High Sierra.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432458#comment-16432458
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool 
abstraction.
URL: https://github.com/apache/arrow/pull/1875#issuecomment-380141516
 
 
   This is similar to what we had in iron-arrow before the contribution to the 
main arrow repo but it seemed that @andygrove went a different direction 
initially.
   
   I just added windows support last night to memory.rs and friends.  Do we 
plan to isolate libc calls to this new memory pool (replacing the memory.rs 
approach)?
   
   Just wondering what the direction is as the above memory pool approach does 
not work from windows.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
>
> Add memory pool abstraction as the c++ api.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-04-10 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432457#comment-16432457
 ] 

Uwe L. Korn commented on ARROW-1171:


It will not end the issues but help us to move to a more modern compiler for 
the {{manylinux1}} packages. We will still get issues when the wrong symbols 
are pulled in, probably for a lot less symbols than currently.

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-04-10 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432452#comment-16432452
 ] 

Antoine Pitrou commented on ARROW-1171:
---

As a sidenote, a new manylinux standard is being developed with a more recent 
baseline (CentOS 6) :
https://www.python.org/dev/peps/pep-0571/

The PEP is still being discussed but chances are it will be accepted soon:
https://mail.python.org/pipermail/distutils-sig/2018-March/032102.html

Perhaps it will end the libstdc++ issues (or perhaps not :-)).

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2434) [Rust] Add windows support

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432453#comment-16432453
 ] 

ASF GitHub Bot commented on ARROW-2434:
---

xhochy commented on issue #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873#issuecomment-380140599
 
 
   @paddyhoran because it lists only the top 100 contributors (not sure how the 
contributors with only one contribution are sorted). You will be definitely 
listed on the page once you make a second contribution.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add windows support
> --
>
> Key: ARROW-2434
> URL: https://issues.apache.org/jira/browse/ARROW-2434
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Currently `cargo test` fails on windows OS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2434) [Rust] Add windows support

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432446#comment-16432446
 ] 

ASF GitHub Bot commented on ARROW-2434:
---

paddyhoran commented on issue #1873: ARROW-2434: [Rust] Add windows support
URL: https://github.com/apache/arrow/pull/1873#issuecomment-380138524
 
 
   @xhochy do you know why I am not listed on the 
[contributors](https://github.com/apache/arrow/graphs/contributors) page after 
this getting merged?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Rust] Add windows support
> --
>
> Key: ARROW-2434
> URL: https://issues.apache.org/jira/browse/ARROW-2434
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Currently `cargo test` fails on windows OS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432390#comment-16432390
 ] 

ASF GitHub Bot commented on ARROW-2330:
---

xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer 
creation with partially finishable array builders
URL: https://github.com/apache/arrow/pull/1769#issuecomment-380127439
 
 
   Hello Dimitri,
   
   I forgot to reply to you. This got drowned in all the other Arrow activity 
in the last days. Sorry about that.
   
   > Do you know why the RecordBatchBuilder has a partial Flush method? Is it 
because its instantion is more cumbersome compared to array builders?
   
   The `RecordBatchBuilder` is different to the other builders as it does not 
generate a Arrays but should be used for a more streaming-like output. In it, 
all builders are resetted on Flush. This is also the behaviour I would expect 
in the dictionary builder.

   > That being said, I don't think that it adds that much complexity, compared 
to the previous implementation. Most of the changes are in the testing code, 
and builder files themselves have around 30 additional LOC. At the same time, 
it makes the DictionaryBuilder quite straight forward.
   
   The added complexity is the thing that scares me most at the moment. The 
Builders are built with the intention that they complety own their buffers and 
can modify them as they like internally. You can ask them for snapshot access 
to elements to inspect them but we don't provide any guarantees that the memory 
addresses that were returned from them are still valid once you can a method on 
the builder again.
   
   As far as I understand the current implementation in some cases we return 
the current buffer pointer as new `Buffer` object in the BufferBuilder. Once we 
do expand it though later on, we may free this memory address and thus the 
`Buffer` object point to unreachable memory. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Optimize delta buffer creation with partially finishable array builders
> -
>
> Key: ARROW-2330
> URL: https://issues.apache.org/jira/browse/ARROW-2330
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The main aim of this change is to optimize the building of delta 
> dictionaries. In the current version delta dictionaries are built using an 
> additional "overflow" buffer which leads to complicated and potentially 
> error-prone code and subpar performance by doubling the number of lookups.
> I solve this problem by introducing the notion of partially finishable array 
> builders, i.e. builder which are able to retain the state on calling Finish. 
> The interface is based on RecordBatchBuilder::Flush, i.e. Finish is 
> overloaded with additional signature Finish(bool reset_builder, 
> std::shared_ptr* out). The resulting Arrays point to the same data 
> buffer with different offsets.
> I'm aware that the change is kind of biggish, but I'd like to discuss it 
> here. The solution makes the code more straight forward, doesn't bloat the 
> code base too much and leaves the API more or less untouched. Additionally, 
> the new way to make delta dictionaries by using a different call signature to 
> Finish feel cleaner to me.
> I'm looking forward to your critic and improvement ideas.
> The pull request is available at: https://github.com/apache/arrow/pull/1769



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2443) [Python] Conversion from pandas of empty categorical fails with ArrowInvalid

2018-04-10 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2443:
-

 Summary: [Python] Conversion from pandas of empty categorical 
fails with ArrowInvalid
 Key: ARROW-2443
 URL: https://issues.apache.org/jira/browse/ARROW-2443
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Florian Jetter


The conversion of an empty pandas categorical raises an exception. Before 
version `0.9.0` this was possible
{code:java}
import pandas as pd
import pyarrow as pa
pa.Table.from_pandas(pd.DataFrame({'cat': pd.Categorical([])})){code}
raises:

{{ArrowInvalid: Dictionary indices must have non-zero length}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2222) [C++] Add option to validate Flatbuffers messages

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432283#comment-16432283
 ] 

ASF GitHub Bot commented on ARROW-:
---

crepererum commented on issue #1763: ARROW-: handle untrusted inputs (POC)
URL: https://github.com/apache/arrow/pull/1763#issuecomment-380104686
 
 
   I would say the performance impact is negligible.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Add option to validate Flatbuffers messages
> -
>
> Key: ARROW-
> URL: https://issues.apache.org/jira/browse/ARROW-
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Marco Neumann
>Priority: Major
>  Labels: pull-request-available
>
> This is follow up work to ARROW-1589, ARROW-2023, and can be validated by the 
> {{ipc-fuzzer-test}}. Users receiving untrusted input streams can prevent 
> segfaults this way
> As part of this, we should quantify the overhead associated with message 
> validation in regular use



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2222) [C++] Add option to validate Flatbuffers messages

2018-04-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432282#comment-16432282
 ] 

ASF GitHub Bot commented on ARROW-:
---

crepererum commented on issue #1763: ARROW-: handle untrusted inputs (POC)
URL: https://github.com/apache/arrow/pull/1763#issuecomment-380103831
 
 
   ## Environment
   
   ### Build Commands
   
   ```
   env CC=clang CXX=clang++ cmake -DCMAKE_BUILD_TYPE=Release 
-DARROW_BUILD_BENCHMARKS=ON -GNinja ..
   env CC=clang CXX=clang++ ninja
   ```
   
   ### Conda
   ```
   # NameVersion   Build  Channel
   boost 1.66.0   py36_1conda-forge
   boost-cpp 1.66.01conda-forge
   bzip2 1.0.6 1conda-forge
   ca-certificates   2018.1.18 0conda-forge
   certifi   2018.1.18py36_0conda-forge
   clangdev  6.0.0 default_0conda-forge
   cmake 3.11.00conda-forge
   curl  7.59.01conda-forge
   expat 2.2.5 0conda-forge
   icu   58.2  0conda-forge
   intel-openmp  2018.0.0  8
   krb5  1.14.60conda-forge
   libcxx6.0.0 0conda-forge
   libgfortran   3.0.1h93005f0_2
   libiconv  1.15  0conda-forge
   libssh2   1.8.0 2conda-forge
   libuv 1.20.00conda-forge
   libxml2   2.9.8 0conda-forge
   llvm-meta 6.0.0 0conda-forge
   llvmdev   6.0.0 default_0conda-forge
   mkl   2018.0.2  1
   mkl_fft   1.0.1py36_1conda-forge
   mkl_random1.0.1py36_0conda-forge
   ncurses   5.9  10conda-forge
   ninja 1.8.2 1conda-forge
   numpy 1.14.2   py36ha9ae307_1
   openssl   1.0.2n0conda-forge
   pip   9.0.3py36_0conda-forge
   python3.6.5 1conda-forge
   readline  7.0   0conda-forge
   rhash 1.3.4 0conda-forge
   setuptools39.0.1   py36_0conda-forge
   sqlite3.20.12conda-forge
   tk8.6.7 0conda-forge
   wheel 0.31.0   py36_0conda-forge
   xz5.2.3 0conda-forge
   zlib  1.2.110conda-forge
   ```
   
   ## Test Results
   
   ### complete PR
   ```
   Run on (4 X 2700 MHz CPU s)
   2018-04-10 15:33:15
   Benchmark  Time   
CPU Iterations
   

   BM_WriteRecordBatch/1/min_time:1.000/real_time 76462 ns  76204 
ns  16080   12.7719GB/s
   BM_WriteRecordBatch/4/min_time:1.000/real_time 63077 ns  62791 
ns  2122915.482GB/s
   BM_WriteRecordBatch/16/min_time:1.000/real_time56850 ns  56584 
ns  24134   17.1778GB/s
   BM_WriteRecordBatch/64/min_time:1.000/real_time58209 ns  58003 
ns  22891   16.7768GB/s
   BM_WriteRecordBatch/256/min_time:1.000/real_time  105328 ns 104930 
ns  12450   9.27162GB/s
   BM_WriteRecordBatch/1024/min_time:1.000/real_time 247453 ns 245516 
ns   5383   3.94645GB/s
   BM_WriteRecordBatch/4k/min_time:1.000/real_time   904931 ns 900426 
ns   1384   1105.06MB/s
   BM_WriteRecordBatch/8k/min_time:1.000/real_time  1869873 ns1861388 
ns704   534.796MB/s
   BM_ReadRecordBatch/1/min_time:1.000/real_time   2409 ns   2402 
ns 584482   405.366GB/s
   BM_ReadRecordBatch/4/min_time:1.000/real_time   4919 ns   4888 
ns 284403   198.544GB/s
   BM_ReadRecordBatch/16/min_time:1.000/real_time 14176 ns  14123 
ns  73649   68.8866GB/s
   BM_ReadRecordBatch/64/min_time:1.000/real_time 51842 ns

  1   2   >