[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433402#comment-16433402 ] ASF GitHub Bot commented on ARROW-2445: --- xhochy commented on a change in pull request #1881: ARROW-2445: [Rust] Add documentation and make some fields private URL: https://github.com/apache/arrow/pull/1881#discussion_r180635435 ## File path: rust/src/array.rs ## @@ -86,13 +90,25 @@ impl Array { } } +/// Get a reference to the array data pub fn data(&self) -> &ArrayData { &self.data } +/// number of elements in the array pub fn len(&self) -> usize { self.len as usize } + +/// number of null elements in the array +pub fn null_count(&self) -> usize { +self.null_count as usize +} + +/// If null_count is greater than zero then the validity_bitmap will be Some(Bitmap) +pub fn validity_bitmap(&self) -> &Option { Review comment: We have seen some cases in C++ that it is beneficial in a lot of cases to work directly on the bitmap as the compiler can drastically speedup access to it using SIMD instructions. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()
[ https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433351#comment-16433351 ] ASF GitHub Bot commented on ARROW-2407: --- kou closed pull request #1845: ARROW-2407: [GLib] Add garrow_string_array_builder_append_values() URL: https://github.com/apache/arrow/pull/1845 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/c_glib/arrow-glib/array-builder.cpp b/c_glib/arrow-glib/array-builder.cpp index 7625bcd51..f12999881 100644 --- a/c_glib/arrow-glib/array-builder.cpp +++ b/c_glib/arrow-glib/array-builder.cpp @@ -443,7 +443,7 @@ garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, * @is_valids_length: The length of `is_valids`. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple values at once. It's efficient than multiple + * Append multiple values at once. It's more efficient than multiple * `append()` and `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -495,7 +495,7 @@ garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder, * @n: The number of null values to be appended. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple nulls at once. It's efficient than multiple + * Append multiple nulls at once. It's more efficient than multiple * `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -580,7 +580,7 @@ garrow_int_array_builder_append(GArrowIntArrayBuilder *builder, * @is_valids_length: The length of `is_valids`. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple values at once. It's efficient than multiple + * Append multiple values at once. It's more efficient than multiple * `append()` and `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -630,7 +630,7 @@ garrow_int_array_builder_append_null(GArrowIntArrayBuilder *builder, * @n: The number of null values to be appended. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple nulls at once. It's efficient than multiple + * Append multiple nulls at once. It's more efficient than multiple * `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -715,7 +715,7 @@ garrow_uint_array_builder_append(GArrowUIntArrayBuilder *builder, * @is_valids_length: The length of `is_valids`. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple values at once. It's efficient than multiple + * Append multiple values at once. It's more efficient than multiple * `append()` and `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -765,7 +765,7 @@ garrow_uint_array_builder_append_null(GArrowUIntArrayBuilder *builder, * @n: The number of null values to be appended. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple nulls at once. It's efficient than multiple + * Append multiple nulls at once. It's more efficient than multiple * `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -845,7 +845,7 @@ garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, * @is_valids_length: The length of `is_valids`. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple values at once. It's efficient than multiple + * Append multiple values at once. It's more efficient than multiple * `append()` and `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -893,7 +893,7 @@ garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, * @n: The number of null values to be appended. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple nulls at once. It's efficient than multiple + * Append multiple nulls at once. It's more efficient than multiple * `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -973,7 +973,7 @@ garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, * @is_valids_length: The length of `is_valids`. * @error: (nullable): Return location for a #GError or %NULL. * - * Append multiple values at once. It's efficient than multiple + * Append multiple values at once. It's more efficient than multiple * `append()` and `append_null()` calls. * * Returns: %TRUE on success, %FALSE if there was an error. @@ -1021,7 +1021,7 @@ garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, * @n: The number of null values
[jira] [Resolved] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()
[ https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-2407. - Resolution: Fixed Issue resolved by pull request 1845 [https://github.com/apache/arrow/pull/1845] > [GLib] Add garrow_string_array_builder_append_values() > -- > > Key: ARROW-2407 > URL: https://issues.apache.org/jira/browse/ARROW-2407 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Affects Versions: 0.9.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1642#comment-1642 ] ASF GitHub Bot commented on ARROW-2445: --- liurenjie1024 commented on a change in pull request #1881: ARROW-2445: [Rust] Add documentation and make some fields private URL: https://github.com/apache/arrow/pull/1881#discussion_r180619565 ## File path: rust/src/array.rs ## @@ -69,10 +69,14 @@ arraydata_from_primitive!(u32, UInt32); arraydata_from_primitive!(u64, UInt64); pub struct Array { -pub len: i32, -pub null_count: i32, -pub validity_bitmap: Option, -pub data: ArrayData, +/// number of elements in the array +len: i32, +/// number of null elements in the array +null_count: i32, +/// If null_count is greater than zero then the validity_bitmap will be Some(Bitmap) +validity_bitmap: Option, +/// The array of elements +data: ArrayData, Review comment: Add a field for data type? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1641#comment-1641 ] ASF GitHub Bot commented on ARROW-2445: --- liurenjie1024 commented on a change in pull request #1881: ARROW-2445: [Rust] Add documentation and make some fields private URL: https://github.com/apache/arrow/pull/1881#discussion_r180619803 ## File path: rust/src/array.rs ## @@ -86,13 +90,25 @@ impl Array { } } +/// Get a reference to the array data pub fn data(&self) -> &ArrayData { &self.data } +/// number of elements in the array pub fn len(&self) -> usize { self.len as usize } + +/// number of null elements in the array +pub fn null_count(&self) -> usize { +self.null_count as usize +} + +/// If null_count is greater than zero then the validity_bitmap will be Some(Bitmap) +pub fn validity_bitmap(&self) -> &Option { Review comment: I don't think it's a good idea to expose validity bitmap directly to user. I think we should just add methods to test whether a position is set. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-2445: -- Summary: [Rust] Add documentation and make some fields private (was: Add documentation and make some fields private) > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433305#comment-16433305 ] ASF GitHub Bot commented on ARROW-2445: --- andygrove opened a new pull request #1881: ARROW-2445: [Rust] Add documentation and make some fields private URL: https://github.com/apache/arrow/pull/1881 No functional changes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433306#comment-16433306 ] Andy Grove commented on ARROW-2445: --- PR: https://github.com/apache/arrow/pull/1881 > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2445) [Rust] Add documentation and make some fields private
[ https://issues.apache.org/jira/browse/ARROW-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2445: -- Labels: pull-request-available (was: ) > [Rust] Add documentation and make some fields private > - > > Key: ARROW-2445 > URL: https://issues.apache.org/jira/browse/ARROW-2445 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.10.0 > > > A first pass at adding rustdoc comments and made some struct fields private > and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2445) Add documentation and make some fields private
Andy Grove created ARROW-2445: - Summary: Add documentation and make some fields private Key: ARROW-2445 URL: https://issues.apache.org/jira/browse/ARROW-2445 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Andy Grove Assignee: Andy Grove Fix For: 0.10.0 A first pass at adding rustdoc comments and made some struct fields private and added accessor methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.
[ https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433294#comment-16433294 ] ASF GitHub Bot commented on ARROW-2435: --- liurenjie1024 commented on a change in pull request #1875: ARROW-2435: [Rust] Add memory pool abstraction. URL: https://github.com/apache/arrow/pull/1875#discussion_r180616961 ## File path: rust/src/memory_pool.rs ## @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use libc; +use std::mem; +use std::cmp; + +use super::error::ArrowError; +use super::error::Result; + +const ALIGNMENT: usize = 64; + +/// Memory pool for allocating memory. It's also responsible for tracking memory usage. +pub trait MemoryPool { +/// Allocate memory. +/// The implementation should ensures that allocated memory is aligned. +fn allocate(&self, size: usize) -> Result<*mut u8>; + +/// Reallocate memory. +/// If the implementation doesn't support reallocating aligned memory, it allocates new memory +/// and copied old memory to it. +fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8>; + +/// Free memory. +fn free(&self, ptr: *mut u8); +} + +/// Implementation of memory pool using lib api. +#[allow(dead_code)] +struct LibcMemoryPool; + +impl MemoryPool for LibcMemoryPool { +fn allocate(&self, size: usize) -> Result<*mut u8> { +unsafe { +let mut page: *mut libc::c_void = mem::uninitialized(); +let result = libc::posix_memalign(&mut page, ALIGNMENT, size); +match result { +0 => Ok(mem::transmute::<*mut libc::c_void, *mut u8>(page)), +_ => Err(ArrowError::MemoryError( +"Failed to allocate memory".to_string(), +)), +} +} +} + +fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8> { +unsafe { +let result = self.allocate(new_size)?; +let dst = mem::transmute::<*mut u8, *mut libc::c_void>(result); +libc::memcpy( +dst, +mem::transmute::<*mut u8, *const libc::c_void>(pointer), +cmp::min(old_size, new_size), +); Review comment: Thanks for the review, and I've fixed it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add memory pool abstraction. > --- > > Key: ARROW-2435 > URL: https://issues.apache.org/jira/browse/ARROW-2435 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.9.0 >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Add memory pool abstraction as the c++ api. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2407) [GLib] Add garrow_string_array_builder_append_values()
[ https://issues.apache.org/jira/browse/ARROW-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433217#comment-16433217 ] ASF GitHub Bot commented on ARROW-2407: --- kou commented on issue #1845: ARROW-2407: [GLib] Add garrow_string_array_builder_append_values() URL: https://github.com/apache/arrow/pull/1845#issuecomment-380286802 OK. I've rebased and changed to use the new `Append()` method. If CI is passed, I'll merge this. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [GLib] Add garrow_string_array_builder_append_values() > -- > > Key: ARROW-2407 > URL: https://issues.apache.org/jira/browse/ARROW-2407 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Affects Versions: 0.9.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433202#comment-16433202 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180601792 ## File path: cpp/src/arrow/python/decimal.cc ## @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc } Status DecimalMetadata::Update(PyObject* object) { - DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; + bool is_decimal = PyDecimal_Check(object); Review comment: So you mean remove `PyDecimal_Check` all together? This is only called when the type is not specified by the user and then yes, it will end doing 2 passes over the objects and checks both times if they are decimal. It might be possible to do less checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433206#comment-16433206 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180602058 ## File path: cpp/src/arrow/python/decimal.cc ## @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc } Status DecimalMetadata::Update(PyObject* object) { - DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; + bool is_decimal = PyDecimal_Check(object); Review comment: Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433182#comment-16433182 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380279472 > I would think it's okay to remove them if they weren't there before the PR that introduced boost-regex. It seems the boost-regex references in the Python build files were introduced specifically for parquet: see cb5da9c1 and 8b1c8118. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433180#comment-16433180 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278833 Oh, you're right, I had misread the regex (it has `[-+]?` for the exponent part). The behaviour is unchanged then. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433178#comment-16433178 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278645 > By the way, I don't know if it's ok to remove boost-regex references from the Python build files, since parquet still requires boost-regex AFAIR. @xhochy Can you comment on this? I would think it's okay to remove them if they weren't there before the PR that introduced boost-regex. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433175#comment-16433175 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278461 I think the current version supports that, does it not? I remember adding that. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433176#comment-16433176 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278524 I may be misunderstand the usage of "should" here :) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433174#comment-16433174 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380278313 Another note: the new implementation should support exponents without a sign, e.g. "1.23e45". The Python decimal module also does this. I need to add tests, though. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433170#comment-16433170 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on issue #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#issuecomment-380277704 By the way, I don't know if it's ok to remove boost-regex references from the Python build files, since parquet still requires boost-regex AFAIR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433160#comment-16433160 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180594466 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { Review comment: By no means should you feel obliged to do that extra work :) If this works, then that's great. I thought maybe you had chosen not to use iterators for a particular reason. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433158#comment-16433158 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180594140 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { + size_t pos = 0; -if (scale != nullptr) { - *scale = 0; + if (size == 0) { +return false; + } + // Sign of the number + if (IsSign(s[pos])) { +out->sign = std::string(s + pos, 1); +++pos; + } + // First run of digits + pos = ParseDigitsRun(s, pos, size, &out->whole_digits); + if (pos == size) { +return !out->whole_digits.empty(); + } + // Optional dot (if given in fractional form) + bool has_dot = IsDot(s[pos]); + if (has_dot) { +// Second run of digits +++pos; +pos = ParseDigitsRun(s, pos, size, &out->fractional_digits); + } + if (out->whole_digits.empty() && out->fractional_digits.empty()) { +// Need at least some digits (whole or fractional) +return false; + } + if (pos == size) { +return true; + } + // Optional exponent + if (StartsExponent(s[pos])) { +++pos; +if (pos == size) { + return false; +} +// Optional exponent sign +if (IsSign(s[pos])) { + out->exponent_sign = std::string(s + pos, 1); + ++pos; +} +pos = ParseDigitsRun(s, pos, size, &out->exponent_digits); +if (out->exponent_digits.empty()) { + // Need some exponent digits + return false; } - -*out = 0; -return Status::OK(); } + return pos == size; +} - boost::smatch results; - const bool matches = boost::regex_match(s, results, DECIMAL_REGEX); +} // namespace - if (!matches) { -std::stringstream ss; -ss << "The string " << s << " is not a valid decimal number"; -return Status::Invalid(ss.str()); +Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, + int32_t* scale) { + if (s.empty()) { +return Status::Invalid("Empty string cannot be converted to decimal"); } - const std::string sign = results["SIGN"]; - const std::string integer = results["INTEGER"]; - - const std::string left_digits = results["LEFT_DIGITS"]; - const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"]; - - const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"]; - - const std::string first_exp_value = results["FIRST_EXP_VALUE"]; - const std::string second_exp_value = results["SECOND_EXP_VALUE"]; - - std::string whole_part; - std::string fractional_part; - std::string exponent_value; - - if (!integer.empty()) { -whole_part = integer; - } else if (!left_digits.empty()) { -DCHECK(second_right_digits.empty()) << s << " " << second_right_digits; -whole_part = left_digits; -fractional_part = first_right_digits; - } else { -DCHECK(first_right_digits.empty()) << s << " " << first_right_digits; -fractional_part = second_right_digits; + DecimalComponents dec; + if (!ParseDecimalComponents
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433155#comment-16433155 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180593829 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { Review comment: I'll try an iterator version tomorrow and see if I like it better. It's a bit late tonight ;-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433154#comment-16433154 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180593708 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { Review comment: Oh, I just mean that iterators abstract over pointer arithmetic and are built into the STL. It feels like C as opposed to C++, that's all. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433138#comment-16433138 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180592157 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } Review comment: don't need parens here. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433150#comment-16433150 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180593109 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { + size_t pos = 0; -if (scale != nullptr) { - *scale = 0; + if (size == 0) { +return false; + } + // Sign of the number + if (IsSign(s[pos])) { +out->sign = std::string(s + pos, 1); +++pos; + } + // First run of digits + pos = ParseDigitsRun(s, pos, size, &out->whole_digits); + if (pos == size) { +return !out->whole_digits.empty(); + } + // Optional dot (if given in fractional form) + bool has_dot = IsDot(s[pos]); + if (has_dot) { +// Second run of digits +++pos; +pos = ParseDigitsRun(s, pos, size, &out->fractional_digits); + } + if (out->whole_digits.empty() && out->fractional_digits.empty()) { +// Need at least some digits (whole or fractional) +return false; + } + if (pos == size) { +return true; + } + // Optional exponent + if (StartsExponent(s[pos])) { +++pos; +if (pos == size) { + return false; +} +// Optional exponent sign +if (IsSign(s[pos])) { + out->exponent_sign = std::string(s + pos, 1); + ++pos; +} +pos = ParseDigitsRun(s, pos, size, &out->exponent_digits); +if (out->exponent_digits.empty()) { + // Need some exponent digits + return false; } - -*out = 0; -return Status::OK(); } + return pos == size; +} - boost::smatch results; - const bool matches = boost::regex_match(s, results, DECIMAL_REGEX); +} // namespace - if (!matches) { -std::stringstream ss; -ss << "The string " << s << " is not a valid decimal number"; -return Status::Invalid(ss.str()); +Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, + int32_t* scale) { + if (s.empty()) { +return Status::Invalid("Empty string cannot be converted to decimal"); } - const std::string sign = results["SIGN"]; - const std::string integer = results["INTEGER"]; - - const std::string left_digits = results["LEFT_DIGITS"]; - const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"]; - - const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"]; - - const std::string first_exp_value = results["FIRST_EXP_VALUE"]; - const std::string second_exp_value = results["SECOND_EXP_VALUE"]; - - std::string whole_part; - std::string fractional_part; - std::string exponent_value; - - if (!integer.empty()) { -whole_part = integer; - } else if (!left_digits.empty()) { -DCHECK(second_right_digits.empty()) << s << " " << second_right_digits; -whole_part = left_digits; -fractional_part = first_right_digits; - } else { -DCHECK(first_right_digits.empty()) << s << " " << first_right_digits; -fractional_part = second_right_digits; + DecimalComponents dec; + if (!ParseDecimalComponents(
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433147#comment-16433147 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180592851 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } Review comment: I was too lazy to look it up :-) Will do. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433146#comment-16433146 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180592739 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { Review comment: I find it much readable like that. I don't know what "subverting the language" means. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433145#comment-16433145 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180592574 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { + size_t pos = 0; -if (scale != nullptr) { - *scale = 0; + if (size == 0) { +return false; + } + // Sign of the number + if (IsSign(s[pos])) { +out->sign = std::string(s + pos, 1); +++pos; + } + // First run of digits + pos = ParseDigitsRun(s, pos, size, &out->whole_digits); + if (pos == size) { +return !out->whole_digits.empty(); + } + // Optional dot (if given in fractional form) + bool has_dot = IsDot(s[pos]); + if (has_dot) { +// Second run of digits +++pos; +pos = ParseDigitsRun(s, pos, size, &out->fractional_digits); + } + if (out->whole_digits.empty() && out->fractional_digits.empty()) { +// Need at least some digits (whole or fractional) +return false; + } + if (pos == size) { +return true; + } + // Optional exponent + if (StartsExponent(s[pos])) { +++pos; +if (pos == size) { + return false; +} +// Optional exponent sign +if (IsSign(s[pos])) { + out->exponent_sign = std::string(s + pos, 1); + ++pos; +} +pos = ParseDigitsRun(s, pos, size, &out->exponent_digits); +if (out->exponent_digits.empty()) { + // Need some exponent digits + return false; } - -*out = 0; -return Status::OK(); } + return pos == size; +} - boost::smatch results; - const bool matches = boost::regex_match(s, results, DECIMAL_REGEX); +} // namespace - if (!matches) { -std::stringstream ss; -ss << "The string " << s << " is not a valid decimal number"; -return Status::Invalid(ss.str()); +Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, + int32_t* scale) { + if (s.empty()) { +return Status::Invalid("Empty string cannot be converted to decimal"); } - const std::string sign = results["SIGN"]; - const std::string integer = results["INTEGER"]; - - const std::string left_digits = results["LEFT_DIGITS"]; - const std::string first_right_digits = results["FIRST_RIGHT_DIGITS"]; - - const std::string second_right_digits = results["SECOND_RIGHT_DIGITS"]; - - const std::string first_exp_value = results["FIRST_EXP_VALUE"]; - const std::string second_exp_value = results["SECOND_EXP_VALUE"]; - - std::string whole_part; - std::string fractional_part; - std::string exponent_value; - - if (!integer.empty()) { -whole_part = integer; - } else if (!left_digits.empty()) { -DCHECK(second_right_digits.empty()) << s << " " << second_right_digits; -whole_part = left_digits; -fractional_part = first_right_digits; - } else { -DCHECK(first_right_digits.empty()) << s << " " << first_right_digits; -fractional_part = second_right_digits; + DecimalComponents dec; + if (!ParseDecimalComponents
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433139#comment-16433139 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180592291 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } Review comment: Why not use `std::isdigit` for this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.
[ https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433135#comment-16433135 ] ASF GitHub Bot commented on ARROW-2435: --- paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool abstraction. URL: https://github.com/apache/arrow/pull/1875#issuecomment-380273948 As CI is not currently setup for Windows I think it’s fine to push ahead with this change without windows support, then reopen the windows support jira and I’ll address it in a separate PR. I seem to be the only person on Windows and momentum is more important at this stage of the rust implementation. Once we have windows CI setup I think we try to keep windows support working on all PRs. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add memory pool abstraction. > --- > > Key: ARROW-2435 > URL: https://issues.apache.org/jira/browse/ARROW-2435 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.9.0 >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Add memory pool abstraction as the c++ api. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433136#comment-16433136 ] ASF GitHub Bot commented on ARROW-2224: --- cpcloud commented on a change in pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880#discussion_r180591991 ## File path: cpp/src/arrow/util/decimal.cc ## @@ -253,117 +251,131 @@ static void StringToInteger(const std::string& str, Decimal128* out) { } } -static const boost::regex DECIMAL_REGEX( -// sign of the number -"(?[-+]?)" - -// digits around the decimal point - "(((?\\d+)\\.(?\\d*)|\\.(?\\d+)" -")" +namespace { -// optional exponent -"([eE](?[-+]?\\d+))?" +struct DecimalComponents { + std::string sign; + std::string whole_digits; + std::string fractional_digits; + std::string exponent_sign; + std::string exponent_digits; +}; -// otherwise -"|" +inline bool IsSign(char c) { return (c == '-' || c == '+'); } -// we're just an integer -"(?\\d+)" +inline bool IsDot(char c) { return c == '.'; } -// or an integer with an exponent -"(?:[eE](?[-+]?\\d+))?)"); +inline bool IsDigit(char c) { return (c >= '0' && c <= '9'); } -static inline bool is_zero_character(char c) { return c == '0'; } +inline bool StartsExponent(char c) { return (c == 'e' || c == 'E'); } -Status Decimal128::FromString(const std::string& s, Decimal128* out, int32_t* precision, - int32_t* scale) { - if (s.empty()) { -return Status::Invalid("Empty string cannot be converted to decimal"); +inline size_t ParseDigitsRun(const char* s, size_t start, size_t size, std::string* out) { + size_t pos; + for (pos = start; pos < size; ++pos) { +if (!IsDigit(s[pos])) { + break; +} } + *out = std::string(s + start, pos - start); + return pos; +} - // case of all zeros - if (std::all_of(s.cbegin(), s.cend(), is_zero_character)) { -if (precision != nullptr) { - *precision = 0; -} +bool ParseDecimalComponents(const char* s, size_t size, DecimalComponents* out) { Review comment: Why not use iterators instead of a pointer plus size? It seems like we're subverting the language. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2224: -- Labels: pull-request-available (was: ) > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433117#comment-16433117 ] ASF GitHub Bot commented on ARROW-2224: --- pitrou opened a new pull request #1880: [WIP] ARROW-2224: [C++] Remove boost-regex dependency URL: https://github.com/apache/arrow/pull/1880 Use a hand-written decimal parser. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432996#comment-16432996 ] Antoine Pitrou commented on ARROW-2224: --- Note that the day where we really care about decimal performance, we should probably use Stefan Krah's libmpdec, which is also used by the Python _decimal module. It's a pure C library, is fast and would probably allow us to have tighter interoperation with Python decimals. > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432995#comment-16432995 ] Phillip Cloud edited comment on ARROW-2224 at 4/10/18 9:11 PM: --- The original implementation used C++ iterators to implement a handwritten parser, but it quickly got complex once we needed to support integers with exponential notation. Maybe the complexity of dealing with boost is greater than the complexity of debugging a handwritten parser, in which case it would be worth it to reimplement it without a regular expression. was (Author: cpcloud): The original implementation used C++ to implement a handwritten parser, but it quickly got complex once we needed to support integers with exponential notation. Maybe the complexity of dealing with boost is greater than the complexity of debugging a handwritten parser, in which case it would be worth it to reimplement it without a regular expression. > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432995#comment-16432995 ] Phillip Cloud commented on ARROW-2224: -- The original implementation used C++ to implement a handwritten parser, but it quickly got complex once we needed to support integers with exponential notation. Maybe the complexity of dealing with boost is greater than the complexity of debugging a handwritten parser, in which case it would be worth it to reimplement it without a regular expression. > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432991#comment-16432991 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180568102 ## File path: cpp/src/arrow/python/decimal.cc ## @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc } Status DecimalMetadata::Update(PyObject* object) { - DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; + bool is_decimal = PyDecimal_Check(object); Review comment: Right now we are doing the check twice in optimized builds, which is not nice IMHO. `DecimalMetadata::Update` is a private API so it's up to the caller to provide appropriate input. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432987#comment-16432987 ] Antoine Pitrou commented on ARROW-2224: --- How important is it to have fast parsing of decimal types? If we reimplement our own parsing, we can tweak performance *and* remove the troublesome dependency. > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432965#comment-16432965 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180564291 ## File path: cpp/src/arrow/python/decimal.cc ## @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc } Status DecimalMetadata::Update(PyObject* object) { - DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; + bool is_decimal = PyDecimal_Check(object); Review comment: This isn't necessary because I added a check before calling `Update` but it does prevent a seg fault if for some reason it's called with non Decimal objects - which is not nice to get. If it's hurts an optimization though, I can remove it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432952#comment-16432952 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180562954 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { +RETURN_NOT_OK(max_decimal_metadata.Update(object)); + } } type_ = Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432946#comment-16432946 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180562328 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { Review comment: Seems like it could be NaN also: ``` In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')]) In [6]: s2 = pd.Series([Decimal('2.0'), None]) In [7]: s1 / s2 Out[7]: 00.5 1NaN dtype: object This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432939#comment-16432939 ] ASF GitHub Bot commented on ARROW-2432: --- cpcloud commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180561137 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { Review comment: Python decimal objects can be nan, unfortunately: ```python >>> import decimal >>> decimal.Decimal('nan') Decimal('NaN') ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432931#comment-16432931 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180559958 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { Review comment: I'm not sure, is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432928#comment-16432928 ] Phillip Cloud commented on ARROW-2224: -- The "should" here isn't related to linking issues, rather, it's related to performance. See here for some benchmarks: http://lh3lh3.users.sourceforge.net/reb.shtml Those benchmarks are now coming up on 8 years old, so we should revisit that. Some of the linking issues might be mitigated by using {{}} on Windows and {{regcomp}} on UNIX platforms. I'm unaware of the perils of that solution (if any). > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432925#comment-16432925 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180559142 ## File path: python/pyarrow/tests/test_convert_pandas.py ## @@ -1326,6 +1326,22 @@ def test_decimal_with_different_precisions(self): expected = [decimal.Decimal('0.01000'), decimal.Decimal('0.00100')] assert array.to_pylist() == expected +def test_decimal_with_None_explicit_type(self): +data = [ +decimal.Decimal('3.14'), +None, +] +series = pd.Series(data) +_check_series_roundtrip(series, type_=pa.decimal128(12, 5)) + +def test_decimal_with_None_infer_type(self): Review comment: sure This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432926#comment-16432926 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180559192 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { +RETURN_NOT_OK(max_decimal_metadata.Update(object)); + } } type_ = Review comment: I'll add that This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432923#comment-16432923 ] Bryan Cutler commented on ARROW-2432: - BTW, it looks like {{pa.array(data, type=pa.decimal128(12, 4), from_pandas=True)}} has the same code path that causes the error, as long as {{data}} is a pandas.Series {noformat} In [5]: pa.array(pd.Series([Decimal('3.14'), None]), from_pandas=True, type=pa.decimal128(12, 4)) --- ArrowInvalid Traceback (most recent call last) in () > 1 pa.array(pd.Series([Decimal('3.14'), None]), from_pandas=True, type=pa.decimal128(12, 4)) ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal {noformat} > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2224) [C++] Replace boost regex usage with libre2
[ https://issues.apache.org/jira/browse/ARROW-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432890#comment-16432890 ] Antoine Pitrou commented on ARROW-2224: --- Can you explain the "should"? AFAICS libre2 isn't header-only, so it may present the same linking and ABI issues as boost-regex. > [C++] Replace boost regex usage with libre2 > --- > > Key: ARROW-2224 > URL: https://issues.apache.org/jira/browse/ARROW-2224 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > We're using {{boost::regex}} to parse decimal strings for {{decimal128}} > types. We should use {{libre2}} instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2440) [Rust] Implement ListBuilder
[ https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2440. --- Resolution: Fixed Issue resolved by pull request 1876 [https://github.com/apache/arrow/pull/1876] > [Rust] Implement ListBuilder > --- > > Key: ARROW-2440 > URL: https://issues.apache.org/jira/browse/ARROW-2440 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Implement ListBuilder -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
[ https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2441: - Assignee: Andy Grove > [Rust] Builder::slice_mut assertions are too strict > -- > > Key: ARROW-2441 > URL: https://issues.apache.org/jira/browse/ARROW-2441 > Project: Apache Arrow > Issue Type: Bug >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The assertions only allow slice up to builder length, rather than up to > builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
[ https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432846#comment-16432846 ] ASF GitHub Bot commented on ARROW-2441: --- pitrou closed pull request #1877: ARROW-2441: [Rust] Builder::slice_mut assertions are too strict URL: https://github.com/apache/arrow/pull/1877 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/rust/src/builder.rs b/rust/src/builder.rs index 15f492d6d..421b1a9a1 100644 --- a/rust/src/builder.rs +++ b/rust/src/builder.rs @@ -65,8 +65,8 @@ impl Builder { /// Get the internal byte-aligned memory buffer as a mutable slice pub fn slice_mut(&self, start: usize, end: usize) -> &mut [T] { assert!(start <= end); -assert!(start < self.len as usize); -assert!(end <= self.len as usize); +assert!(start < self.capacity as usize); +assert!(end <= self.capacity as usize); unsafe { slice::from_raw_parts_mut(self.data.offset(start as isize), (end - start) as usize) } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Builder::slice_mut assertions are too strict > -- > > Key: ARROW-2441 > URL: https://issues.apache.org/jira/browse/ARROW-2441 > Project: Apache Arrow > Issue Type: Bug >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The assertions only allow slice up to builder length, rather than up to > builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder
[ https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432848#comment-16432848 ] ASF GitHub Bot commented on ARROW-2440: --- pitrou closed pull request #1876: ARROW-2440: [Rust] Implement ListBuilder URL: https://github.com/apache/arrow/pull/1876 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/rust/src/lib.rs b/rust/src/lib.rs index 6ab3daabb..e0c14db87 100644 --- a/rust/src/lib.rs +++ b/rust/src/lib.rs @@ -28,4 +28,5 @@ pub mod builder; pub mod datatypes; pub mod error; pub mod list; +pub mod list_builder; pub mod memory; diff --git a/rust/src/list.rs b/rust/src/list.rs index a7b11a628..fad0ed372 100644 --- a/rust/src/list.rs +++ b/rust/src/list.rs @@ -17,9 +17,8 @@ use std::str; -use bytes::{BufMut, BytesMut}; - use super::buffer::Buffer; +use super::list_builder::ListBuilder; pub struct List { pub data: Buffer, @@ -40,19 +39,11 @@ impl List { impl From> for List { fn from(v: Vec) -> Self { -let mut offsets: Vec = Vec::with_capacity(v.len() + 1); -let mut buf = BytesMut::with_capacity(v.len() * 32); -offsets.push(0_i32); +let mut b: ListBuilder = ListBuilder::with_capacity(v.len()); v.iter().for_each(|s| { -let slice = s.as_bytes(); -buf.reserve(slice.len()); -buf.put(slice); -offsets.push(buf.len() as i32); +b.push(s.as_bytes()); }); -List { -data: Buffer::from(buf.freeze()), -offsets: Buffer::from(offsets), -} +b.finish() } } diff --git a/rust/src/list_builder.rs b/rust/src/list_builder.rs new file mode 100644 index 0..0122e680e --- /dev/null +++ b/rust/src/list_builder.rs @@ -0,0 +1,66 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use super::builder::*; +use super::list::List; + +pub struct ListBuilder { +data: Builder, +offsets: Builder, +} + +impl ListBuilder { +pub fn new() -> Self { +ListBuilder::with_capacity(64) +} + +pub fn with_capacity(n: usize) -> Self { +let data = Builder::with_capacity(n); +let mut offsets = Builder::with_capacity(n); +offsets.push(0_i32); +ListBuilder { data, offsets } +} + +pub fn push(&mut self, slice: &[T]) { +self.data.push_slice(slice); +self.offsets.push(self.data.len() as i32); +} + +pub fn finish(&mut self) -> List { +List { +data: self.data.finish(), +offsets: self.offsets.finish(), +} +} +} + +#[cfg(test)] +mod tests { +use super::*; + +#[test] +fn test_list_u8() { +let mut b: ListBuilder = ListBuilder::new(); +b.push("Hello, ".as_bytes()); +b.push("World!".as_bytes()); +let buffer = b.finish(); + +assert_eq!(2, buffer.len()); +assert_eq!("Hello, ".as_bytes(), buffer.slice(0)); +assert_eq!("World!".as_bytes(), buffer.slice(1)); +} +} This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Implement ListBuilder > --- > > Key: ARROW-2440 > URL: https://issues.apache.org/jira/browse/ARROW-2440 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Implement ListBuilder -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
[ https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2441. --- Resolution: Fixed Issue resolved by pull request 1877 [https://github.com/apache/arrow/pull/1877] > [Rust] Builder::slice_mut assertions are too strict > -- > > Key: ARROW-2441 > URL: https://issues.apache.org/jira/browse/ARROW-2441 > Project: Apache Arrow > Issue Type: Bug >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The assertions only allow slice up to builder length, rather than up to > builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.
[ https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432831#comment-16432831 ] ASF GitHub Bot commented on ARROW-2435: --- andygrove commented on issue #1875: ARROW-2435: [Rust] Add memory pool abstraction. URL: https://github.com/apache/arrow/pull/1875#issuecomment-380221529 We should definitely isolate all memory alloc/free in the one file, which I think this PR does nicely. We just need to merge the windows changes and fix the oustanding issue where realloc is failing to free the old memory and I think we're good. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add memory pool abstraction. > --- > > Key: ARROW-2435 > URL: https://issues.apache.org/jira/browse/ARROW-2435 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.9.0 >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Add memory pool abstraction as the c++ api. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
[ https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432820#comment-16432820 ] ASF GitHub Bot commented on ARROW-2441: --- andygrove commented on issue #1877: ARROW-2441: [Rust] Builder::slice_mut assertions are too strict URL: https://github.com/apache/arrow/pull/1877#issuecomment-380220388 @pitrou yes please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Builder::slice_mut assertions are too strict > -- > > Key: ARROW-2441 > URL: https://issues.apache.org/jira/browse/ARROW-2441 > Project: Apache Arrow > Issue Type: Bug >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The assertions only allow slice up to builder length, rather than up to > builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder
[ https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432817#comment-16432817 ] ASF GitHub Bot commented on ARROW-2440: --- andygrove commented on issue #1876: ARROW-2440: [Rust] Implement ListBuilder URL: https://github.com/apache/arrow/pull/1876#issuecomment-380220309 @pitrou yes please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Implement ListBuilder > --- > > Key: ARROW-2440 > URL: https://issues.apache.org/jira/browse/ARROW-2440 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Implement ListBuilder -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2440) [Rust] Implement ListBuilder
[ https://issues.apache.org/jira/browse/ARROW-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432799#comment-16432799 ] ASF GitHub Bot commented on ARROW-2440: --- pitrou commented on issue #1876: ARROW-2440: [Rust] Implement ListBuilder URL: https://github.com/apache/arrow/pull/1876#issuecomment-380217336 Is this good for merging? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Implement ListBuilder > --- > > Key: ARROW-2440 > URL: https://issues.apache.org/jira/browse/ARROW-2440 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Implement ListBuilder -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2441) [Rust] Builder::slice_mut assertions are too strict
[ https://issues.apache.org/jira/browse/ARROW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432797#comment-16432797 ] ASF GitHub Bot commented on ARROW-2441: --- pitrou commented on issue #1877: ARROW-2441: [Rust] Builder::slice_mut assertions are too strict URL: https://github.com/apache/arrow/pull/1877#issuecomment-380217238 Is this good for merging? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Builder::slice_mut assertions are too strict > -- > > Key: ARROW-2441 > URL: https://issues.apache.org/jira/browse/ARROW-2441 > Project: Apache Arrow > Issue Type: Bug >Reporter: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The assertions only allow slice up to builder length, rather than up to > builder capacity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432777#comment-16432777 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180532204 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { +RETURN_NOT_OK(max_decimal_metadata.Update(object)); + } } type_ = Review comment: By the way, what happens here if all items are None? Do we have a test for that? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua Storck updated ARROW-1938: - Comment: was deleted (was: Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453) > [Python] Error writing to partitioned Parquet dataset > - > > Key: ARROW-1938 > URL: https://issues.apache.org/jira/browse/ARROW-1938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux (Ubuntu 16.04) >Reporter: Robert Dailey >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, > pyarrow_dataset_error.png > > > I receive the following error after upgrading to pyarrow 0.8.0 when writing > to a dataset: > * ArrowIOError: Column 3 had 187374 while previous column had 1 > The command was: > write_table_values = {'row_group_size': 1} > pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), > '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', > 'hour'], **write_table_values) > I've also tried write_table_values = {'chunk_size': 1} and received the > same error. > This same command works in version 0.7.1. I am trying to troubleshoot the > problem but wanted to submit a ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432774#comment-16432774 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180531835 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() { if (type_ == NULLPTR) { for (PyObject* object : objects) { - RETURN_NOT_OK(max_decimal_metadata.Update(object)); + if (!internal::PandasObjectIsNull(object)) { Review comment: Do we care about accepting other NULL-like objects such as `float('nan')`? Otherwise `object != Py_None` is a much faster check. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432778#comment-16432778 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180532566 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -758,22 +760,19 @@ Status NumPyConverter::ConvertDecimals() { for (PyObject* object : objects) { const int is_decimal = PyObject_IsInstance(object, decimal_type_.obj()); -if (ARROW_PREDICT_FALSE(is_decimal == 0)) { +if (is_decimal == 1) { + Decimal128 value; + RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value)); + RETURN_NOT_OK(builder.Append(value)); +} else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) { Review comment: Same question as above: do we care about other NULL-like values than simply None? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432776#comment-16432776 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180532831 ## File path: python/pyarrow/tests/test_convert_pandas.py ## @@ -1326,6 +1326,22 @@ def test_decimal_with_different_precisions(self): expected = [decimal.Decimal('0.01000'), decimal.Decimal('0.00100')] assert array.to_pylist() == expected +def test_decimal_with_None_explicit_type(self): +data = [ +decimal.Decimal('3.14'), +None, +] +series = pd.Series(data) +_check_series_roundtrip(series, type_=pa.decimal128(12, 5)) + +def test_decimal_with_None_infer_type(self): Review comment: Can you also check that the expected type is inferred? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432775#comment-16432775 ] ASF GitHub Bot commented on ARROW-2432: --- pitrou commented on a change in pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#discussion_r180531517 ## File path: cpp/src/arrow/python/decimal.cc ## @@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc } Status DecimalMetadata::Update(PyObject* object) { - DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal"; + bool is_decimal = PyDecimal_Check(object); Review comment: I don't think it's ok to do this in optimized build. `DecimalMetadata` expects you to pass a decimal object. @cpcloud may confirm. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432773#comment-16432773 ] Joshua Storck commented on ARROW-1938: -- Bug fix in this PR: https://github.com/apache/parquet-cpp/pull/453 > [Python] Error writing to partitioned Parquet dataset > - > > Key: ARROW-1938 > URL: https://issues.apache.org/jira/browse/ARROW-1938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux (Ubuntu 16.04) >Reporter: Robert Dailey >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, > pyarrow_dataset_error.png > > > I receive the following error after upgrading to pyarrow 0.8.0 when writing > to a dataset: > * ArrowIOError: Column 3 had 187374 while previous column had 1 > The command was: > write_table_values = {'row_group_size': 1} > pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), > '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', > 'hour'], **write_table_values) > I've also tried write_table_values = {'chunk_size': 1} and received the > same error. > This same command works in version 0.7.1. I am trying to troubleshoot the > problem but wanted to submit a ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432768#comment-16432768 ] ASF GitHub Bot commented on ARROW-1938: --- joshuastorck opened a new pull request #453: Bug fix for ARROW-1938 URL: https://github.com/apache/parquet-cpp/pull/453 The error was reported here: https://issues.apache.org/jira/browse/ARROW-1938. Because dictionary types are not supported in writing yet, the code converts the dictionary column to the actual values first before writing. However, the existing code was accidentally using zero as the offset and the length of the column as the size. This resulted in writing all of the column values for each chunk of the column that was supposed to be written. The fix is to pass the offset and size when recursively calling through to WriteColumnChunk with the "flattened" data. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Error writing to partitioned Parquet dataset > - > > Key: ARROW-1938 > URL: https://issues.apache.org/jira/browse/ARROW-1938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux (Ubuntu 16.04) >Reporter: Robert Dailey >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, > pyarrow_dataset_error.png > > > I receive the following error after upgrading to pyarrow 0.8.0 when writing > to a dataset: > * ArrowIOError: Column 3 had 187374 while previous column had 1 > The command was: > write_table_values = {'row_group_size': 1} > pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), > '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', > 'hour'], **write_table_values) > I've also tried write_table_values = {'chunk_size': 1} and received the > same error. > This same command works in version 0.7.1. I am trying to troubleshoot the > problem but wanted to submit a ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset
[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1938: -- Labels: pull-request-available (was: ) > [Python] Error writing to partitioned Parquet dataset > - > > Key: ARROW-1938 > URL: https://issues.apache.org/jira/browse/ARROW-1938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux (Ubuntu 16.04) >Reporter: Robert Dailey >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, > pyarrow_dataset_error.png > > > I receive the following error after upgrading to pyarrow 0.8.0 when writing > to a dataset: > * ArrowIOError: Column 3 had 187374 while previous column had 1 > The command was: > write_table_values = {'row_group_size': 1} > pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), > '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', > 'hour'], **write_table_values) > I've also tried write_table_values = {'chunk_size': 1} and received the > same error. > This same command works in version 0.7.1. I am trying to troubleshoot the > problem but wanted to submit a ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432756#comment-16432756 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler commented on issue #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878#issuecomment-380206853 Ping @cpcloud and @pitrou, please take a look when you can, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432755#comment-16432755 ] ASF GitHub Bot commented on ARROW-2432: --- BryanCutler opened a new pull request #1878: ARROW-2432: [Python] Fix Pandas decimal type conversion with None values URL: https://github.com/apache/arrow/pull/1878 This fixes conversion of Pandas decimal types to Arrow with None values. Previously, if the type was specified, an error would occur when checking if the object was Decimal. If the type was not specified, a segmentation fault would occur when attempting to find the max precision and scale. Added new tests which include None values for both the above cases. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2432) [Python] from_pandas fails when converting decimals if have None values
[ https://issues.apache.org/jira/browse/ARROW-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2432: -- Labels: pull-request-available (was: ) > [Python] from_pandas fails when converting decimals if have None values > --- > > Key: ARROW-2432 > URL: https://issues.apache.org/jira/browse/ARROW-2432 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > > Using from_pandas to convert decimals fails if encounters a value of > {{None}}. For example: > {code:java} > In [1]: import pyarrow as pa > ...: import pandas as pd > ...: from decimal import Decimal > ...: > In [2]: s_dec = pd.Series([Decimal('3.14'), None]) > In [3]: pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.Array.from_pandas(s_dec, type=pa.decimal128(3, 2)) > array.pxi in pyarrow.lib.Array.from_pandas() > array.pxi in pyarrow.lib.array() > error.pxi in pyarrow.lib.check_status() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error converting from Python objects to Decimal: Got Python > object of type NoneType but can only handle these types: decimal.Decimal > {code} > The above error is raised when specifying decimal type. When no type is > specified, a seg fault happens. > This previously worked in 0.8.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432692#comment-16432692 ] Bryant Menn commented on ARROW-2367: So would the ideal implementation then create {{pa.chunked_array()}} and have {{pa.Table.from_pandas()}} call that instead or would there be some criteria to decide between {{pa.array()}} (I believe this is the current flow for {{pa.Tablefrom_pandas()}}) and {{pa.chunked_array()}}? I was trying to follow the process implemented for {{BinaryArray}} conversion to {{ChunkedArray}} but not finished working through and understanding it.. > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432609#comment-16432609 ] Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 5:20 PM: - Got it now Uwe, thanks! Just saw the assertion line, totally missed it. was (Author: kszucs): Got it now Uwe, thanks! > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432609#comment-16432609 ] Krisztian Szucs commented on ARROW-2367: Got it now Uwe, thanks! > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder
[ https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432608#comment-16432608 ] ASF GitHub Bot commented on ARROW-2411: --- pitrou commented on issue #1852: ARROW-2411: [C++] Add StringBuilder::Append(const char **values) URL: https://github.com/apache/arrow/pull/1852#issuecomment-380179127 Thanks @kou ! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Add method to append batches of null-terminated strings to StringBuilder > -- > > Key: ARROW-2411 > URL: https://issues.apache.org/jira/browse/ARROW-2411 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, GLib >Reporter: Uwe L. Korn >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > We should add a method {{StringBuilder::AppendCStrings(const char** values, > const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to > have fast inserts for these strings. See > https://github.com/apache/arrow/pull/1845/files for a use case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder
[ https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2411: - Assignee: Kouhei Sutou > [C++] Add method to append batches of null-terminated strings to StringBuilder > -- > > Key: ARROW-2411 > URL: https://issues.apache.org/jira/browse/ARROW-2411 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, GLib >Reporter: Uwe L. Korn >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > We should add a method {{StringBuilder::AppendCStrings(const char** values, > const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to > have fast inserts for these strings. See > https://github.com/apache/arrow/pull/1845/files for a use case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder
[ https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-2411. --- Resolution: Fixed Issue resolved by pull request 1852 [https://github.com/apache/arrow/pull/1852] > [C++] Add method to append batches of null-terminated strings to StringBuilder > -- > > Key: ARROW-2411 > URL: https://issues.apache.org/jira/browse/ARROW-2411 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, GLib >Reporter: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > We should add a method {{StringBuilder::AppendCStrings(const char** values, > const uint8_t* valid_bytes = NULLPTR)}} to the {{StringBuilder}} class to > have fast inserts for these strings. See > https://github.com/apache/arrow/pull/1845/files for a use case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2411) [C++] Add method to append batches of null-terminated strings to StringBuilder
[ https://issues.apache.org/jira/browse/ARROW-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432605#comment-16432605 ] ASF GitHub Bot commented on ARROW-2411: --- pitrou closed pull request #1852: ARROW-2411: [C++] Add StringBuilder::Append(const char **values) URL: https://github.com/apache/arrow/pull/1852 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index fb1bebfca..2ae9f48db 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -1022,6 +1022,72 @@ TEST_F(TestStringBuilder, TestAppendVector) { } } +TEST_F(TestStringBuilder, TestAppendCStringsWithValidBytes) { + const char* strings[] = {nullptr, "aaa", nullptr, "ignored", ""}; + vector valid_bytes = {1, 1, 1, 0, 1}; + + int N = static_cast(sizeof(strings) / sizeof(strings[0])); + int reps = 1000; + + for (int j = 0; j < reps; ++j) { +ASSERT_OK(builder_->Append(strings, N, valid_bytes.data())); + } + Done(); + + ASSERT_EQ(reps * N, result_->length()); + ASSERT_EQ(reps * 3, result_->null_count()); + ASSERT_EQ(reps * 3, result_->value_data()->size()); + + int32_t length; + int32_t pos = 0; + for (int i = 0; i < N * reps; ++i) { +auto string = strings[i % N]; +if (string && valid_bytes[i % N]) { + ASSERT_FALSE(result_->IsNull(i)); + result_->GetValue(i, &length); + ASSERT_EQ(pos, result_->value_offset(i)); + ASSERT_EQ(static_cast(strlen(string)), length); + ASSERT_EQ(strings[i % N], result_->GetString(i)); + + pos += length; +} else { + ASSERT_TRUE(result_->IsNull(i)); +} + } +} + +TEST_F(TestStringBuilder, TestAppendCStringsWithoutValidBytes) { + const char* strings[] = {"", "bb", "a", nullptr, "ccc"}; + + int N = static_cast(sizeof(strings) / sizeof(strings[0])); + int reps = 1000; + + for (int j = 0; j < reps; ++j) { +ASSERT_OK(builder_->Append(strings, N)); + } + Done(); + + ASSERT_EQ(reps * N, result_->length()); + ASSERT_EQ(reps, result_->null_count()); + ASSERT_EQ(reps * 6, result_->value_data()->size()); + + int32_t length; + int32_t pos = 0; + for (int i = 0; i < N * reps; ++i) { +if (strings[i % N]) { + ASSERT_FALSE(result_->IsNull(i)); + result_->GetValue(i, &length); + ASSERT_EQ(pos, result_->value_offset(i)); + ASSERT_EQ(static_cast(strlen(strings[i % N])), length); + ASSERT_EQ(strings[i % N], result_->GetString(i)); + + pos += length; +} else { + ASSERT_TRUE(result_->IsNull(i)); +} + } +} + TEST_F(TestStringBuilder, TestZeroLength) { // All buffers are null Done(); diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index c97253e64..78c42f4fc 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -1413,6 +1413,64 @@ Status StringBuilder::Append(const std::vector& values, return Status::OK(); } +Status StringBuilder::Append(const char** values, int64_t length, + const uint8_t* valid_bytes) { + std::size_t total_length = 0; + std::vector value_lengths(length); + bool have_null_value = false; + for (int64_t i = 0; i < length; ++i) { +if (values[i]) { + auto value_length = strlen(values[i]); + value_lengths[i] = value_length; + total_length += value_length; +} else { + have_null_value = true; +} + } + RETURN_NOT_OK(Reserve(length)); + RETURN_NOT_OK(value_data_builder_.Reserve(total_length)); + RETURN_NOT_OK(offsets_builder_.Reserve(length)); + + if (valid_bytes) { +int64_t valid_bytes_offset = 0; +for (int64_t i = 0; i < length; ++i) { + RETURN_NOT_OK(AppendNextOffset()); + if (valid_bytes[i]) { +if (values[i]) { + RETURN_NOT_OK(value_data_builder_.Append( + reinterpret_cast(values[i]), value_lengths[i])); +} else { + UnsafeAppendToBitmap(valid_bytes + valid_bytes_offset, i - valid_bytes_offset); + UnsafeAppendToBitmap(false); + valid_bytes_offset = i + 1; +} + } +} +UnsafeAppendToBitmap(valid_bytes + valid_bytes_offset, length - valid_bytes_offset); + } else { +if (have_null_value) { + std::vector valid_vector(length, 0); + for (int64_t i = 0; i < length; ++i) { +RETURN_NOT_OK(AppendNextOffset()); +if (values[i]) { + RETURN_NOT_OK(value_data_builder_.Append( + reinterpret_cast(values[i]), value_lengths[i])); + valid_vector[i] = 1; +} + } + UnsafeAppendToBitmap(valid_vector.data(), length); +} else { + for (int64_t i = 0; i < length; ++i) { +RETURN_NOT_OK(AppendNextOffset()); +RETUR
[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432606#comment-16432606 ] Uwe L. Korn commented on ARROW-2367: The expected output would be that the resulting {{pyarrow.Table}} has a {{ChunkedArray}} instance for the respective column that consists of multiple chunks. Currently we only generate a single {{pa.Array}} and wrap that simply in a {{pa.ChunkedArray}}. Thus we are limited to what a single {{pa.Array}} can hold. This should only happen on {{pa.Table.from_pandas()}} though. When a user calls {{pa.array()}} the current error is expected behaviour. We could add a second method {{pa.chunked_array()}} that would return a {{ChunkedArray}} with more than one chunk if necessary. > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc
[ https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432580#comment-16432580 ] Uwe L. Korn commented on ARROW-1171: > I'm assuming the issue is that the libstdc++ Arrow was (statically or not) > linked with also gets used at runtime to resolve symbols in turbodbc (or the > reverse)? Am I right? Yes. There are some symbols that are in the C++ STL that are always statically linked. We provide a version script and don't export our symbols but sadly other libraries export theirs and this then leads to a crash in Arrow. Using {{-Bsymbolic}} (and friends!) would be the path that would ensure that Arrow actually uses the symbols that were packaged with it. Sadly this has some nasty side effects. When I have some more time, I try to dive into this issue again and read also the linked articles. Currently this problem here is no a high priority for me. > C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and > self-compiled turbodbc > > > Key: ARROW-1171 > URL: https://issues.apache.org/jira/browse/ARROW-1171 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.4.1 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Original issue: https://github.com/blue-yonder/turbodbc/issues/102 > When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora > 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute > here is that the compiler version used for building Turbodbc is newer than > the one used for Arrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders
[ https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432556#comment-16432556 ] ASF GitHub Bot commented on ARROW-2330: --- alendit commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer creation with partially finishable array builders URL: https://github.com/apache/arrow/pull/1769#issuecomment-380165985 Hi Uwe, I see what you mean. Now that I've looked more carefully into `BufferSlices`, I see how much danger they can bear. Something like [this](https://gist.github.com/alendit/f759bc12e03d9dd9d72cac90a6334cc5) would cause a read-after-free. The problem, as I see it, is, that the SliceBuffer user has no way to ensure its validity. So even if we skip this PR, the problems with slices might happen in the future. I think some lean solution, like adding a `shared_ptr parent` to the slice and referencing its data instead, will increase the memory safety and might benefit this PR, too. Do you think we should discuss it on the mailing list? Cheers, Dimitri. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Optimize delta buffer creation with partially finishable array builders > - > > Key: ARROW-2330 > URL: https://issues.apache.org/jira/browse/ARROW-2330 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.8.0 >Reporter: Dimitri Vorona >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The main aim of this change is to optimize the building of delta > dictionaries. In the current version delta dictionaries are built using an > additional "overflow" buffer which leads to complicated and potentially > error-prone code and subpar performance by doubling the number of lookups. > I solve this problem by introducing the notion of partially finishable array > builders, i.e. builder which are able to retain the state on calling Finish. > The interface is based on RecordBatchBuilder::Flush, i.e. Finish is > overloaded with additional signature Finish(bool reset_builder, > std::shared_ptr* out). The resulting Arrays point to the same data > buffer with different offsets. > I'm aware that the change is kind of biggish, but I'd like to discuss it > here. The solution makes the code more straight forward, doesn't bloat the > code base too much and leaves the API more or less untouched. Additionally, > the new way to make delta dictionaries by using a different call signature to > Finish feel cleaner to me. > I'm looking forward to your critic and improvement ideas. > The pull request is available at: https://github.com/apache/arrow/pull/1769 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.
[ https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432485#comment-16432485 ] ASF GitHub Bot commented on ARROW-2435: --- paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool abstraction. URL: https://github.com/apache/arrow/pull/1875#issuecomment-380141516 This is similar to what we had in iron-arrow before the contribution to the main arrow repo but it seemed that @andygrove went a different direction initially. I just added windows support last night to memory.rs and friends. Do we plan to isolate libc calls to this new memory pool (replacing the memory.rs approach)? Just wondering what the direction is as the above memory pool approach does not work for windows. I can add windows support for memory pool also once it's merged. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add memory pool abstraction. > --- > > Key: ARROW-2435 > URL: https://issues.apache.org/jira/browse/ARROW-2435 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.9.0 >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Add memory pool abstraction as the c++ api. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc
[ https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432484#comment-16432484 ] Antoine Pitrou commented on ARROW-1171: --- I'm assuming the issue is that the libstdc++ Arrow was (statically or not) linked with also gets used at runtime to resolve symbols in turbodbc (or the reverse)? Am I right? Apparently another solution would be to use something called a "version script" to control which symbols are exported from the Arrow library, so that the static libstdc++ doesn't participate in dynamic symbol resolution: [https://www.technovelty.org/c/what-exactly-does-bsymblic-do.html] [http://anadoxin.org/blog/control-over-symbol-exports-in-gcc.html] (I may be speaking cluelessly here) > C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and > self-compiled turbodbc > > > Key: ARROW-1171 > URL: https://issues.apache.org/jira/browse/ARROW-1171 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.4.1 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Original issue: https://github.com/blue-yonder/turbodbc/issues/102 > When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora > 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute > here is that the compiler version used for building Turbodbc is newer than > the one used for Arrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432481#comment-16432481 ] ASF GitHub Bot commented on ARROW-2434: --- paddyhoran commented on issue #1873: ARROW-2434: [Rust] Add windows support URL: https://github.com/apache/arrow/pull/1873#issuecomment-380146676 Ah I see. Thanks @xhochy, sorry for the silly question... This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add windows support > -- > > Key: ARROW-2434 > URL: https://issues.apache.org/jira/browse/ARROW-2434 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Currently `cargo test` fails on windows OS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2444) Better handle reading empty parquet files
[ https://issues.apache.org/jira/browse/ARROW-2444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Crist updated ARROW-2444: - Description: >From [https://github.com/dask/dask/pull/3387#issuecomment-380140003] Currently pyarrow reads empty parts as float64, even if the underlying columns have other dtypes. This can cause problems for pandas downstream, as certain operations are only valid on certain dtypes, even if the columns are empty. Copying the comment Uwe over: bq. {quote}This is the expected behaviour as an empty string column in Pandas is simply an empty column of type object. Sadly object does not tell us much about the type of the column at all. We return numpy.float64 in this case as it's the most efficient type to store nulls in Pandas.{quote} {quote}This seems unintuitive at best to me. An empty object column in pandas is treated differently in many operations than an empty float64 column (str accessor is available, excluded from numeric operations, etc..). Having an empty file read in as a different dtype than was written could lead to errors in processing code downstream. Would arrow be willing to change this behavior?{quote} We should probably use another method than `field.type.to_pandas_dtype()` in this case. The column saved in Parquet should be saved with `NA` as type which sadly does not provide enough information. We also store the original dtype in the Pandas metadata that is used for the actual DataFrame reconstruction later on. If we would also pick up the metadata when it was written, we should be able to correctly reconstruct the dtype. was: >From [https://github.com/dask/dask/pull/3387#issuecomment-380140003] Currently pyarrow reads empty parts as float64, even if the underlying columns have other dtypes. This can cause problems for pandas downstream, as certain operations are only valid on certain dtypes, even if the columns are empty. Copying the comment Uwe over: bq. {quote}This is the expected behaviour as an empty string column in Pandas is simply an empty column of type object. Sadly object does not tell us much about the type of the column at all. We return numpy.float64 in this case as it's the most efficient type to store nulls in Pandas.{quote} {quote}This seems unintuitive at best to me. An empty object column in pandas is treated differently in many operations than an empty float64 column (str accessor is available, excluded from numeric operations, etc..). Having an empty file read in as a different dtype than was written could lead to errors in processing code downstream. Would arrow be willing to change this behavior?{quote} We should probably use another method than `field.type.to_pandas_dtype()` in this case. The column saved in Parquet should be saved with `NA` as type which sadly does not provide enough information. We also store the original dtype in the Pandas metadata that is used for the actual DataFrame reconstruction later on. If we would also pick up the metadata when it was written, we should be able to correctly reconstruct the dtype. {quote} > Better handle reading empty parquet files > - > > Key: ARROW-2444 > URL: https://issues.apache.org/jira/browse/ARROW-2444 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jim Crist >Priority: Major > > From [https://github.com/dask/dask/pull/3387#issuecomment-380140003] > > Currently pyarrow reads empty parts as float64, even if the underlying > columns have other dtypes. This can cause problems for pandas downstream, as > certain operations are only valid on certain dtypes, even if the columns are > empty. > > Copying the comment Uwe over: > > bq. {quote}This is the expected behaviour as an empty string column in Pandas > is simply an empty column of type object. Sadly object does not tell us much > about the type of the column at all. We return numpy.float64 in this case as > it's the most efficient type to store nulls in Pandas.{quote} > {quote}This seems unintuitive at best to me. An empty object column in pandas > is treated differently in many operations than an empty float64 column (str > accessor is available, excluded from numeric operations, etc..). Having an > empty file read in as a different dtype than was written could lead to errors > in processing code downstream. Would arrow be willing to change this > behavior?{quote} > We should probably use another method than `field.type.to_pandas_dtype()` in > this case. The column saved in Parquet should be saved with `NA` as type > which sadly does not provide enough information. > We also store the original dtype in the Pandas metadata that is used for the > actual DataFrame reconstruction later on. If we would also pick up the > metadata when it was written, we should be able t
[jira] [Created] (ARROW-2444) Better handle reading empty parquet files
Jim Crist created ARROW-2444: Summary: Better handle reading empty parquet files Key: ARROW-2444 URL: https://issues.apache.org/jira/browse/ARROW-2444 Project: Apache Arrow Issue Type: Improvement Reporter: Jim Crist >From [https://github.com/dask/dask/pull/3387#issuecomment-380140003] Currently pyarrow reads empty parts as float64, even if the underlying columns have other dtypes. This can cause problems for pandas downstream, as certain operations are only valid on certain dtypes, even if the columns are empty. Copying the comment Uwe over: bq. {quote}This is the expected behaviour as an empty string column in Pandas is simply an empty column of type object. Sadly object does not tell us much about the type of the column at all. We return numpy.float64 in this case as it's the most efficient type to store nulls in Pandas.{quote} {quote}This seems unintuitive at best to me. An empty object column in pandas is treated differently in many operations than an empty float64 column (str accessor is available, excluded from numeric operations, etc..). Having an empty file read in as a different dtype than was written could lead to errors in processing code downstream. Would arrow be willing to change this behavior?{quote} We should probably use another method than `field.type.to_pandas_dtype()` in this case. The column saved in Parquet should be saved with `NA` as type which sadly does not provide enough information. We also store the original dtype in the Pandas metadata that is used for the actual DataFrame reconstruction later on. If we would also pick up the metadata when it was written, we should be able to correctly reconstruct the dtype. {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461 ] Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:29 PM: - What's the expected output? It should be able to hold `2^31 - 1` bytes? According to my investigation https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44 limits to `((2^31) - 1) - 1` was (Author: kszucs): What's the expected output? It should be able to hold `2^31 - 1` bytes? According to my investigation https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44 limits to `(2^31) - 1` > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461 ] Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:29 PM: - What's the expected output? It should be able to hold `2^31 - 1` bytes? According to my investigation the limit is set to `((2^31) - 1) - 1`. https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44 was (Author: kszucs): What's the expected output? It should be able to hold `2^31 - 1` bytes? According to my investigation https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44 limits to `((2^31) - 1) - 1` > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461 ] Krisztian Szucs edited comment on ARROW-2367 at 4/10/18 3:28 PM: - What's the expected output? It should be able to hold `2^31 - 1` bytes? According to my investigation https://github.com/apache/arrow/blob/f9c070162acd87caf031723fb8d7e9c9a7317c0e/cpp/src/arrow/builder.h#L44 limits to `(2^31) - 1` was (Author: kszucs): What's the expected output? > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2367) ListArray has trouble with sizes greater than kMaximumCapacity
[ https://issues.apache.org/jira/browse/ARROW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432461#comment-16432461 ] Krisztian Szucs commented on ARROW-2367: What's the expected output? > ListArray has trouble with sizes greater than kMaximumCapacity > -- > > Key: ARROW-2367 > URL: https://issues.apache.org/jira/browse/ARROW-2367 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Bryant Menn >Priority: Major > > When creating a Pandas dataframe with lists as elements as a column the > following error occurs when converting to a {{pyarrow.Table}} object. > {code} > Traceback (most recent call last): > File "arrow-2227.py", line 16, in > arr = pa.array(df['strings'], from_pandas=True) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 77, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: BinaryArray cannot contain more than 2147483646 > bytes, have 2147483647 > {code} > The following code was used to generate the error (adapted from ARROW-2227): > {code} > import pandas as pd > import pyarrow as pa > # Commented lines were used to test non-binary data types, both cause the > same error > v1 = b'x' * 1 > v2 = b'x' * 147483646 > # v1 = 'x' * 1 > # v2 = 'x' * 147483646 > df = pd.DataFrame({ > 'strings': [[v1]] * 20 + [[v2]] + [[b'x']] > # 'strings': [[v1]] * 20 + [[v2]] + [['x']] > }) > arr = pa.array(df['strings'], from_pandas=True) > assert isinstance(arr, pa.ChunkedArray), type(arr) > {code} > Code was run using Python 3.6 with PyArrow installed from conda-forge on > macOS High Sierra. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.
[ https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432458#comment-16432458 ] ASF GitHub Bot commented on ARROW-2435: --- paddyhoran commented on issue #1875: ARROW-2435: [Rust] Add memory pool abstraction. URL: https://github.com/apache/arrow/pull/1875#issuecomment-380141516 This is similar to what we had in iron-arrow before the contribution to the main arrow repo but it seemed that @andygrove went a different direction initially. I just added windows support last night to memory.rs and friends. Do we plan to isolate libc calls to this new memory pool (replacing the memory.rs approach)? Just wondering what the direction is as the above memory pool approach does not work from windows. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add memory pool abstraction. > --- > > Key: ARROW-2435 > URL: https://issues.apache.org/jira/browse/ARROW-2435 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Affects Versions: 0.9.0 >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: pull-request-available > > Add memory pool abstraction as the c++ api. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc
[ https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432457#comment-16432457 ] Uwe L. Korn commented on ARROW-1171: It will not end the issues but help us to move to a more modern compiler for the {{manylinux1}} packages. We will still get issues when the wrong symbols are pulled in, probably for a lot less symbols than currently. > C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and > self-compiled turbodbc > > > Key: ARROW-1171 > URL: https://issues.apache.org/jira/browse/ARROW-1171 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.4.1 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Original issue: https://github.com/blue-yonder/turbodbc/issues/102 > When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora > 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute > here is that the compiler version used for building Turbodbc is newer than > the one used for Arrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc
[ https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432452#comment-16432452 ] Antoine Pitrou commented on ARROW-1171: --- As a sidenote, a new manylinux standard is being developed with a more recent baseline (CentOS 6) : https://www.python.org/dev/peps/pep-0571/ The PEP is still being discussed but chances are it will be accepted soon: https://mail.python.org/pipermail/distutils-sig/2018-March/032102.html Perhaps it will end the libstdc++ issues (or perhaps not :-)). > C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and > self-compiled turbodbc > > > Key: ARROW-1171 > URL: https://issues.apache.org/jira/browse/ARROW-1171 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.4.1 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Original issue: https://github.com/blue-yonder/turbodbc/issues/102 > When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora > 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute > here is that the compiler version used for building Turbodbc is newer than > the one used for Arrow. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432453#comment-16432453 ] ASF GitHub Bot commented on ARROW-2434: --- xhochy commented on issue #1873: ARROW-2434: [Rust] Add windows support URL: https://github.com/apache/arrow/pull/1873#issuecomment-380140599 @paddyhoran because it lists only the top 100 contributors (not sure how the contributors with only one contribution are sorted). You will be definitely listed on the page once you make a second contribution. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add windows support > -- > > Key: ARROW-2434 > URL: https://issues.apache.org/jira/browse/ARROW-2434 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Currently `cargo test` fails on windows OS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2434) [Rust] Add windows support
[ https://issues.apache.org/jira/browse/ARROW-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432446#comment-16432446 ] ASF GitHub Bot commented on ARROW-2434: --- paddyhoran commented on issue #1873: ARROW-2434: [Rust] Add windows support URL: https://github.com/apache/arrow/pull/1873#issuecomment-380138524 @xhochy do you know why I am not listed on the [contributors](https://github.com/apache/arrow/graphs/contributors) page after this getting merged? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Rust] Add windows support > -- > > Key: ARROW-2434 > URL: https://issues.apache.org/jira/browse/ARROW-2434 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Currently `cargo test` fails on windows OS. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2330) [C++] Optimize delta buffer creation with partially finishable array builders
[ https://issues.apache.org/jira/browse/ARROW-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432390#comment-16432390 ] ASF GitHub Bot commented on ARROW-2330: --- xhochy commented on issue #1769: ARROW-2330: [C++] Optimize delta buffer creation with partially finishable array builders URL: https://github.com/apache/arrow/pull/1769#issuecomment-380127439 Hello Dimitri, I forgot to reply to you. This got drowned in all the other Arrow activity in the last days. Sorry about that. > Do you know why the RecordBatchBuilder has a partial Flush method? Is it because its instantion is more cumbersome compared to array builders? The `RecordBatchBuilder` is different to the other builders as it does not generate a Arrays but should be used for a more streaming-like output. In it, all builders are resetted on Flush. This is also the behaviour I would expect in the dictionary builder. > That being said, I don't think that it adds that much complexity, compared to the previous implementation. Most of the changes are in the testing code, and builder files themselves have around 30 additional LOC. At the same time, it makes the DictionaryBuilder quite straight forward. The added complexity is the thing that scares me most at the moment. The Builders are built with the intention that they complety own their buffers and can modify them as they like internally. You can ask them for snapshot access to elements to inspect them but we don't provide any guarantees that the memory addresses that were returned from them are still valid once you can a method on the builder again. As far as I understand the current implementation in some cases we return the current buffer pointer as new `Buffer` object in the BufferBuilder. Once we do expand it though later on, we may free this memory address and thus the `Buffer` object point to unreachable memory. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Optimize delta buffer creation with partially finishable array builders > - > > Key: ARROW-2330 > URL: https://issues.apache.org/jira/browse/ARROW-2330 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.8.0 >Reporter: Dimitri Vorona >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > The main aim of this change is to optimize the building of delta > dictionaries. In the current version delta dictionaries are built using an > additional "overflow" buffer which leads to complicated and potentially > error-prone code and subpar performance by doubling the number of lookups. > I solve this problem by introducing the notion of partially finishable array > builders, i.e. builder which are able to retain the state on calling Finish. > The interface is based on RecordBatchBuilder::Flush, i.e. Finish is > overloaded with additional signature Finish(bool reset_builder, > std::shared_ptr* out). The resulting Arrays point to the same data > buffer with different offsets. > I'm aware that the change is kind of biggish, but I'd like to discuss it > here. The solution makes the code more straight forward, doesn't bloat the > code base too much and leaves the API more or less untouched. Additionally, > the new way to make delta dictionaries by using a different call signature to > Finish feel cleaner to me. > I'm looking forward to your critic and improvement ideas. > The pull request is available at: https://github.com/apache/arrow/pull/1769 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2443) [Python] Conversion from pandas of empty categorical fails with ArrowInvalid
Florian Jetter created ARROW-2443: - Summary: [Python] Conversion from pandas of empty categorical fails with ArrowInvalid Key: ARROW-2443 URL: https://issues.apache.org/jira/browse/ARROW-2443 Project: Apache Arrow Issue Type: Bug Affects Versions: 0.9.0 Reporter: Florian Jetter The conversion of an empty pandas categorical raises an exception. Before version `0.9.0` this was possible {code:java} import pandas as pd import pyarrow as pa pa.Table.from_pandas(pd.DataFrame({'cat': pd.Categorical([])})){code} raises: {{ArrowInvalid: Dictionary indices must have non-zero length}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2222) [C++] Add option to validate Flatbuffers messages
[ https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432283#comment-16432283 ] ASF GitHub Bot commented on ARROW-: --- crepererum commented on issue #1763: ARROW-: handle untrusted inputs (POC) URL: https://github.com/apache/arrow/pull/1763#issuecomment-380104686 I would say the performance impact is negligible. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Add option to validate Flatbuffers messages > - > > Key: ARROW- > URL: https://issues.apache.org/jira/browse/ARROW- > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Marco Neumann >Priority: Major > Labels: pull-request-available > > This is follow up work to ARROW-1589, ARROW-2023, and can be validated by the > {{ipc-fuzzer-test}}. Users receiving untrusted input streams can prevent > segfaults this way > As part of this, we should quantify the overhead associated with message > validation in regular use -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2222) [C++] Add option to validate Flatbuffers messages
[ https://issues.apache.org/jira/browse/ARROW-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432282#comment-16432282 ] ASF GitHub Bot commented on ARROW-: --- crepererum commented on issue #1763: ARROW-: handle untrusted inputs (POC) URL: https://github.com/apache/arrow/pull/1763#issuecomment-380103831 ## Environment ### Build Commands ``` env CC=clang CXX=clang++ cmake -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_BENCHMARKS=ON -GNinja .. env CC=clang CXX=clang++ ninja ``` ### Conda ``` # NameVersion Build Channel boost 1.66.0 py36_1conda-forge boost-cpp 1.66.01conda-forge bzip2 1.0.6 1conda-forge ca-certificates 2018.1.18 0conda-forge certifi 2018.1.18py36_0conda-forge clangdev 6.0.0 default_0conda-forge cmake 3.11.00conda-forge curl 7.59.01conda-forge expat 2.2.5 0conda-forge icu 58.2 0conda-forge intel-openmp 2018.0.0 8 krb5 1.14.60conda-forge libcxx6.0.0 0conda-forge libgfortran 3.0.1h93005f0_2 libiconv 1.15 0conda-forge libssh2 1.8.0 2conda-forge libuv 1.20.00conda-forge libxml2 2.9.8 0conda-forge llvm-meta 6.0.0 0conda-forge llvmdev 6.0.0 default_0conda-forge mkl 2018.0.2 1 mkl_fft 1.0.1py36_1conda-forge mkl_random1.0.1py36_0conda-forge ncurses 5.9 10conda-forge ninja 1.8.2 1conda-forge numpy 1.14.2 py36ha9ae307_1 openssl 1.0.2n0conda-forge pip 9.0.3py36_0conda-forge python3.6.5 1conda-forge readline 7.0 0conda-forge rhash 1.3.4 0conda-forge setuptools39.0.1 py36_0conda-forge sqlite3.20.12conda-forge tk8.6.7 0conda-forge wheel 0.31.0 py36_0conda-forge xz5.2.3 0conda-forge zlib 1.2.110conda-forge ``` ## Test Results ### complete PR ``` Run on (4 X 2700 MHz CPU s) 2018-04-10 15:33:15 Benchmark Time CPU Iterations BM_WriteRecordBatch/1/min_time:1.000/real_time 76462 ns 76204 ns 16080 12.7719GB/s BM_WriteRecordBatch/4/min_time:1.000/real_time 63077 ns 62791 ns 2122915.482GB/s BM_WriteRecordBatch/16/min_time:1.000/real_time56850 ns 56584 ns 24134 17.1778GB/s BM_WriteRecordBatch/64/min_time:1.000/real_time58209 ns 58003 ns 22891 16.7768GB/s BM_WriteRecordBatch/256/min_time:1.000/real_time 105328 ns 104930 ns 12450 9.27162GB/s BM_WriteRecordBatch/1024/min_time:1.000/real_time 247453 ns 245516 ns 5383 3.94645GB/s BM_WriteRecordBatch/4k/min_time:1.000/real_time 904931 ns 900426 ns 1384 1105.06MB/s BM_WriteRecordBatch/8k/min_time:1.000/real_time 1869873 ns1861388 ns704 534.796MB/s BM_ReadRecordBatch/1/min_time:1.000/real_time 2409 ns 2402 ns 584482 405.366GB/s BM_ReadRecordBatch/4/min_time:1.000/real_time 4919 ns 4888 ns 284403 198.544GB/s BM_ReadRecordBatch/16/min_time:1.000/real_time 14176 ns 14123 ns 73649 68.8866GB/s BM_ReadRecordBatch/64/min_time:1.000/real_time 51842 ns