[jira] [Created] (ARROW-9502) [Python][C++] Date64 converted to Date32 on parquet
Jorge created ARROW-9502:

Summary: [Python][C++] Date64 converted to Date32 on parquet
Key: ARROW-9502
URL: https://issues.apache.org/jira/browse/ARROW-9502
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Jorge

Executing the example below,

{code:python}
import datetime

import pyarrow as pa
import pyarrow.parquet

data = [
    datetime.datetime(2000, 1, 1, 12, 34, 56, 123456),
    datetime.datetime(2000, 1, 1),
]
data32 = pa.array(data, type='date32')
data64 = pa.array(data, type='date64')

table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])
pyarrow.parquet.write_table(table, 'a.parquet')

print(table)
print()
print(pyarrow.parquet.read_table('a.parquet'))
{code}

yields

{code}
pyarrow.Table
a: date32[day]
b: date64[ms]

pyarrow.Table
a: date32[day]
b: date32[day]  <--- IMO it should be date64[ms]
{code}

indicating that pyarrow converted its date64[ms] schema to date32[day]. I used the Rust crate to print the parquet file's metadata, and the value is indeed stored as i32, which suggests that this happens in the writer, not the reader.

IMO this does not have any practical implication, since both types represent dates at day resolution, but it still constitutes an error, as the roundtrip serialization does not preserve the schema.

A broader question I have is why date64 exists in the first place: I can't see any reason to store a *date* in milliseconds since EPOCH.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
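[Editor's note] For reference, the two Arrow date types encode the same calendar date at day resolution but with different units: date32 stores days since the UNIX epoch, date64 stores milliseconds since the epoch (a multiple of the milliseconds per day). A minimal pure-Python sketch of that relationship; the helper names are illustrative, not pyarrow API:

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 86_400_000

def date32_value(d: datetime.date) -> int:
    # days since the UNIX epoch, as stored by date32[day]
    return (d - EPOCH).days

def date64_value(d: datetime.date) -> int:
    # milliseconds since the UNIX epoch, as stored by date64[ms]
    return date32_value(d) * MS_PER_DAY

d = datetime.date(2000, 1, 1)
assert date32_value(d) == 10957
assert date64_value(d) == 10957 * MS_PER_DAY
```

This also shows why the lossy direction is the one in the report: any in-range date64 divides evenly by MS_PER_DAY, so converting to date32 silently changes only the declared type, not the dates.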
[jira] [Created] (ARROW-9501) [C++][Gandiva] Add logic in timestampdiff() when end date is last day of a month
Sagnik Chakraborty created ARROW-9501:

Summary: [C++][Gandiva] Add logic in timestampdiff() when end date is last day of a month
Key: ARROW-9501
URL: https://issues.apache.org/jira/browse/ARROW-9501
Project: Apache Arrow
Issue Type: Task
Reporter: Sagnik Chakraborty

{{timestampdiff}}(*month*, _startDate_, _endDate_) returns a wrong result in Gandiva when _endDate_'s day-of-month is less than _startDate_'s and _endDate_ falls on the last day of its month. An additional month is counted as having passed when the end day is greater than or equal to the start day, but this does not hold for end dates that are the last day of a month. Case in point: if _startDate_ = *2020-01-31* and _endDate_ = *2020-02-29*, {{timestampdiff}}() previously returned *0*, but the correct result is *1*.
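[Editor's note] The rule described above can be sketched in a few lines of Python; this is a hypothetical reimplementation for illustration, not Gandiva's actual C++ code:

```python
import calendar
import datetime

def months_between(start: datetime.date, end: datetime.date) -> int:
    """Whole months from start to end (assumes start <= end).

    A month counts as complete when the end day reaches the start day,
    or when the end day is the last day of its month (the fix proposed
    in ARROW-9501)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    last_day_of_end_month = calendar.monthrange(end.year, end.month)[1]
    if end.day < start.day and end.day != last_day_of_end_month:
        months -= 1  # the final month has not fully elapsed
    return months

# the case from the report: Jan 31 -> Feb 29 is one full month
assert months_between(datetime.date(2020, 1, 31), datetime.date(2020, 2, 29)) == 1
# one day short of the anniversary, no full month has elapsed
assert months_between(datetime.date(2020, 1, 15), datetime.date(2020, 2, 14)) == 0
```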
[jira] [Created] (ARROW-9500) [C++] Fix segfault with std::to_string in -O3 builds on gcc 7.5.0
Wes McKinney created ARROW-9500:

Summary: [C++] Fix segfault with std::to_string in -O3 builds on gcc 7.5.0
Key: ARROW-9500
URL: https://issues.apache.org/jira/browse/ARROW-9500
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Fix For: 1.0.0

There seems to be a gcc bug related to {{std::to_string}} that only appears in {{-O3}} builds. It can be seen in something innocuous like

{code}
return Status::Invalid("Float value ", std::to_string(val),
                       " was truncated converting to ", *output.type());
{code}

where {{val}} is NaN. I haven't found a canonical reference, but using something other than to_string for the formatting (here just letting {{std::ostringstream}} take care of it) makes the problem go away. I wasn't able to reproduce the issue with gcc-8.
[jira] [Created] (ARROW-9499) [C++] AdaptiveIntBuilder::null_count does not return the null count
Kenta Murata created ARROW-9499:

Summary: [C++] AdaptiveIntBuilder::null_count does not return the null count
Key: ARROW-9499
URL: https://issues.apache.org/jira/browse/ARROW-9499
Project: Apache Arrow
Issue Type: Bug
Reporter: Kenta Murata
Assignee: Kenta Murata
[jira] [Created] (ARROW-9498) [C++][Parquet] Consider revamping RleDecoder based on "upstream" changes in Apache Impala
Wes McKinney created ARROW-9498:

Summary: [C++][Parquet] Consider revamping RleDecoder based on "upstream" changes in Apache Impala
Key: ARROW-9498
URL: https://issues.apache.org/jira/browse/ARROW-9498
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney

Since the initial code import in 2016, Impala has made some improvements to RleDecoder that we might examine to see whether they are beneficial for us. See https://github.com/apache/impala/blob/master/be/src/util/rle-encoding.h and its history.
[GitHub] [arrow-testing] wesm merged pull request #40: ARROW-9497: [C++][Parquet] Add oss-fuzz test case
wesm merged pull request #40: URL: https://github.com/apache/arrow-testing/pull/40 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-9497) [C++][Parquet] Fix failure caused by malformed repetition/definition levels
Wes McKinney created ARROW-9497:

Summary: [C++][Parquet] Fix failure caused by malformed repetition/definition levels
Key: ARROW-9497
URL: https://issues.apache.org/jira/browse/ARROW-9497
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney

Fix a case discovered by OSS-Fuzz.
[GitHub] [arrow-testing] wesm opened a new pull request #40: ARROW-9497: [C++][Parquet] Add oss-fuzz test case
wesm opened a new pull request #40:
URL: https://github.com/apache/arrow-testing/pull/40
[jira] [Created] (ARROW-9496) toArray() called on filtered Table returns all rows
Peter Murphy created ARROW-9496:

Summary: toArray() called on filtered Table returns all rows
Key: ARROW-9496
URL: https://issues.apache.org/jira/browse/ARROW-9496
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Environment: OSX 10.15.2; behavior seen in RunKit and the Node.js Jest test runner
Reporter: Peter Murphy

I am experimenting with building a library on top of Apache Arrow's JavaScript implementation, but ran into this. Example: https://runkit.com/pjm17971/pond-arrow

{code:java}
const filtered = table.filter(predicate.col("pressure").lt(28.5))
filtered.count() // 2 (correct)
{code}

However:

{code:java}
const result = filtered.toArray().map(row => row.toJSON()) // 4 rows (??)
{code}

Is this expected behavior?
[jira] [Created] (ARROW-9495) [C++] Equality assertions don't handle Inf / -Inf properly
Antoine Pitrou created ARROW-9495:

Summary: [C++] Equality assertions don't handle Inf / -Inf properly
Key: ARROW-9495
URL: https://issues.apache.org/jira/browse/ARROW-9495
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou
Fix For: 2.0.0

I got this error when working on a PR which added unit tests:

{code}
../src/arrow/testing/gtest_util.cc:101: Failure
Failed
Expected: [ 2.5, inf, -inf ]
Actual: [ 2.5, inf, -inf ]
{code}
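[Editor's note] A plausible mechanism for "equal" values failing an assertion while printing identically (my guess, not confirmed against the Arrow source): a tolerance-based approximate-equality check that subtracts the two values breaks on infinities, because the difference of equal infinities is NaN. Illustrated in plain Python:

```python
import math

inf = float("inf")

# the difference of two equal infinities is NaN, not zero
assert math.isnan(inf - inf)

# so a tolerance check built on subtraction rejects them...
assert not (abs(inf - inf) <= 1e-9)

# ...even though direct comparison says they are equal
assert inf == inf
assert -inf == -inf
```

A fix along these lines would special-case non-finite values (compare with == before falling back to the tolerance check).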
[jira] [Created] (ARROW-9494) [Rust] master fails due to use of "fXX::NAN"
Paddy Horan created ARROW-9494:

Summary: [Rust] master fails due to use of "fXX::NAN"
Key: ARROW-9494
URL: https://issues.apache.org/jira/browse/ARROW-9494
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan

I'm getting an error that no associated type exists. Changing to "std::fXX::NAN" fixes the issue.
[jira] [Created] (ARROW-9493) [Python][Dataset] Dictionary encode string partition columns by default
Ben Kietzman created ARROW-9493:

Summary: [Python][Dataset] Dictionary encode string partition columns by default
Key: ARROW-9493
URL: https://issues.apache.org/jira/browse/ARROW-9493
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 1.0.0

ARROW-9139 switched the default of use_legacy_dataset from True to False, but left dictionary encoding of string partition columns off by default.
[jira] [Created] (ARROW-9492) [C++] Ensure private functions are static or in an anonymous namespace
Ben Kietzman created ARROW-9492:

Summary: [C++] Ensure private functions are static or in an anonymous namespace
Key: ARROW-9492
URL: https://issues.apache.org/jira/browse/ARROW-9492
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman

There are a number of functions which are not intended to be exported (for example, they are defined in a {{.cc}} file) but are not marked {{static inline}} or declared in an anonymous namespace. This can lead to surprising link errors. Existing private functions should be marked appropriately, and ideally a linter could be added to ensure new ones are not introduced without appropriate markings.
[jira] [Created] (ARROW-9491) [Rust] "simd" feature is not tested in CI
Paddy Horan created ARROW-9491:

Summary: [Rust] "simd" feature is not tested in CI
Key: ARROW-9491
URL: https://issues.apache.org/jira/browse/ARROW-9491
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan
[jira] [Created] (ARROW-9490) pyarrow array creation for specific set of numpy scalars fails
Ramakrishna Prabhu created ARROW-9490:

Summary: pyarrow array creation for specific set of numpy scalars fails
Key: ARROW-9490
URL: https://issues.apache.org/jira/browse/ARROW-9490
Project: Apache Arrow
Issue Type: Bug
Environment: conda
Reporter: Ramakrishna Prabhu

When creating an array from a list of numpy scalars, pyarrow fails with the message 'Integer scalar type not recognized'; details below:

{code:python}
>>> import pyarrow as pa
>>> import numpy as np
>>> pa.array([np.int32(4), np.float64(1.5), np.float32(1.290994), np.int8(0)])
Traceback (most recent call last):
  File "", line 1, in
  File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer scalar type not recognized
{code}
[jira] [Created] (ARROW-9489) [C++] Add fill_null kernel implementation for (array[string], scalar[string])
Uwe Korn created ARROW-9489:

Summary: [C++] Add fill_null kernel implementation for (array[string], scalar[string])
Key: ARROW-9489
URL: https://issues.apache.org/jira/browse/ARROW-9489
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Uwe Korn
Fix For: 2.0.0
[jira] [Created] (ARROW-9488) [Release] Use the new changelog generation when updating the website
Krisztian Szucs created ARROW-9488:

Summary: [Release] Use the new changelog generation when updating the website
Key: ARROW-9488
URL: https://issues.apache.org/jira/browse/ARROW-9488
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
Fix For: 2.0.0

The following command updates CHANGELOG.md, but the same content should also be added as release notes in the post-03-website.sh script. See the TODO note at https://github.com/apache/arrow/pull/7162/files#diff-58442bc78393d2113825def6aad913a0R143

{code}
archery release changelog add 1.0.0
{code}
[jira] [Created] (ARROW-9487) [Developer] Cover the archery release utilities with unittests
Krisztian Szucs created ARROW-9487:

Summary: [Developer] Cover the archery release utilities with unittests
Key: ARROW-9487
URL: https://issues.apache.org/jira/browse/ARROW-9487
Project: Apache Arrow
Issue Type: Improvement
Components: Archery, Developer Tools
Reporter: Krisztian Szucs
Fix For: 2.0.0

Deferring the unit tests of https://github.com/apache/arrow/pull/7162 to this JIRA.
[jira] [Created] (ARROW-9486) [C++][Dataset] Support implicit casting InExpression::set_ to dict
Ben Kietzman created ARROW-9486:

Summary: [C++][Dataset] Support implicit casting InExpression::set_ to dict
Key: ARROW-9486
URL: https://issues.apache.org/jira/browse/ARROW-9486
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.17.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 1.0.0

{{test_filters_inclusive_set}} is still failing due to lack of support for casting to dictionary. Add fallbacks to DictionaryEncode if conversion to a dictionary array is required.
[jira] [Created] (ARROW-9485) [R] Better shared library stripping
Neal Richardson created ARROW-9485:

Summary: [R] Better shared library stripping
Key: ARROW-9485
URL: https://issues.apache.org/jira/browse/ARROW-9485
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 1.0.0
[jira] [Created] (ARROW-9484) [Docs] Update is* functions to be is_* in the compute docs
Neal Richardson created ARROW-9484:

Summary: [Docs] Update is* functions to be is_* in the compute docs
Key: ARROW-9484
URL: https://issues.apache.org/jira/browse/ARROW-9484
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: Neal Richardson
Assignee: Neal Richardson
Fix For: 1.0.0

Follow-up to the follow-up, ARROW-9390.
[jira] [Created] (ARROW-9483) [C++] Reorganize testing headers
Antoine Pitrou created ARROW-9483:

Summary: [C++] Reorganize testing headers
Key: ARROW-9483
URL: https://issues.apache.org/jira/browse/ARROW-9483
Project: Apache Arrow
Issue Type: Wish
Components: C++
Reporter: Antoine Pitrou
Fix For: 2.0.0

Currently, {{gtest_util.h}} contains a hodge-podge of different things. It would be nice if things were separated a bit more, for example an {{asserts.h}} file for all home-grown assertion functions and macros.
[jira] [Created] (ARROW-9482) [Rust] [DataFusion] Implement pretty print for physical query plan
Andy Grove created ARROW-9482:

Summary: [Rust] [DataFusion] Implement pretty print for physical query plan
Key: ARROW-9482
URL: https://issues.apache.org/jira/browse/ARROW-9482
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove

Implement pretty print for the physical query plan, similar to what we have for the logical plan.
[jira] [Created] (ARROW-9481) [Rust] [DataFusion] Create physical plan enum to wrap execution plan
Andy Grove created ARROW-9481:

Summary: [Rust] [DataFusion] Create physical plan enum to wrap execution plan
Key: ARROW-9481
URL: https://issues.apache.org/jira/browse/ARROW-9481
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove

By wrapping the execution plan structs in an enum, we make it possible to build a tree representing the physical plan, just like we do with the logical plan. This makes it easy to print physical plans and also to apply transformations to them.

{code}
pub enum PhysicalPlan {
    /// Projection.
    Projection(Arc<ProjectionExec>),
    /// Filter a.k.a. predicate.
    Filter(Arc<FilterExec>),
    /// Hash aggregate.
    HashAggregate(Arc<HashAggregateExec>),
    /// Performs a hash join of two child relations by first shuffling the data using the join keys.
    ShuffledHashJoin(ShuffledHashJoinExec),
    /// Performs a shuffle that will result in the desired partitioning.
    ShuffleExchange(Arc<ShuffleExchangeExec>),
    /// Reads results from a ShuffleExchange.
    ShuffleReader(Arc<ShuffleReaderExec>),
    /// Scans a partitioned data source.
    ParquetScan(Arc<ParquetScanExec>),
    /// Scans an in-memory table.
    InMemoryTableScan(Arc<InMemoryTableScanExec>),
}
{code}
[jira] [Created] (ARROW-9480) [Rust] [DataFusion] All DataFusion execution plan traits should require Send + Sync
Andy Grove created ARROW-9480:

Summary: [Rust] [DataFusion] All DataFusion execution plan traits should require Send + Sync
Key: ARROW-9480
URL: https://issues.apache.org/jira/browse/ARROW-9480
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove

All DataFusion execution plan traits should require Send + Sync, to prepare for async support.
[jira] [Created] (ARROW-9479) [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto Table.empty
Nicholas Roberts created ARROW-9479:

Summary: [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto Table.empty
Key: ARROW-9479
URL: https://issues.apache.org/jira/browse/ARROW-9479
Project: Apache Arrow
Issue Type: Bug
Components: JavaScript
Affects Versions: 0.17.1
Reporter: Nicholas Roberts

Deserializing zero-item tables (as generated by Table.empty or, in this case, pyarrow.Schema.serialize) whose schema contains a List, FixedSizeList or Map fails due to an unconditional

{code:java}
new Data(/* preceding parameters */ buffers, [childData])
{code}

statement: the childData parameter resolves to [undefined] rather than the desired [].
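[Editor's note] The trap described above is easy to reproduce in any language: unconditionally wrapping a possibly-missing child in a literal yields a one-element collection, not an empty one. A Python analogue (the variable names are illustrative only):

```python
child_data = None  # what a zero-item table's child resolves to

# unconditional wrapping, as in the reported code path:
wrong = [child_data]
assert wrong == [None]   # one element, not empty
assert len(wrong) == 1

# guarded wrapping gives the desired empty list:
right = [] if child_data is None else [child_data]
assert right == []
```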
[jira] [Created] (ARROW-9478) [C++] Improve error message on unsupported cast types
Antoine Pitrou created ARROW-9478:

Summary: [C++] Improve error message on unsupported cast types
Key: ARROW-9478
URL: https://issues.apache.org/jira/browse/ARROW-9478
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Currently, the error message when trying an unsupported cast looks like this:

{code}
No cast function available to cast to dictionary
{code}

It would be more informative if the source type was also mentioned.
[jira] [Created] (ARROW-9477) [C++] Fix test case TestSchemaMetadata.MetadataVersionForwardCompatibility
Liya Fan created ARROW-9477:

Summary: [C++] Fix test case TestSchemaMetadata.MetadataVersionForwardCompatibility
Key: ARROW-9477
URL: https://issues.apache.org/jira/browse/ARROW-9477
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Liya Fan

Test case TestSchemaMetadata.MetadataVersionForwardCompatibility is failing on the master branch.
[jira] [Created] (ARROW-9476) [C++][Dataset] HivePartitioning discovery with dictionary types fails for multiple fields
Joris Van den Bossche created ARROW-9476:

Summary: [C++][Dataset] HivePartitioning discovery with dictionary types fails for multiple fields
Key: ARROW-9476
URL: https://issues.apache.org/jira/browse/ARROW-9476
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

Apparently, ARROW-9288 did not fully / correctly fix the issue. With a single string partition field it now works fine, but once you have multiple string fields you get parsing errors. A reproducible example:

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['d', 'e', 'f'], dtype=object)
N = 30

table = pa.table({
    'foo': foo_keys.repeat(10),
    'bar': np.tile(np.tile(bar_keys, 5), 2),
    'values': np.random.randn(N),
})

base_path = "test_partition_directories3"
pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

# works
ds.dataset(base_path, partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset(base_path, partitioning=part)
{code}

cc [~bkietz]
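[Editor's note] For background, hive-style partitioning encodes field values in directory names, so discovery has to parse every `key=value` segment of each file path; with dictionary encoding enabled, each field additionally accumulates a dictionary of the values seen across all paths. A simplified pure-Python sketch of the parsing step (illustrative only, not the actual C++ implementation):

```python
def parse_hive_segments(path: str) -> dict:
    """Extract key=value directory segments from a partitioned file path."""
    fields = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = value
    return fields

# two partition fields, as in the failing example above
path = "test_partition_directories3/bar=d/foo=a/part-0.parquet"
assert parse_hive_segments(path) == {"bar": "d", "foo": "a"}
```

The bug report suggests the failure is specific to combining this multi-field parsing with dictionary-typed partition fields, since plain string discovery over the same paths works.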
[jira] [Created] (ARROW-9475) Clean up usages of BaseAllocator, use BufferAllocator instead
Hongze Zhang created ARROW-9475:

Summary: Clean up usages of BaseAllocator, use BufferAllocator instead
Key: ARROW-9475
URL: https://issues.apache.org/jira/browse/ARROW-9475
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 0.17.0
Reporter: Hongze Zhang
Assignee: Hongze Zhang

Some classes' methods use BaseAllocator, or internally cast BufferAllocator to BaseAllocator, instead of requiring BufferAllocator directly; see, e.g., the code in AllocationManager and BufferLedger. This can be improved by exposing the necessary methods on BufferAllocator.