[jira] [Created] (ARROW-15441) [C++][Compute] hash_count aggregation of a null type column is incorrect
Chenxi Li created ARROW-15441: Summary: [C++][Compute] hash_count aggregation of a null type column is incorrect Key: ARROW-15441 URL: https://issues.apache.org/jira/browse/ARROW-15441 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Chenxi Li Assignee: Chenxi Li

The result of hash_count on such an array is incorrect:

||argument||key||
|NULL|1|
|NULL|1|

||CountOptions||Expected||Actual||
|ALL|2|2|
|ONLY_VALID|0|2|
|ONLY_NULL|2|0|

-- This message was sent by Atlassian Jira (v8.20.1#820001)
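The expected CountOptions behavior can be modeled in plain Python — a hypothetical sketch of grouped-count semantics, not Arrow's implementation; `None` stands in for a null value:

```python
# Plain-Python model of grouped count (hash_count) semantics.
# Hypothetical sketch, not Arrow's implementation. `mode` mirrors
# CountOptions: "all", "only_valid", or "only_null".
def hash_count(arguments, keys, mode="only_valid"):
    counts = {key: 0 for key in keys}
    for value, key in zip(arguments, keys):
        if (mode == "all"
                or (mode == "only_valid" and value is not None)
                or (mode == "only_null" and value is None)):
            counts[key] += 1
    return counts

# Two null arguments grouped under key 1, as in the report:
hash_count([None, None], [1, 1], mode="all")         # {1: 2}
hash_count([None, None], [1, 1], mode="only_valid")  # {1: 0} expected; bug reports 2
hash_count([None, None], [1, 1], mode="only_null")   # {1: 2} expected; bug reports 0
```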
[jira] [Created] (ARROW-15440) [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly
Yuqi Gu created ARROW-15440: Summary: [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly Key: ARROW-15440 URL: https://issues.apache.org/jira/browse/ARROW-15440 Project: Apache Arrow Issue Type: Task Components: Go Reporter: Yuqi Gu Assignee: Yuqi Gu

Implement 'unpack_bool' with Arm64 Go assembly:

{code}
bytes_to_bools_neon
{code}
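For reference, a plain-Python sketch of the scalar operation such a NEON routine would accelerate, assuming Arrow's LSB-first bitmap layout (the function name here is illustrative, not the Go API):

```python
# Scalar sketch of 'unpack_bool': expand a packed LSB-first bitmap into
# one Python bool per bit. Illustrative only; the real kernel works on
# Arrow buffers, and the Arm64 assembly vectorizes this loop.
def unpack_bool(packed: bytes, n_bits: int) -> list:
    return [bool((packed[i >> 3] >> (i & 7)) & 1) for i in range(n_bits)]

unpack_bool(bytes([0b00000101]), 3)  # [True, False, True]
```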
[jira] [Created] (ARROW-15439) [Release] Update .deb/.rpm changelogs after release
Kouhei Sutou created ARROW-15439: Summary: [Release] Update .deb/.rpm changelogs after release Key: ARROW-15439 URL: https://issues.apache.org/jira/browse/ARROW-15439 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-15438) [Python] Flaky test test_write_dataset_max_open_files
David Li created ARROW-15438: Summary: [Python] Flaky test test_write_dataset_max_open_files Key: ARROW-15438 URL: https://issues.apache.org/jira/browse/ARROW-15438 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: David Li

Found during 7.0.0 verification:

{noformat}
pyarrow/tests/test_dataset.py::test_write_dataset_max_open_files FAILED [ 30%]

tempdir = PosixPath('/tmp/pytest-of-root/pytest-1/test_write_dataset_max_open_fi0')

    def test_write_dataset_max_open_files(tempdir):
        directory = tempdir / 'ds'
        file_format = "parquet"
        partition_column_id = 1
        column_names = ['c1', 'c2']
        record_batch_1 = pa.record_batch(data=[[1, 2, 3, 4, 0, 10],
                                               ['a', 'b', 'c', 'd', 'e', 'a']],
                                         names=column_names)
        record_batch_2 = pa.record_batch(data=[[5, 6, 7, 8, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'c']],
                                         names=column_names)
        record_batch_3 = pa.record_batch(data=[[9, 10, 11, 12, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'd']],
                                         names=column_names)
        record_batch_4 = pa.record_batch(data=[[13, 14, 15, 16, 0, 1],
                                               ['a', 'b', 'c', 'd', 'e', 'b']],
                                         names=column_names)
        table = pa.Table.from_batches([record_batch_1, record_batch_2,
                                       record_batch_3, record_batch_4])
        partitioning = ds.partitioning(
            pa.schema([(column_names[partition_column_id], pa.string())]),
            flavor="hive")
        data_source_1 = directory / "default"
        ds.write_dataset(data=table, base_dir=data_source_1,
                         partitioning=partitioning, format=file_format)

        # Here we consider the number of unique partitions created when
        # partitioning column contains duplicate records.
        # Returns: (number_of_files_generated, number_of_partitions)
        def _get_compare_pair(data_source, record_batch, file_format, col_id):
            num_of_files_generated = _get_num_of_files_generated(
                base_directory=data_source, file_format=file_format)
            number_of_partitions = len(pa.compute.unique(record_batch[col_id]))
            return num_of_files_generated, number_of_partitions

        # CASE 1: when max_open_files=default & max_open_files >= num_of_partitions
        # In case of a writing to disk via partitioning based on a
        # particular column (considering row labels in that column),
        # the number of unique rows must be equal
        # to the number of files generated
        num_of_files_generated, number_of_partitions \
            = _get_compare_pair(data_source_1, record_batch_1, file_format,
                                partition_column_id)
        assert num_of_files_generated == number_of_partitions

        # CASE 2: when max_open_files > 0 & max_open_files < num_of_partitions
        # the number of files generated must be greater than the number of
        # partitions
        data_source_2 = directory / "max_1"
        max_open_files = 3
        ds.write_dataset(data=table, base_dir=data_source_2,
                         partitioning=partitioning, format=file_format,
                         max_open_files=max_open_files)
        num_of_files_generated, number_of_partitions \
            = _get_compare_pair(data_source_2, record_batch_1, file_format,
                                partition_column_id)
>       assert num_of_files_generated > number_of_partitions
E       assert 5 > 5

pyarrow/tests/test_dataset.py:3807: AssertionError
{noformat}
[jira] [Created] (ARROW-15437) [Python][FlightRPC] Flaky test test_interrupt
David Li created ARROW-15437: Summary: [Python][FlightRPC] Flaky test test_interrupt Key: ARROW-15437 URL: https://issues.apache.org/jira/browse/ARROW-15437 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Python Reporter: David Li

Found during 7.0.0 verification; it seems we aren't accounting for all possible ways to find the exception we expect:

{noformat}
pyarrow/tests/test_flight.py::test_interrupt FAILED [ 93%]

    def test_interrupt():
        if threading.current_thread().ident != threading.main_thread().ident:
            pytest.skip("test only works from main Python thread")
        # Skips test if not available
        raise_signal = util.get_raise_signal()

        def signal_from_thread():
            time.sleep(0.5)
            raise_signal(signal.SIGINT)

        exc_types = (KeyboardInterrupt, pa.ArrowCancelled)

        def test(read_all):
            try:
                try:
                    t = threading.Thread(target=signal_from_thread)
                    with pytest.raises(exc_types) as exc_info:
                        t.start()
                        read_all()
                finally:
                    t.join()
            except KeyboardInterrupt:
                # In case KeyboardInterrupt didn't interrupt read_all
                # above, at least prevent it from stopping the test suite
                pytest.fail("KeyboardInterrupt didn't interrupt Flight read_all")
            e = exc_info.value.__context__
            assert isinstance(e, pa.ArrowCancelled) or \
                isinstance(e, KeyboardInterrupt)

        with CancelFlightServer() as server:
            client = FlightClient(("localhost", server.port))
            reader = client.do_get(flight.Ticket(b""))
>           test(reader.read_all)

pyarrow/tests/test_flight.py:1952:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

            e = exc_info.value.__context__
>           assert isinstance(e, pa.ArrowCancelled) or \
                isinstance(e, KeyboardInterrupt)
E           AssertionError: assert (False or False)
E            +  where False = isinstance(None, pa.ArrowCancelled)
E            +  and False = isinstance(None, KeyboardInterrupt)

pyarrow/tests/test_flight.py:1945: AssertionError

(Pdb) p e
None
(Pdb) p exc_info.value
ArrowCancelled('Operation cancelled. Detail: received signal 2')
(Pdb) p exc_info.value.__context__
None
{noformat}
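A plausible direction for the fix is to accept the expected exception whether it was raised directly (as in the Pdb session above, where `exc_info.value` is the ArrowCancelled and its `__context__` is None) or chained from another exception. A stdlib-only sketch, with `ValueError` standing in for `pa.ArrowCancelled`:

```python
# Check both the captured exception and its __context__, since the expected
# error may be raised directly (leaving __context__ as None) or chained.
# ValueError stands in for pa.ArrowCancelled in this sketch.
def is_expected(exc, exc_types=(KeyboardInterrupt, ValueError)):
    return isinstance(exc, exc_types) or isinstance(exc.__context__, exc_types)

is_expected(ValueError("Operation cancelled"))  # True: matched directly
is_expected(RuntimeError("unrelated"))          # False
```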
[jira] [Created] (ARROW-15436) [Release][Python] Disable verification of gdb tests on windows and a flaky test on apple M1
Krisztian Szucs created ARROW-15436: Summary: [Release][Python] Disable verification of gdb tests on windows and a flaky test on apple M1 Key: ARROW-15436 URL: https://issues.apache.org/jira/browse/ARROW-15436 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 8.0.0 See the verification problems that occurred in https://github.com/apache/arrow/pull/12235
[jira] [Created] (ARROW-15435) [C++][Doc] Improve API docs coverage
Antoine Pitrou created ARROW-15435: Summary: [C++][Doc] Improve API docs coverage Key: ARROW-15435 URL: https://issues.apache.org/jira/browse/ARROW-15435 Project: Apache Arrow Issue Type: Task Components: C++, Documentation Reporter: Antoine Pitrou We don't often update the API docs when adding new APIs, so chances are some classes or functions are not exposed in the API docs. We should make a pass to find the missing APIs and add them.
[jira] [Created] (ARROW-15434) `add_column` method missing in pyarrow.RecordBatch
Niranda Perera created ARROW-15434: Summary: `add_column` method missing in pyarrow.RecordBatch Key: ARROW-15434 URL: https://issues.apache.org/jira/browse/ARROW-15434 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Niranda Perera The `add_column` method is missing from the `pyarrow.RecordBatch` object (it's available for `Table`, though). I think it's a simple yet important piece of functionality :)
[jira] [Created] (ARROW-15433) [Doc] Warnings when building docs
Antoine Pitrou created ARROW-15433: Summary: [Doc] Warnings when building docs Key: ARROW-15433 URL: https://issues.apache.org/jira/browse/ARROW-15433 Project: Apache Arrow Issue Type: Bug Components: Documentation Reporter: Antoine Pitrou

{code}
/home/antoine/miniconda3/envs/pyarrow/lib/python3.9/site-packages/numpydoc/docscrape.py:418: UserWarning: Unknown section Results in the docstring of in None.
  warn(msg)
/home/antoine/miniconda3/envs/pyarrow/lib/python3.9/site-packages/numpydoc/docscrape.py:418: UserWarning: Unknown section Results in the docstring of in None.
  warn(msg)
/home/antoine/arrow/dev/docs/source/developers/guide/resources.rst:50: WARNING: Bullet list ends without a blank line; unexpected unindent.
/home/antoine/arrow/dev/docs/source/python/api/dataset.rst:46: WARNING: autosummary: failed to import ORCFileFormat
/home/antoine/arrow/dev/docs/source/python/api/dataset.rst:46: WARNING: autosummary: failed to import FragmentScanOptions
/home/antoine/arrow/dev/docs/source/python/api/plasma.rst:28: WARNING: autosummary: failed to import ObjectID
/home/antoine/arrow/dev/docs/source/python/api/plasma.rst:28: WARNING: autosummary: failed to import PlasmaClient
/home/antoine/arrow/dev/docs/source/python/api/plasma.rst:28: WARNING: autosummary: failed to import PlasmaBuffer
/home/antoine/arrow/dev/docs/source/cpp/api/compute.rst:51: WARNING: Invalid C++ declaration: Expected identifier in nested name, got keyword: static [error at 23]
  static constexpr static char const kTypeName [] = "ScalarAggregateOptions"
  ---^
(the same "Invalid C++ declaration" warning repeats at compute.rst:51 for "CountOptions", "ModeOptions", "VarianceOptions", "QuantileOptions", "TDigestOptions", "IndexOptions", "ArithmeticOptions", "ElementWiseAggregateOptions", "RoundOptions", "RoundTemporalOptions", "RoundToMultipleOptions", and "JoinOptions"; the pasted output is truncated here)
{code}
[jira] [Created] (ARROW-15432) [Python] Address CSV docstrings
Alessandro Molina created ARROW-15432: Summary: [Python] Address CSV docstrings Key: ARROW-15432 URL: https://issues.apache.org/jira/browse/ARROW-15432 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Assignee: Alenka Frim Fix For: 8.0.0 Ensure /docs/python/generated/pyarrow.csv.read_csv.html has an {{Examples}} section
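The shape being asked for, sketched on a hypothetical function rather than the real `pyarrow.csv.read_csv` docstring:

```python
# Hypothetical numpydoc-style docstring illustrating the desired
# "Examples" section (not pyarrow's actual read_csv docstring).
def read_csv(path):
    """Read a CSV file into a Table.

    Examples
    --------
    >>> from pyarrow import csv
    >>> table = csv.read_csv("data.csv")  # doctest: +SKIP
    """
```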
[jira] [Created] (ARROW-15431) [Python] Address docstrings in Schema
Alessandro Molina created ARROW-15431: Summary: [Python] Address docstrings in Schema Key: ARROW-15431 URL: https://issues.apache.org/jira/browse/ARROW-15431 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Assignee: Alenka Frim Fix For: 8.0.0 Ensure all docstrings of classes and methods in /docs/python/generated/pyarrow.Schema.html have an {{Examples}} section.
[jira] [Created] (ARROW-15430) [Python] Address docstrings in Filesystem classes and functions
Alessandro Molina created ARROW-15430: Summary: [Python] Address docstrings in Filesystem classes and functions Key: ARROW-15430 URL: https://issues.apache.org/jira/browse/ARROW-15430 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Assignee: Alenka Frim Fix For: 8.0.0 Ensure all docstrings in https://arrow.apache.org/docs/python/api/files.html and https://arrow.apache.org/docs/python/api/filesystems.html have an {{Examples}} section
[jira] [Created] (ARROW-15429) [Python] Address docstrings in Table and related classes and functions
Alessandro Molina created ARROW-15429: Summary: [Python] Address docstrings in Table and related classes and functions Key: ARROW-15429 URL: https://issues.apache.org/jira/browse/ARROW-15429 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Assignee: Alenka Frim Fix For: 8.0.0 Address docstrings of all classes and functions mentioned in https://arrow.apache.org/docs/python/api/tables.html
[jira] [Created] (ARROW-15428) [Python] Address docstrings in Parquet classes and functions
Alessandro Molina created ARROW-15428: Summary: [Python] Address docstrings in Parquet classes and functions Key: ARROW-15428 URL: https://issues.apache.org/jira/browse/ARROW-15428 Project: Apache Arrow Issue Type: Sub-task Components: Python Reporter: Alessandro Molina Assignee: Alenka Frim Fix For: 8.0.0 Address docstrings of all classes and functions referenced in https://arrow.apache.org/docs/python/api/formats.html#parquet-files
[jira] [Created] (ARROW-15427) [C++][Gandiva] Dead lock in cache
Nate Clark created ARROW-15427: Summary: [C++][Gandiva] Deadlock in cache Key: ARROW-15427 URL: https://issues.apache.org/jira/browse/ARROW-15427 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Nate Clark Assignee: Nate Clark Encountered a deadlock a few times while trying to lock the mutex in gandiva::Cache. Not sure how this was occurring, but using `std::lock_guard` to hold the lock stopped it from happening.
[jira] [Created] (ARROW-15426) [C++] [Gandiva] InExpression validation does not support date/time types
Nate Clark created ARROW-15426: Summary: [C++] [Gandiva] InExpression validation does not support date/time types Key: ARROW-15426 URL: https://issues.apache.org/jira/browse/ARROW-15426 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Nate Clark Assignee: Nate Clark Expression validation for InExpression does not properly handle an IN expression created for date/time types: it requires that the node's return_type be the primitive type rather than the date/time type. The reported error is: {{Evaluation expression for IN clause returns timestamp[ms] values are of type int64}}
[jira] [Created] (ARROW-15425) [Integration] Add delta dictionaries in file format to integration tests
David Li created ARROW-15425: Summary: [Integration] Add delta dictionaries in file format to integration tests Key: ARROW-15425 URL: https://issues.apache.org/jira/browse/ARROW-15425 Project: Apache Arrow Issue Type: Improvement Components: C++, Integration Reporter: David Li ARROW-13467 enables delta dictionary support in the IPC file format for C++ so we should cover it in integration tests as well (and presumably add it to the 'golden' test files?)
[jira] [Created] (ARROW-15424) [GLib] Update bindings for MemoryManager::AllocateBuffer
David Li created ARROW-15424: Summary: [GLib] Update bindings for MemoryManager::AllocateBuffer Key: ARROW-15424 URL: https://issues.apache.org/jira/browse/ARROW-15424 Project: Apache Arrow Issue Type: Bug Components: GLib Reporter: David Li Assignee: David Li Caused by ARROW-15373.
[jira] [Created] (ARROW-15423) [C++][Dev] Make GDB plugin auto-load friendly
Antoine Pitrou created ARROW-15423: Summary: [C++][Dev] Make GDB plugin auto-load friendly Key: ARROW-15423 URL: https://issues.apache.org/jira/browse/ARROW-15423 Project: Apache Arrow Issue Type: Task Components: C++, Developer Tools Reporter: Antoine Pitrou Fix For: 8.0.0

Ideally the GDB script should be usable both as an auto-load script and as a target for manual {{source}} invocations. The GLib pretty-printer does it like this:

{code:python}
def register(obj):
    if obj is None:
        obj = gdb
    obj.pretty_printers.append(pretty_printer_lookup)

register(gdb.current_objfile())
{code}
[jira] [Created] (ARROW-15422) [C++][Packaging] Package gdb plugin
Antoine Pitrou created ARROW-15422: Summary: [C++][Packaging] Package gdb plugin Key: ARROW-15422 URL: https://issues.apache.org/jira/browse/ARROW-15422 Project: Apache Arrow Issue Type: Task Components: C++, Packaging Reporter: Antoine Pitrou Fix For: 8.0.0 The gdb plugin should be packaged appropriately; I'm not sure where or in which form.