[jira] [Created] (ARROW-10708) [Packaging][deb] Add support for Ubuntu 20.10
Kouhei Sutou created ARROW-10708:

Summary: [Packaging][deb] Add support for Ubuntu 20.10
Key: ARROW-10708
URL: https://issues.apache.org/jira/browse/ARROW-10708
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10707) [Python][Parquet] Enhance hive partition filtering with 'like' operator
Weiyang Zhao created ARROW-10707:

Summary: [Python][Parquet] Enhance hive partition filtering with 'like' operator
Key: ARROW-10707
URL: https://issues.apache.org/jira/browse/ARROW-10707
Project: Apache Arrow
Issue Type: Improvement
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao
[jira] [Created] (ARROW-10706) [Python][Parquet] When filters match no partition, an index out of range error is thrown
Weiyang Zhao created ARROW-10706:

Summary: [Python][Parquet] When filters match no partition, an index out of range error is thrown
Key: ARROW-10706
URL: https://issues.apache.org/jira/browse/ARROW-10706
Project: Apache Arrow
Issue Type: Bug
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao

The code below raises an IndexError:
{code:java}
dataset = pq.ParquetDataset(
    base_path, filesystem=fs,
    filters=[('string', '=', "notExisted")],
    use_legacy_dataset=True
)
{code}
[jira] [Created] (ARROW-10705) [Rust] Lifetime annotations in the IPC writer are too strict, preventing code reuse
Carol Nichols created ARROW-10705:

Summary: [Rust] Lifetime annotations in the IPC writer are too strict, preventing code reuse
Key: ARROW-10705
URL: https://issues.apache.org/jira/browse/ARROW-10705
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Carol Nichols
Assignee: Carol Nichols

I will illustrate and explain more in the PR I'm about to open.
[jira] [Created] (ARROW-10704) Remove Nested from expression enum
Daniël Heres created ARROW-10704:

Summary: Remove Nested from expression enum
Key: ARROW-10704
URL: https://issues.apache.org/jira/browse/ARROW-10704
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Daniël Heres

Remove Nested from the expression enum. It's not needed and is never produced or used.
[jira] [Created] (ARROW-10703) [Rust] [DataFusion] Make join not collect left on every part
Jorge Leitão created ARROW-10703:

Summary: [Rust] [DataFusion] Make join not collect left on every part
Key: ARROW-10703
URL: https://issues.apache.org/jira/browse/ARROW-10703
Project: Apache Arrow
Issue Type: Improvement
Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão
[jira] [Created] (ARROW-10702) [C++] Micro-optimize integer parsing
Antoine Pitrou created ARROW-10702:

Summary: [C++] Micro-optimize integer parsing
Key: ARROW-10702
URL: https://issues.apache.org/jira/browse/ARROW-10702
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou

It might be possible to optimize integer and decimal parsing using the following tricks from the {{fast_float}} library:
https://github.com/lemire/fast_float/blob/70c9b7f884c7f80a9a0e06fa9754c0a2e6a9492e/include/fast_float/ascii_number.h#L18-L38
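To make the linked trick concrete, here is a Python sketch of the eight-digit SWAR parse used by fast_float (the function name is mine, not an Arrow or fast_float API; Arrow's version would be C++ operating on a 64-bit load):

```python
def parse_eight_digits_swar(s: bytes) -> int:
    """Parse exactly 8 ASCII digits via SWAR arithmetic on one 64-bit word.

    Mirrors the fast_float trick: subtract '0' from every byte at once,
    then combine digit pairs with a few multiplications instead of a
    per-character loop. Assumes a little-endian byte load, as on x86.
    """
    assert len(s) == 8 and s.isdigit()
    mask64 = (1 << 64) - 1
    val = int.from_bytes(s, "little")
    val -= 0x3030303030303030               # each byte: ASCII digit -> value
    val = (val * 10 + (val >> 8)) & mask64  # even bytes now hold 2-digit pairs
    mask = 0x000000FF000000FF
    val = (((val & mask) * (100 + (1000000 << 32))) +
           (((val >> 16) & mask) * (1 + (10000 << 32)))) & mask64
    return val >> 32  # e.g. b"12345678" -> 12345678
```

In C++ the masking is free (64-bit unsigned arithmetic wraps), so the whole parse is a handful of instructions with no branches per digit.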
[jira] [Created] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression
Jörn Horstmann created ARROW-10701:

Summary: [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression
Key: ARROW-10701
URL: https://issues.apache.org/jira/browse/ARROW-10701
Project: Apache Arrow
Issue Type: Bug
Reporter: Jörn Horstmann

I probably introduced this bug some time ago, but there was another bug in the benchmark setup that caused the query to only be planned, not executed.

DataFusion should probably also support queries like:
{code:java}
SELECT foo, bar FROM table ORDER BY 1, 2
{code}
But for now, the easiest fix for the benchmark is to specify the column name instead of the index.
[jira] [Created] (ARROW-10700) [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC
Antoine Pitrou created ARROW-10700:

Summary: [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC
Key: ARROW-10700
URL: https://issues.apache.org/jira/browse/ARROW-10700
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Antoine Pitrou

Seen on GitHub Actions:
https://github.com/apache/arrow/pull/8716/checks?check_run_id=1442252599#step:7:792
{code}
Generating Code...
level_comparison_avx2.cc
cl : command line warning D9002: ignoring unknown option '-mbmi2' [D:\a\arrow\arrow\build\cpp\src\parquet\parquet_shared.vcxproj]
level_conversion_bmi2.cc
{code}
This may affect performance too.
[jira] [Created] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian
Antoine Pitrou created ARROW-10699:

Summary: [C++] BitmapUInt64Reader doesn't work on big-endian
Key: ARROW-10699
URL: https://issues.apache.org/jira/browse/ARROW-10699
Project: Apache Arrow
Issue Type: Bug
Components: C++, Continuous Integration
Reporter: Antoine Pitrou

I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to fail).
https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534
[jira] [Created] (ARROW-10698) [C++] Optimize union equality comparison
Antoine Pitrou created ARROW-10698:

Summary: [C++] Optimize union equality comparison
Key: ARROW-10698
URL: https://issues.apache.org/jira/browse/ARROW-10698
Project: Apache Arrow
Issue Type: Wish
Components: C++
Reporter: Antoine Pitrou

Currently, union array comparison in {{ArrayRangeEqual}} computes child equality over single union elements. This adds a large per-element comparison overhead. At least for sparse unions, it may be beneficial to detect contiguous runs of child ids and run child comparisons on entire runs.
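The run-detection step could look like the following Python sketch (the real implementation would be C++ over the raw int8 type-ids buffer; this only illustrates the idea):

```python
def contiguous_runs(type_ids):
    """Return (child_id, start, length) for each contiguous run of equal
    child ids in a sparse union's type-ids buffer.

    A comparison kernel could then compare whole child-array slices per
    run instead of dispatching a child comparison per element.
    """
    runs = []
    i, n = 0, len(type_ids)
    while i < n:
        start = i
        # extend the run while the child id stays the same
        while i < n and type_ids[i] == type_ids[start]:
            i += 1
        runs.append((type_ids[start], start, i - start))
    return runs
```

For data where the same child dominates long stretches, this turns O(length) per-element dispatches into a handful of ranged comparisons.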
[jira] [Created] (ARROW-10697) [C++] Consolidate bitmap word readers
Antoine Pitrou created ARROW-10697:

Summary: [C++] Consolidate bitmap word readers
Key: ARROW-10697
URL: https://issues.apache.org/jira/browse/ARROW-10697
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

We currently have {{BitmapWordReader}}, {{BitmapUInt64Reader}} and {{Bitmap::VisitWords}}. We should try to consolidate them, assuming benchmarks don't regress.
[jira] [Created] (ARROW-10696) [C++] Investigate a bit run reader that would only return runs of set bits
Antoine Pitrou created ARROW-10696:

Summary: [C++] Investigate a bit run reader that would only return runs of set bits
Key: ARROW-10696
URL: https://issues.apache.org/jira/browse/ARROW-10696
Project: Apache Arrow
Issue Type: Task
Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

Follow-up to the PR discussion:
https://github.com/apache/arrow/pull/8703#discussion_r526263665
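A set-bits-only run reader could behave like this Python sketch (bit-at-a-time here for clarity; Arrow's version would process the validity bitmap word-at-a-time in C++):

```python
def set_bit_runs(bits, length):
    """Yield (position, length) for each run of set bits in an integer
    bitmap of `length` bits, bit 0 being the least significant.

    Cleared bits are skipped entirely, so callers only ever see runs of
    set bits -- the behavior the proposed reader would provide.
    """
    i = 0
    while i < length:
        while i < length and not (bits >> i) & 1:
            i += 1                      # skip a run of cleared bits
        start = i
        while i < length and (bits >> i) & 1:
            i += 1                      # extend the run of set bits
        if i > start:
            yield (start, i - start)
```

Compared with a reader that reports both kinds of runs, callers that only act on valid values (e.g. kernels skipping nulls) avoid inspecting and discarding the cleared-bit runs.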
[jira] [Created] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
Joris Van den Bossche created ARROW-10695:

Summary: [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
Key: ARROW-10695
URL: https://issues.apache.org/jira/browse/ARROW-10695
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche

Currently we allow the user to specify a {{basename_template}}, which can include a {{"\{i\}"}} part that is replaced with an automatically incremented integer (so each generated file written to a single partition is unique):
https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717

It _might_ be useful to also have the ability to use a UUID, to ensure the file is unique in general (not only within a single write) and to mimic the behaviour of the old {{write_to_dataset}} implementation. For example, we could look for a {{"\{uuid\}"}} in the template string and, if present, replace it with a new UUID for each file.
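The proposed expansion could be sketched as follows (the function name is hypothetical, not an existing Arrow API; it only shows the substitution semantics being suggested):

```python
import uuid

def expand_basename_template(template, i):
    """Expand a basename_template supporting both the existing "{i}"
    counter and the proposed "{uuid}" placeholder.

    "{i}" is replaced with the per-partition file counter; "{uuid}", if
    present, is replaced with a fresh UUID for each generated file.
    """
    name = template.replace("{i}", str(i))
    if "{uuid}" in name:
        name = name.replace("{uuid}", uuid.uuid4().hex)
    return name
```

For example, `expand_basename_template("part-{i}-{uuid}.parquet", 0)` yields a different basename on every call, so repeated writes into the same partition directory would not collide even across separate write_dataset invocations.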
[jira] [Created] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
Lance Dacey created ARROW-10694:

Summary: [Python] ds.write_dataset() generates empty files for each final partition
Key: ARROW-10694
URL: https://issues.apache.org/jira/browse/ARROW-10694
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 2.0.0
Environment: Ubuntu 18.04, Python 3.8.6, adlfs master branch
Reporter: Lance Dacey

ds.write_dataset() generates an empty file for each final partition folder, which causes errors when reading the dataset or converting a dataset to a table.

I believe this may be caused by fs.mkdir(). Without the final slash in the path, an empty file is created in the "dev" container:
{code:java}
fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
fs.mkdir("dev/test2")
{code}
If the final slash is added, a proper folder is created:
{code:java}
fs.mkdir("dev/test2/")
{code}
Here is a full example of what happens with ds.write_dataset:
{code:java}
schema = pa.schema(
    [
        ("year", pa.int16()),
        ("month", pa.int8()),
        ("day", pa.int8()),
        ("report_date", pa.date32()),
        ("employee_id", pa.string()),
        ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
    ]
)
part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))
ds.write_dataset(data=table,
                 base_dir="dev/test-dataset",
                 basename_template="test-{i}.parquet",
                 format="parquet",
                 partitioning=part,
                 schema=schema,
                 filesystem=fs)

dataset.files
# sample printed below; note the empty files
[
    'dev/test-dataset/2018/1/1/test-0.parquet',
    'dev/test-dataset/2018/10/1',
    'dev/test-dataset/2018/10/1/test-27.parquet',
    'dev/test-dataset/2018/3/1',
    'dev/test-dataset/2018/3/1/test-6.parquet',
    'dev/test-dataset/2020/1/1',
    'dev/test-dataset/2020/1/1/test-2.parquet',
    'dev/test-dataset/2020/10/1',
    'dev/test-dataset/2020/10/1/test-29.parquet',
    'dev/test-dataset/2020/11/1',
    'dev/test-dataset/2020/11/1/test-32.parquet',
    'dev/test-dataset/2020/2/1',
    'dev/test-dataset/2020/2/1/test-5.parquet',
    'dev/test-dataset/2020/7/1',
    'dev/test-dataset/2020/7/1/test-20.parquet',
    'dev/test-dataset/2020/8/1',
    'dev/test-dataset/2020/8/1/test-23.parquet',
    'dev/test-dataset/2020/9/1',
    'dev/test-dataset/2020/9/1/test-26.parquet'
]
{code}
As you can see, there is an empty file for each "day" partition. I was not able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1). I then get an error when I try to use the to_table() method:
{code:java}
OSError                                   Traceback (most recent call last)
----> 1 dataset.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}
If I manually delete the empty file, I can then use to_table():
{code:java}
dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas()
{code}
Is this a bug in pyarrow, adlfs, or fsspec?
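Until the underlying cause is fixed, a read-side workaround is to build the file list yourself and skip the zero-byte marker entries (a sketch against a local path; the helper name is mine, and an object-store filesystem would use its own listing API instead of os.walk):

```python
import os

def nonempty_parquet_files(root):
    """List only non-empty .parquet files under root, skipping the
    zero-byte entries that write_dataset left behind for each partition
    directory. A workaround for reading around the bug, not a fix.
    """
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # the stray entries have no .parquet suffix and zero size
            if name.endswith(".parquet") and os.path.getsize(path) > 0:
                found.append(path)
    return sorted(found)
```

The filtered list can then be passed explicitly when constructing the dataset, so the scanner never touches the 0-byte entries.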