[jira] [Created] (ARROW-10708) [Packaging][deb] Add support for Ubuntu 20.10

2020-11-23 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10708:


 Summary: [Packaging][deb] Add support for Ubuntu 20.10
 Key: ARROW-10708
 URL: https://issues.apache.org/jira/browse/ARROW-10708
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10707) [Python][Parquet] Enhance hive partition filtering with 'like' operator

2020-11-23 Thread Weiyang Zhao (Jira)
Weiyang Zhao created ARROW-10707:


 Summary: [Python][Parquet] Enhance hive partition filtering with 
'like' operator
 Key: ARROW-10707
 URL: https://issues.apache.org/jira/browse/ARROW-10707
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao








[jira] [Created] (ARROW-10706) [Python][Parquet] when filters end up with no partition, it will throw index out of range error.

2020-11-23 Thread Weiyang Zhao (Jira)
Weiyang Zhao created ARROW-10706:


 Summary: [Python][Parquet] when filters end up with no partition, 
it will throw index out of range error.
 Key: ARROW-10706
 URL: https://issues.apache.org/jira/browse/ARROW-10706
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Weiyang Zhao
Assignee: Weiyang Zhao


The code below raises IndexError when the filter matches no partition:

{code:python}
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    base_path, filesystem=fs,
    filters=[('string', '=', "notExisted")],
    use_legacy_dataset=True,
)
{code}





[jira] [Created] (ARROW-10705) [Rust] Lifetime annotations in the IPC writer are too strict, preventing code reuse

2020-11-23 Thread Carol Nichols (Jira)
Carol Nichols created ARROW-10705:

 Summary: [Rust] Lifetime annotations in the IPC writer are too 
strict, preventing code reuse
 Key: ARROW-10705
 URL: https://issues.apache.org/jira/browse/ARROW-10705
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Carol Nichols
Assignee: Carol Nichols


I will illustrate and explain this further in the PR I'm about to open.





[jira] [Created] (ARROW-10704) Remove Nested from expression enum

2020-11-23 Thread Daniël Heres (Jira)
Daniël Heres created ARROW-10704:


 Summary: Remove Nested from expression enum
 Key: ARROW-10704
 URL: https://issues.apache.org/jira/browse/ARROW-10704
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Daniël Heres


Remove Nested from expression enum. It's not needed and never produced/used.





[jira] [Created] (ARROW-10703) [Rust] [DataFusion] Make join not collect left on every part

2020-11-23 Thread Jorge Leitão (Jira)
Jorge Leitão created ARROW-10703:


 Summary: [Rust] [DataFusion] Make join not collect left on every 
part
 Key: ARROW-10703
 URL: https://issues.apache.org/jira/browse/ARROW-10703
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Created] (ARROW-10702) [C++] Micro-optimize integer parsing

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10702:

 Summary: [C++] Micro-optimize integer parsing
 Key: ARROW-10702
 URL: https://issues.apache.org/jira/browse/ARROW-10702
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


It might be possible to optimize integer and decimal parsing using the 
following tricks from the {{fast_float}} library:
https://github.com/lemire/fast_float/blob/70c9b7f884c7f80a9a0e06fa9754c0a2e6a9492e/include/fast_float/ascii_number.h#L18-L38
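For illustration, the linked fast_float code parses eight ASCII digits at once with SWAR (SIMD-within-a-register) arithmetic. Below is a Python transcription of that bit manipulation; the eventual optimization would of course be written in C++, and this sketch only demonstrates the technique:

```python
def parse_eight_digits(s: bytes) -> int:
    """Parse exactly 8 ASCII digits with SWAR arithmetic, mirroring
    fast_float's parse_eight_digits_unrolled (64-bit wraparound emulated)."""
    M64 = (1 << 64) - 1
    val = int.from_bytes(s, "little")
    val -= 0x3030303030303030            # turn each ASCII digit byte into 0..9
    val = (val * 10 + (val >> 8)) & M64  # combine adjacent digits into 2-digit values
    mask = 0x000000FF000000FF
    mul1 = 0x000F424000000064            # 100 + (1_000_000 << 32)
    mul2 = 0x0000271000000001            # 1 + (10_000 << 32)
    val = (((val & mask) * mul1 + ((val >> 16) & mask) * mul2) & M64) >> 32
    return val
```

Compared with a digit-at-a-time loop, this replaces eight multiply-add steps with a handful of word-sized operations.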






[jira] [Created] (ARROW-10701) [Rust] [Datafusion] Benchmark sort_limit_query_sql fails because order by clause specifies column index instead of expression

2020-11-23 Thread Jörn Horstmann (Jira)
Jörn Horstmann created ARROW-10701:

 Summary: [Rust] [Datafusion] Benchmark sort_limit_query_sql fails 
because order by clause specifies column index instead of expression
 Key: ARROW-10701
 URL: https://issues.apache.org/jira/browse/ARROW-10701
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jörn Horstmann


I probably introduced this bug some time ago, but another bug in the benchmark 
setup caused the query to be planned but never executed, which hid it.

DataFusion should probably also support ordering by column index, as in

SELECT foo, bar
  FROM table
 ORDER BY 1, 2

but for now the easiest fix for the benchmark is to specify the column names 
instead of the indices.





[jira] [Created] (ARROW-10700) [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10700:

 Summary: [C++] Warning "ignoring unknown option '-mbmi2'" on MSVC
 Key: ARROW-10700
 URL: https://issues.apache.org/jira/browse/ARROW-10700
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Seen on Github Actions:
https://github.com/apache/arrow/pull/8716/checks?check_run_id=1442252599#step:7:792

{code}
  Generating Code...
  level_comparison_avx2.cc
cl : command line warning D9002: ignoring unknown option '-mbmi2' 
[D:\a\arrow\arrow\build\cpp\src\parquet\parquet_shared.vcxproj]
  level_conversion_bmi2.cc
{code}

This may also affect performance, since the BMI2-specific flag is apparently not being applied.





[jira] [Created] (ARROW-10699) [C++] BitmapUInt64Reader doesn't work on big-endian

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10699:

 Summary: [C++] BitmapUInt64Reader doesn't work on big-endian
 Key: ARROW-10699
 URL: https://issues.apache.org/jira/browse/ARROW-10699
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


I didn't notice this when merging ARROW-10655 (the s390x CI is allowed to fail).
https://travis-ci.com/github/apache/arrow/jobs/445803711#L3534







[jira] [Created] (ARROW-10698) [C++] Optimize union equality comparison

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10698:

 Summary: [C++] Optimize union equality comparison
 Key: ARROW-10698
 URL: https://issues.apache.org/jira/browse/ARROW-10698
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


Currently, union array comparison in {{ArrayRangeEqual}} computes child 
equality over single union elements. This adds a large per-element comparison 
overhead. At least for sparse unions, it may be beneficial to detect contiguous 
runs of child ids and run child comparisons on entire runs.
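A rough Python model of the run-detection step (the real change would live in the C++ comparison kernels; the function name and output shape here are illustrative):

```python
def child_id_runs(type_ids):
    """Group a union array's child ids into (child_id, start, length) runs,
    so child equality can be checked once per run instead of per element."""
    runs = []
    start = 0
    for i in range(1, len(type_ids) + 1):
        # close the current run when the child id changes or input ends
        if i == len(type_ids) or type_ids[i] != type_ids[start]:
            runs.append((type_ids[start], start, i - start))
            start = i
    return runs
```

Each run can then be compared with a single ranged child comparison rather than one comparison per union element.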





[jira] [Created] (ARROW-10697) [C++] Consolidate bitmap word readers

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10697:

 Summary: [C++] Consolidate bitmap word readers
 Key: ARROW-10697
 URL: https://issues.apache.org/jira/browse/ARROW-10697
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


We currently have {{BitmapWordReader}}, {{BitmapUInt64Reader}} and 
{{Bitmap::VisitWords}}.

We should try to consolidate those, assuming benchmarks don't regress.





[jira] [Created] (ARROW-10696) [C++] Investigate a bit run reader that would only return runs of set bits

2020-11-23 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10696:

 Summary: [C++] Investigate a bit run reader that would only return 
runs of set bits
 Key: ARROW-10696
 URL: https://issues.apache.org/jira/browse/ARROW-10696
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Followup to PR discussion: 
https://github.com/apache/arrow/pull/8703#discussion_r526263665
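As a reference model, here is a naive Python version of such a reader (the C++ implementation would scan the bitmap word-at-a-time rather than bit-by-bit; this only pins down the intended output):

```python
def set_bit_runs(bits):
    """Return (offset, length) pairs for each maximal run of set bits,
    skipping runs of unset bits entirely."""
    runs = []
    i, n = 0, len(bits)
    while i < n:
        if bits[i]:
            start = i
            while i < n and bits[i]:
                i += 1
            runs.append((start, i - start))
        else:
            i += 1
    return runs
```

Returning only set-bit runs lets callers that ignore nulls skip the zero runs without ever materializing them.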





[jira] [Created] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2020-11-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-10695:

 Summary: [C++][Dataset] Allow to use a UUID in the 
basename_template when writing a dataset
 Key: ARROW-10695
 URL: https://issues.apache.org/jira/browse/ARROW-10695
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently we allow the user to specify a {{basename_template}}, and this can 
include a {{"\{i\}"}} part to replace it with an automatically incremented 
integer (so each generated file written to a single partition is unique):

https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717

It _might_ be useful to also support a UUID, to ensure the file name is unique 
in general (not only within a single write) and to mimic the behaviour of the 
old {{write_to_dataset}} implementation.

For example, we could look for a {{"\{uuid\}"}} token in the template string 
and, if present, replace it with a newly generated UUID for each file.
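A minimal sketch of that substitution (the helper name and token handling are hypothetical, not a final API):

```python
import uuid


def expand_basename_template(template: str, i: int) -> str:
    """Expand the existing "{i}" counter plus a proposed "{uuid}" token,
    giving each written file a globally unique basename."""
    return template.replace("{uuid}", uuid.uuid4().hex).replace("{i}", str(i))
```

A template like {{"part-\{i\}-\{uuid\}.parquet"}} would then stay unique across repeated writes to the same partition directory, not just within one write.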





[jira] [Created] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-10694:

 Summary: [Python] ds.write_dataset() generates empty files for 
each final partition
 Key: ARROW-10694
 URL: https://issues.apache.org/jira/browse/ARROW-10694
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: Ubuntu 18.04
Python 3.8.6
adlfs master branch
Reporter: Lance Dacey


ds.write_dataset() generates an empty file for each final partition folder, 
which causes errors when reading the dataset or converting it to a table.

I believe this may be caused by fs.mkdir(). Without a trailing slash in the 
path, an empty file is created in the "dev" container:

 
{code:java}
fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password)
fs.mkdir("dev/test2")
{code}
 

If the final slash is added, a proper folder is created:
{code:java}
fs.mkdir("dev/test2/")
{code}
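Until the root cause is fixed, a workaround sketch based on the trailing-slash behaviour above (the helper name is mine, not part of any library):

```python
def mkdir_as_directory(fs, path: str) -> None:
    """Workaround: the filesystem creates an empty blob for "dev/test2" but a
    real directory for "dev/test2/", so always pass a trailing slash."""
    fs.mkdir(path if path.endswith("/") else path + "/")
```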
 

Here is a full example of what happens with ds.write_dataset:
{code:java}
schema = pa.schema(
    [
        ("year", pa.int16()),
        ("month", pa.int8()),
        ("day", pa.int8()),
        ("report_date", pa.date32()),
        ("employee_id", pa.string()),
        ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())),
    ]
)

part = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

ds.write_dataset(
    data=table,
    base_dir="dev/test-dataset",
    basename_template="test-{i}.parquet",
    format="parquet",
    partitioning=part,
    schema=schema,
    filesystem=fs,
)

dataset.files

#sample printed below, note the empty files
[
 'dev/test-dataset/2018/1/1/test-0.parquet',
 'dev/test-dataset/2018/10/1',
 'dev/test-dataset/2018/10/1/test-27.parquet',
 'dev/test-dataset/2018/3/1',
 'dev/test-dataset/2018/3/1/test-6.parquet',
 'dev/test-dataset/2020/1/1',
 'dev/test-dataset/2020/1/1/test-2.parquet',
 'dev/test-dataset/2020/10/1',
 'dev/test-dataset/2020/10/1/test-29.parquet',
 'dev/test-dataset/2020/11/1',
 'dev/test-dataset/2020/11/1/test-32.parquet',
 'dev/test-dataset/2020/2/1',
 'dev/test-dataset/2020/2/1/test-5.parquet',
 'dev/test-dataset/2020/7/1',
 'dev/test-dataset/2020/7/1/test-20.parquet',
 'dev/test-dataset/2020/8/1',
 'dev/test-dataset/2020/8/1/test-23.parquet',
 'dev/test-dataset/2020/9/1',
 'dev/test-dataset/2020/9/1/test-26.parquet'
]{code}
As you can see, there is an empty file for each "day" partition. I was not 
able to read the dataset at all until I manually deleted the first empty file 
(2018/1/1).

I then get an error when I try to use the to_table() method:
{code:java}
OSError                                   Traceback (most recent call last)
<ipython-input> in <module>
----> 1 dataset.to_table()

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}
If I manually delete the empty file, I can then use the to_table() function:
{code:java}
dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
10)).to_pandas()
{code}
Is this a bug with pyarrow, adlfs, or fsspec?

 


