[jira] [Created] (ARROW-16647) [C++] Add support for unique(), value_counts(), dictionary_encode() with interval types
Keisuke Okada created ARROW-16647:
---

Summary: [C++] Add support for unique(), value_counts(), dictionary_encode() with interval types
Key: ARROW-16647
URL: https://issues.apache.org/jira/browse/ARROW-16647
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Keisuke Okada

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16646) [C++] HashJoin node can crash if a key column is a scalar
Weston Pace created ARROW-16646:
---

Summary: [C++] HashJoin node can crash if a key column is a scalar
Key: ARROW-16646
URL: https://issues.apache.org/jira/browse/ARROW-16646
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace

This only happens when the node has a bloom filter pushed down into it. In that case it will attempt to hash the key columns in {{arrow::compute::HashJoinBasicImpl::ApplyBloomFiltersToBatch}} by calling {{Hashing32::HashBatch}} on a batch made up only of key columns. If one of those key columns happens to be a scalar, and not an array, then this method triggers a {{DCHECK}} and crashes.
[jira] [Created] (ARROW-16645) Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
Matthew Roeschke created ARROW-16645:
---

Summary: Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
Key: ARROW-16645
URL: https://issues.apache.org/jira/browse/ARROW-16645
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 7.0.0
Reporter: Matthew Roeschke

Example:
{code:java}
In [15]: import pyarrow as pa

In [16]: pa.array([1, pa.NA])
ArrowInvalid: Could not convert <pyarrow.NullScalar: None> with type pyarrow.lib.NullScalar: did not recognize Python value type when inferring an Arrow data type{code}
It would be great if this could be equivalent to
{code:java}
In [17]: pa.array([1, pa.NA], mask=[False, True])
Out[17]:
[
  1,
  null
]

In [18]: pa.__version__
Out[18]: '7.0.0'{code}
[jira] [Created] (ARROW-16644) [C++] Unsuppress -Wno-return-stack-address
David Li created ARROW-16644:
---

Summary: [C++] Unsuppress -Wno-return-stack-address
Key: ARROW-16644
URL: https://issues.apache.org/jira/browse/ARROW-16644
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: David Li

Follow up for ARROW-16643: this code in {{small_vector_benchmark.cc}} generates a warning on clang-14 that we should unsuppress
{code:cpp}
template <typename Vector>
ARROW_NOINLINE int64_t ConsumeVector(Vector v) {
  return reinterpret_cast<int64_t>(v.data());
}

template <typename Vector>
ARROW_NOINLINE int64_t IngestVector(const Vector& v) {
  return reinterpret_cast<int64_t>(v.data());
}
{code}
[jira] [Created] (ARROW-16643) [C++] Fix -Werror CHECKIN build with clang-14
Wes McKinney created ARROW-16643:
---

Summary: [C++] Fix -Werror CHECKIN build with clang-14
Key: ARROW-16643
URL: https://issues.apache.org/jira/browse/ARROW-16643
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 9.0.0

With clang-14, the C++ build fails on a handful of new warnings, including {{-Wreturn-stack-address}}. Will submit a patch.
[jira] [Created] (ARROW-16642) An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.
yurikoomiga created ARROW-16642:
---

Summary: An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.
Key: ARROW-16642
URL: https://issues.apache.org/jira/browse/ARROW-16642
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 8.0.0
Environment: C++, arrow 7.0.0 / 8.0.0, snappy 1.1.8, pyarrow 7.0.0, ubuntu 9.4.0, python 3.8
Reporter: yurikoomiga
Attachments: test_std_02.py

Hi all,

When I read a Parquet file with Arrow like this:
{code:cpp}
auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties), &_reader);
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);
{code}
the returned status is not OK and this error occurs:
{code}
IOError: Corrupt snappy compressed data.
{code}
When I comment out {{_reader->set_use_threads(true);}}, the program runs normally and I can read the Parquet file without problems. The error only occurs when I read multiple columns with {{set_use_threads(true)}}; reading a single column does not trigger it.

The test Parquet file was created with pyarrow. It has a single row group with 300 records and 20 columns of int and string types. You can create a test file with the attached Python script.

Reading the file: C++, arrow 7.0.0, snappy 1.1.8. Writing the file: python 3.8, pyarrow 7.0.0.

Looking forward to your reply. Thank you!
[jira] [Created] (ARROW-16641) [R] How to filter array columns?
Vladimir created ARROW-16641:
---

Summary: [R] How to filter array columns?
Key: ARROW-16641
URL: https://issues.apache.org/jira/browse/ARROW-16641
Project: Apache Arrow
Issue Type: Wish
Components: R
Reporter: Vladimir
Fix For: 8.0.0

In the parquet data we have, there is a column with an array data type ({{list<string>}}), which flags records that have different issues. For each record, multiple values could be stored in the column, for example {{[A, B, C]}}.

I'm trying to perform a data-filtering step and exclude some flagged records. Filtering is trivial for the regular columns that contain just a single value, e.g.:
{code:r}
flags_to_exclude <- c("A", "B")
datt %>% filter(! col %in% flags_to_exclude)
{code}
Given the array column, is it possible to exclude records with at least one of the flags from {{flags_to_exclude}} using the arrow R package?

I really appreciate any advice you can provide!
[jira] [Created] (ARROW-16640) [Java][CI] Add testing to current java-jars builds and extend to JDK 11 and 17
Raúl Cumplido created ARROW-16640:
---

Summary: [Java][CI] Add testing to current java-jars builds and extend to JDK 11 and 17
Key: ARROW-16640
URL: https://issues.apache.org/jira/browse/ARROW-16640
Project: Apache Arrow
Issue Type: Task
Components: Continuous Integration, Java
Reporter: Raúl Cumplido

As discussed on https://github.com/apache/arrow/pull/13157#issuecomment-1132907282, when we execute the crossbow build for the java-jars task we are not testing the built jars. Currently we are also only building the jars for JDK 1.8. We should both:
* Run Java tests over the built jars
* Build jars for the other supported JDKs (11 and 17)
[jira] [Created] (ARROW-16639) [R] Improve package installation on Fedora 36
Dewey Dunnington created ARROW-16639:
---

Summary: [R] Improve package installation on Fedora 36
Key: ARROW-16639
URL: https://issues.apache.org/jira/browse/ARROW-16639
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dewey Dunnington

Fedora 36 includes both a 'libarrow' and a 'libarrow-dataset' package under the pkg-config names 'arrow' and 'arrow-dataset', respectively. When we do {{install.packages("arrow")}} on Fedora 36, the pkg-config entry for 'arrow' gets picked up first, and we get a minimal arrow install. Installing with {{NOT_CRAN=true ARROW_USE_PKG_CONFIG=false}} results in a full compiled installation because binaries are not available for fedora-36 (according to the install output).

Reported by Roger Bivand (while testing GDAL + Arrow support, since Fedora 36 is the easiest place to do that due to system Arrow): https://github.com/paleolimbot/narrow/issues/7

Reproducible using the {{fedora:36}} docker container:
{code:bash}
# docker run --rm -it fedora:36
dnf install -y R libarrow-dataset-devel cmake
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
R -e 'arrow::arrow_info()$capabilities'
{code}