[jira] [Created] (ARROW-16647) [C++] Add support for unique(), value_counts(), dictionary_encode() with interval types

2022-05-24 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-16647:
-

 Summary: [C++] Add support for unique(), value_counts(), 
dictionary_encode() with interval types
 Key: ARROW-16647
 URL: https://issues.apache.org/jira/browse/ARROW-16647
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Keisuke Okada
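
For context, a minimal Python sketch (using the pyarrow bindings, which wrap the 
same C++ kernels) of the calls this issue asks to support; the interval array 
construction and the exact failure mode are assumptions based on current behavior:

{code:python}
# Hypothetical illustration via pyarrow; month_day_nano_interval is one of
# the interval types in question.
import pyarrow as pa
import pyarrow.compute as pc

# (months, days, nanoseconds) tuples converted to a month_day_nano_interval array
arr = pa.array([(1, 15, 0), (1, 15, 0), (2, 0, 0)],
               type=pa.month_day_nano_interval())

for name in ("unique", "value_counts", "dictionary_encode"):
    func = getattr(pc, name)
    try:
        print(name, func(arr))
    except pa.ArrowNotImplementedError as exc:
        # Expected until the C++ kernels gain interval support.
        print(name, "-> not yet supported:", exc)
{code}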






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16646) [C++] HashJoin node can crash if a key column is a scalar

2022-05-24 Thread Weston Pace (Jira)
Weston Pace created ARROW-16646:
---

 Summary: [C++] HashJoin node can crash if a key column is a scalar
 Key: ARROW-16646
 URL: https://issues.apache.org/jira/browse/ARROW-16646
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace


This only happens when the node has a bloom filter pushed down into it.  In 
that case it will attempt to hash the key columns in 
{{arrow::compute::HashJoinBasicImpl::ApplyBloomFiltersToBatch}} by calling 
{{Hashing32::HashBatch}} on a batch made up only of key columns.

If one of those key columns happens to be a scalar, and not an array, then this 
method triggers a {{DCHECK}} and crashes.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16645) Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container

2022-05-24 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-16645:


 Summary: Allow pa.array/pa.chunked_array to infer pa.NA when in a 
non pyarrow container
 Key: ARROW-16645
 URL: https://issues.apache.org/jira/browse/ARROW-16645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 7.0.0
Reporter: Matthew Roeschke


Example:

{code:python}
In [15]: import pyarrow as pa

In [16]: pa.array([1, pa.NA])
ArrowInvalid: Could not convert <pyarrow.NullScalar: None> with type 
pyarrow.lib.NullScalar: did not recognize Python value type when inferring an 
Arrow data type{code}

It would be great if this could be equivalent to
{code:python}
In [17]: pa.array([1, pa.NA], mask=[False, True])
Out[17]:
<pyarrow.lib.Int64Array object at 0x...>
[
  1,
  null
]


In [18]: pa.__version__
Out[18]: '7.0.0'{code}
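
In the meantime, a possible user-side workaround (a sketch, not part of the 
report; it assumes the inputs use the {{pa.NA}} singleton) is to map {{pa.NA}} 
to {{None}} before conversion, since {{None}} is already inferred as null:

{code:python}
import pyarrow as pa

data = [1, pa.NA]
# Replace the pa.NA singleton with None, which pa.array() already infers as null.
cleaned = [None if v is pa.NA else v for v in data]
arr = pa.array(cleaned)
assert arr.to_pylist() == [1, None]  # int64 array with a null
{code}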


--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16644) [C++] Unsuppress -Wno-return-stack-address

2022-05-24 Thread David Li (Jira)
David Li created ARROW-16644:


 Summary: [C++] Unsuppress -Wno-return-stack-address
 Key: ARROW-16644
 URL: https://issues.apache.org/jira/browse/ARROW-16644
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: David Li


Follow-up to ARROW-16643: this code in {{small_vector_benchmark.cc}} generates 
a warning on clang-14 that we should unsuppress:

{code:cpp}
template <typename Vector>
ARROW_NOINLINE int64_t ConsumeVector(Vector v) {
  return reinterpret_cast<int64_t>(v.data());
}

template <typename Vector>
ARROW_NOINLINE int64_t IngestVector(const Vector& v) {
  return reinterpret_cast<int64_t>(v.data());
}
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16643) [C++] Fix -Werror CHECKIN build with clang-14

2022-05-24 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-16643:


 Summary: [C++] Fix -Werror CHECKIN build with clang-14
 Key: ARROW-16643
 URL: https://issues.apache.org/jira/browse/ARROW-16643
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


With clang-14, the C++ build fails on a handful of new warnings, including 
{{-Wreturn-stack-address}}. Will submit a patch.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16642) An Error Occurred While Reading Parquet File Using C++ - GetRecordBatchReader - Corrupt snappy compressed data.

2022-05-24 Thread yurikoomiga (Jira)
yurikoomiga created ARROW-16642:
---

 Summary: An Error Occurred While Reading Parquet File Using C++ - 
GetRecordBatchReader - Corrupt snappy compressed data. 
 Key: ARROW-16642
 URL: https://issues.apache.org/jira/browse/ARROW-16642
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 8.0.0
 Environment: C++, arrow 7.0.0, snappy 1.1.8, arrow 8.0.0, pyarrow 7.0.0, 
ubuntu 9.4.0, python3.8

Reporter: yurikoomiga
 Attachments: test_std_02.py

Hi All

I read a Parquet file with Arrow using the following code:

{code:cpp}
auto st = parquet::arrow::FileReader::Make(
    arrow::default_memory_pool(),
    parquet::ParquetFileReader::Open(_parquet, _properties),
    &_reader);
arrow::Status status = _reader->GetRecordBatchReader(
    {_current_group}, _parquet_column_ids, &_rb_batch);
_reader->set_batch_size(65536);
_reader->set_use_threads(true);
status = _rb_batch->ReadNext(&_batch);
{code}

The returned status is not OK, and an error like this occurs:
{{IOError: Corrupt snappy compressed data.}}

When I comment out the statement {{_reader->set_use_threads(true);}}, the 
program runs normally and I can read the Parquet file without problems.
The error only occurs when I read multiple columns with 
{{_reader->set_use_threads(true);}}; reading a single column does not trigger it.

The test Parquet file was created with pyarrow. It has a single row group 
containing 300 records, and 20 columns including int and string types.

You can create such a test Parquet file using the attached Python script.
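
For reference, a minimal writer sketch (hypothetical, not the attached 
test_std_02.py) matching the description above: one row group of 300 rows, 
20 columns mixing int and string types, snappy compression:

{code:python}
# Hypothetical sketch of a writer script; column names are made up.
import pyarrow as pa
import pyarrow.parquet as pq

n_rows = 300
columns = {}
for i in range(10):
    columns[f"int_col_{i}"] = pa.array(range(n_rows), type=pa.int64())
    columns[f"str_col_{i}"] = pa.array([f"value_{j}" for j in range(n_rows)])

table = pa.table(columns)
pq.write_table(table, "test.parquet",
               compression="snappy", row_group_size=n_rows)
{code}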

Reading the file uses C++, arrow 7.0.0, snappy 1.1.8.

Writing the file uses Python 3.8, pyarrow 7.0.0.

Looking forward to your reply

Thank you!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16641) [R] How to filter array columns?

2022-05-24 Thread Vladimir (Jira)
Vladimir created ARROW-16641:


 Summary: [R] How to filter array columns?
 Key: ARROW-16641
 URL: https://issues.apache.org/jira/browse/ARROW-16641
 Project: Apache Arrow
  Issue Type: Wish
  Components: R
Reporter: Vladimir
 Fix For: 8.0.0


In the Parquet data we have, there is a column with an array data type 
({*}list<...>{*}) that flags records with different issues. For each record, 
multiple values can be stored in the column, for example {{[A, B, C]}}.

I'm trying to perform a data filtering step and exclude some flagged records.

Filtering is trivial for the regular columns that contain just a single value. 
E.g.,

{code:r}
flags_to_exclude <- c("A", "B")

datt %>% filter(! col %in% flags_to_exclude)
{code}

 

Given the array column, is it possible to exclude records with at least one of 
the flags from `flags_to_exclude` using the arrow R package?

I really appreciate any advice you can provide!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16640) [Java][CI] Add testing to current java-jars builds and extend to JDK 11 and 17

2022-05-24 Thread Raúl Cumplido (Jira)
Raúl Cumplido created ARROW-16640:
-

 Summary: [Java][CI] Add testing to current java-jars builds and 
extend to JDK 11 and 17
 Key: ARROW-16640
 URL: https://issues.apache.org/jira/browse/ARROW-16640
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Java
Reporter: Raúl Cumplido


As discussed in 
[https://github.com/apache/arrow/pull/13157#issuecomment-1132907282], when we 
execute the crossbow build for the java-jars task we are not testing the built 
jars. Currently we are also only building the jars for JDK 1.8. We should both:
 * Run Java tests over the built jars
 * Build jars for the other supported JDKs (11 and 17)

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16639) [R] Improve package installation on Fedora 36

2022-05-24 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16639:


 Summary: [R] Improve package installation on Fedora 36
 Key: ARROW-16639
 URL: https://issues.apache.org/jira/browse/ARROW-16639
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


Fedora 36 includes both a 'libarrow' and a 'libarrow-dataset' package under 
pkg-config names 'arrow' and 'arrow-dataset', respectively. When we do 
{{install.packages("arrow")}} on Fedora 36, the pkg-config for 'arrow' gets 
picked up first, and we get a minimal arrow install. Installing with 
{{NOT_CRAN=true ARROW_USE_PKG_CONFIG=false}} results in a full compiled 
installation because binaries are not available for fedora-36 (according to the 
install output).

Reported by Roger Bivand (while testing GDAL + Arrow support since Fedora 36 is 
the easiest place to do that due to system Arrow): 
https://github.com/paleolimbot/narrow/issues/7

Reproducible using the {{fedora:36}} docker container:


{code:bash}
# docker run --rm -it fedora:36
dnf install -y R libarrow-dataset-devel cmake
R -e 'install.packages("arrow", repos = "https://cloud.r-project.org/")'
R -e 'arrow::arrow_info()$capabilities'
{code}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)