[jira] [Created] (ARROW-16984) [Ruby] Add support for installing Apache Arrow GLib automatically

2022-07-05 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16984:


 Summary: [Ruby] Add support for installing Apache Arrow GLib 
automatically
 Key: ARROW-16984
 URL: https://issues.apache.org/jira/browse/ARROW-16984
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 9.0.0


Fedora 37 or later will ship Apache Arrow GLib as {{libarrow-glib-devel}}: 
https://packages.fedoraproject.org/pkgs/libarrow/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16983) Delta byte array encoder broken due to memory leak

2022-07-05 Thread Matt DePero (Jira)
Matt DePero created ARROW-16983:
---

 Summary: Delta byte array encoder broken due to memory leak
 Key: ARROW-16983
 URL: https://issues.apache.org/jira/browse/ARROW-16983
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Reporter: Matt DePero


The `DeltaByteArrayEncoder` has a memory leak due to a bug in how 
`EstimatedDataEncodedSize` is calculated. `DeltaByteArrayEncoder` extends 
`encoder`, which calculates `EstimatedDataEncodedSize` by calling `Len()` on its 
`PooledBufferWriter` sink. `DeltaByteArrayEncoder`, however, does not write data 
to its sink; it writes to `prefixEncoder` and `suffixEncoder` instead, causing 
`EstimatedDataEncodedSize` to always return zero.
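The failure mode can be sketched abstractly (a Python sketch, not the actual Go code; the class and method names below only mirror the Go ones for illustration):

```python
# Minimal sketch of the accounting bug: the base class measures its own
# sink, but the subclass writes elsewhere, so the inherited size
# estimate stays at zero no matter how much data is buffered.

class Encoder:
    def __init__(self):
        self.sink = bytearray()  # stands in for the PooledBufferWriter

    def estimated_data_encoded_size(self):
        return len(self.sink)  # stands in for sink.Len()


class DeltaByteArrayEncoder(Encoder):
    def __init__(self):
        super().__init__()
        self.prefix_encoder = Encoder()
        self.suffix_encoder = Encoder()

    def put(self, value: bytes):
        # Data goes to the child encoders, never to self.sink.
        self.prefix_encoder.sink += value[:1]
        self.suffix_encoder.sink += value[1:]

    # A plausible fix: sum the child encoders' estimates instead.
    def fixed_estimated_size(self):
        return (self.prefix_encoder.estimated_data_encoded_size()
                + self.suffix_encoder.estimated_data_encoded_size())


enc = DeltaByteArrayEncoder()
enc.put(b"hello")
print(enc.estimated_data_encoded_size())  # 0 -- the bug
print(enc.fixed_estimated_size())         # 5
```

A plausible direction for a fix is for the subclass to report the sum of its child encoders' estimates, as `fixed_estimated_size` does above.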





[GitHub] [arrow-adbc] lidavidm merged pull request #29: Install the headers along with the driver manager

2022-07-05 Thread GitBox


lidavidm merged PR #29:
URL: https://github.com/apache/arrow-adbc/pull/29


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (ARROW-16982) Slow reading of partitioned parquet files from S3

2022-07-05 Thread Jira
Blaž Zupančič created ARROW-16982:
-

 Summary: Slow reading of partitioned parquet files from S3
 Key: ARROW-16982
 URL: https://issues.apache.org/jira/browse/ARROW-16982
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Blaž Zupančič


When reading partitioned files from S3 and using filters to select partitions, 
the reader will send list requests each time read_table() is called.
{code:python}
# partitioning: s3://bucket/year=/month=y/day=z

from pyarrow import parquet
parquet.read_table('s3://bucket', filters=[('day', '=', 1)]) # lists s3 bucket
parquet.read_table('s3://bucket', filters=[('day', '=', 2)]) # lists again{code}
This is not a problem if done once, but repeated calls to select different 
partitions lead to a large number of (slow and potentially expensive) S3 list 
requests.

The current workaround is to list and filter the partition structure manually, 
but this is not nearly as convenient as using filters.

If we know that the S3 prefixes did not change, it should be possible to do the 
recursive list only once and load different data multiple times (using only S3 
get requests). I suppose this should be possible using ParquetDataset; however, 
the current implementation only accepts filters in the constructor, not in the 
read() method.





[GitHub] [arrow-adbc] lidavidm opened a new pull request, #29: Install the headers along with the driver manager

2022-07-05 Thread GitBox


lidavidm opened a new pull request, #29:
URL: https://github.com/apache/arrow-adbc/pull/29

   Right now we only install the shared library.





[jira] [Created] (ARROW-16981) [C++] Expose jemalloc statistics for logging

2022-07-05 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16981:
--

 Summary: [C++] Expose jemalloc statistics for logging
 Key: ARROW-16981
 URL: https://issues.apache.org/jira/browse/ARROW-16981
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc
Assignee: Rok Mihevc


This would enable us to log memory usage and diagnose out-of-memory issues.





[GitHub] [arrow-adbc] lidavidm merged pull request #27: Make AdbcConnectionNew 2-adic for consistency

2022-07-05 Thread GitBox


lidavidm merged PR #27:
URL: https://github.com/apache/arrow-adbc/pull/27





[jira] [Created] (ARROW-16980) [Python] Results of running a substrait plan against a tpch data table written into parquet are all null

2022-07-05 Thread Richard Tia (Jira)
Richard Tia created ARROW-16980:
---

 Summary: [Python] Results of running a substrait plan against a 
tpch data table written into parquet are all null
 Key: ARROW-16980
 URL: https://issues.apache.org/jira/browse/ARROW-16980
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Richard Tia
 Attachments: lineitem.json

SQL
{code:sql}
SELECT l_returnflag, l_linestatus FROM lineitem{code}
 

substrait plan type info for l_returnflag:
{code:json}
{
  "fixedChar": {
    "length": 1,
    "typeVariationReference": 0,
    "nullability": "NULLABILITY_NULLABLE"
  }
}{code}
fixedChar is an extension type.

 

Error:
{code:java}
pyarrow/table.pxi:1223: in pyarrow.lib.ChunkedArray.chunks.__get__
    ???
pyarrow/table.pxi:1241: in iterchunks
    ???
pyarrow/table.pxi:1185: in pyarrow.lib.ChunkedArray.chunk
    ???
pyarrow/public-api.pxi:200: in pyarrow.lib.pyarrow_wrap_array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   AttributeError: 'pyarrow.lib.BaseExtensionType' object has no attribute 
'__arrow_ext_class__'

{code}
 

Reproduction Steps:
{code:python}
import os

import pyarrow as pa
import pyarrow.substrait as substrait

from pyarrow import json as pyarrow_json
from pyarrow.lib import tobytes


substrait_query = 

json_file_path = os.path.join(, 'lineitem.json')
arrow_data_path_ipc = os.path.join(, 'substrait_data.arrow')
substrait_query = tobytes(substrait_query.replace("FILENAME_PLACEHOLDER",
                                                  arrow_data_path_ipc))


# Save lineitem.json into an IPC Arrow binary file
table = pyarrow_json.read_json(json_file_path)

with pa.ipc.RecordBatchFileWriter(arrow_data_path_ipc, table.schema) as writer:
    writer.write_table(table)


# Run the substrait query plan
buf = pa._substrait._parse_json_plan(substrait_query)
reader = substrait.run_query(buf)
result = reader.read_all()

print(result.columns[0].chunks)
{code}
lineitem.json is attached

substrait query plan:
{code:java}
"""
{
  "extensionUris": [],
  "extensions": [],
  "relations": [{
"root": {
  "input": {
"project": {
  "common": {
  },
  "input": {
"read": {
  "common": {
"direct": {
}
  },
  "baseSchema": {
"names": ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", 
"L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", 
"L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", 
"L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT"],
"struct": {
  "types": [{
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"i64": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"i32": {
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"decimal": {
  "scale": 0,
  "precision": 19,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"decimal": {
  "scale": 0,
  "precision": 19,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"decimal": {
  "scale": 0,
  "precision": 19,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"decimal": {
  "scale": 0,
  "precision": 19,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"fixedChar": {
  "length": 1,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"fixedChar": {
  "length": 1,
  "typeVariationReference": 0,
  "nullability": "NULLABILITY_NULLABLE"
}
  }, {
"date": {
 

[jira] [Created] (ARROW-16979) [Java] Further Consolidate JNI compilation

2022-07-05 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16979:
-

 Summary: [Java] Further Consolidate JNI compilation
 Key: ARROW-16979
 URL: https://issues.apache.org/jira/browse/ARROW-16979
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Alessandro Molina
 Fix For: 10.0.0








[jira] [Created] (ARROW-16978) [C#] Intermittent Archery Failures

2022-07-05 Thread Raphael Taylor-Davies (Jira)
Raphael Taylor-Davies created ARROW-16978:
-

 Summary: [C#] Intermittent Archery Failures
 Key: ARROW-16978
 URL: https://issues.apache.org/jira/browse/ARROW-16978
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Raphael Taylor-Davies


We are seeing intermittent archery failures in arrow-rs - 
[here|https://github.com/apache/arrow-rs/runs/6987393626?check_suite_focus=true]
{code:java}
FAILED TEST: datetime C# producing,  C# consuming
1 failures
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in 
_run_ipc_test_case
run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 100, in run_gold
return self._run_gold(gold_dir, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 322, in 
_run_gold
consumer.stream_to_file(consumer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in 
stream_to_file
self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in 
run_shell_command
subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in 
check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 
'/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest
 --mode stream-to-file -a 
/tmp/tmpnhaxwkhj/d72f0c1c_0.14.1_datetime.gold.consumer_stream_as_file < 
/arrow/testing/data/arrow-ipc-stream/integration/0.14.1/generated_datetime.stream'
 returned non-zero exit status 1. {code}
It is possible that this has something to do with how we are running the archery 
tests, but I am at a loss as to how to debug this issue and would appreciate 
some input.

I think it started around the time 
[this|https://github.com/apache/arrow/pull/13279] was merged.

 





[jira] [Created] (ARROW-16977) [R] Update dataset row counting so no integer overflow on large datasets

2022-07-05 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16977:


 Summary: [R] Update dataset row counting so no integer overflow on 
large datasets
 Key: ARROW-16977
 URL: https://issues.apache.org/jira/browse/ARROW-16977
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-16976) [R] Build linux binaries on older image (like manylinux2014)

2022-07-05 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16976:
---

 Summary: [R] Build linux binaries on older image (like 
manylinux2014)
 Key: ARROW-16976
 URL: https://issues.apache.org/jira/browse/ARROW-16976
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, R
Reporter: Neal Richardson


ARROW-16752 observed that even with newer compilers installed on CentOS 7, you 
can't use binaries built on Ubuntu 18.04, because Ubuntu 18.04 has glibc 2.27 
while CentOS 7 only has 2.17. But if we built the binaries on CentOS 7 with 
devtoolset-7 or 8, all features could compile and the result would work with the 
older glibc.

Things built against an older glibc are guaranteed to work with newer versions, 
and you can't just upgrade glibc because it would break the system. So for 
maximum compatibility, build with the oldest glibc. This is how the Python 
manylinux standards work (IIUC).
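A quick way to check the glibc floor a given machine provides, using only the Python standard library (on non-glibc systems this returns empty strings):

```python
import platform

# Report the C library the running interpreter is linked against.
# Binaries built against a glibc no newer than this should load here.
lib, version = platform.libc_ver()
print(lib, version)  # e.g. "glibc 2.17" on CentOS 7
```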


