[jira] [Created] (ARROW-16149) [Python][FlightRPC] Expose UCX transport to Python

2022-04-07 Thread David Li (Jira)
David Li created ARROW-16149:


 Summary: [Python][FlightRPC] Expose UCX transport to Python
 Key: ARROW-16149
 URL: https://issues.apache.org/jira/browse/ARROW-16149
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Python
Reporter: David Li


The UCX transport lives in a separate shared library, which may complicate 
distribution (though for 8.0.0 we probably don't care about that yet).
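
For context, a minimal sketch of what the exposed client side might look like (hypothetical: the {{ucx://}} URI scheme follows the C++ transport, but the Python surface and the location used here are assumptions):

{code:python}
import pyarrow.flight as flight

# Hypothetical usage once the transport is exposed: the C++ UCX transport
# registers the "ucx" URI scheme, so a client would ideally connect to a
# UCX location the same way as to a gRPC one.
client = flight.connect("ucx://localhost:12345")
{code}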



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16148) [C++] TPC-H generator cleanup

2022-04-07 Thread Weston Pace (Jira)
Weston Pace created ARROW-16148:
---

 Summary: [C++] TPC-H generator cleanup
 Key: ARROW-16148
 URL: https://issues.apache.org/jira/browse/ARROW-16148
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Weston Pace


An umbrella issue for a number of issues I've run into with our TPC-H generator.

h2. We emit fixed_size_binary fields with NULs padding the strings
Ideally we would either emit these as utf8 strings like the others, or we would have a toggle to emit them as such (though see below about needing to strip the NULs).

When I try to run a number of the TPC-H queries against these, I get segfaults or hangs.

Additionally, even after converting these to utf8/string types, I also need to strip out the NULs in order to actually query against them:

{code}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
library(dplyr, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)

tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
tab
#> Table
#> 25 rows x 4 columns
#> $N_NATIONKEY <int32>
#> $N_NAME <fixed_size_binary[25]>
#> $N_REGIONKEY <int32>
#> $N_COMMENT <string>

# This will not work (though it is how the TPC-H queries are structured)
tab %>% filter(N_NAME == "JAPAN") %>% collect()
#> # A tibble: 0 × 4
#> # … with 4 variables: N_NATIONKEY <int>, N_NAME <list<raw>>,
#> #   N_REGIONKEY <int>, N_COMMENT <chr>

# Instead, we need to create the nul padded string to do the comparison
japan_raw <- as.raw(
  c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
)
# confirming this is the same thing as in the data 
japan_raw == as.vector(tab$N_NAME)[[13]]
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

tab %>%
  filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>%
  collect()
#> # A tibble: 1 × 4
#>   N_NATIONKEY
#>         <int>
#> 1          12
#> # … with 3 more variables: N_NAME <list<raw>>, N_REGIONKEY <int>,
#> #   N_COMMENT <chr>
{code}

Here is the code I've been using to cast + strip these out after the fact:

{code}
library(arrow, warn.conflicts = FALSE)

options(arrow.skip_nul = TRUE)
options(arrow.use_altrep = FALSE)

tables <- arrowbench:::tpch_tables
  
for (table_name in tables) {
  message("Working on ", table_name)
  tab <- read_parquet(
    glue::glue("./data_arrow_raw/{table_name}_1.parquet"),
    as_data_frame = FALSE
  )

  for (col in tab$schema$fields) {
    if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
      # Cast to string, pull into R (stripping NULs via arrow.skip_nul),
      # then rebuild the column as a utf8 array.
      tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
    }
  }

  tab <- write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
}
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile

2022-04-07 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16147:
--

 Summary: [C++] ParquetFileWriter doesn't call sink_.Close when 
using GcsRandomAccessFile
 Key: ARROW-16147
 URL: https://issues.apache.org/jira/browse/ARROW-16147
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Rok Mihevc






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16146) [C++] arrow-gcsfs-test is timing out

2022-04-07 Thread David Li (Jira)
David Li created ARROW-16146:


 Summary: [C++] arrow-gcsfs-test is timing out
 Key: ARROW-16146
 URL: https://issues.apache.org/jira/browse/ARROW-16146
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: David Li


{noformat}
The following tests FAILED:
101 - arrow-gcsfs-test (Timeout)
{noformat}

Appears to have started with [an unrelated minor PR|https://github.com/apache/arrow/commit/e047c9a6c9df565b86143036cc6bab26d3a59306]. Observed on master and across several PRs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16145) [C++] Vector kernels should implement or reject null_handling = INTERSECTION

2022-04-07 Thread David Li (Jira)
David Li created ARROW-16145:


 Summary: [C++] Vector kernels should implement or reject 
null_handling = INTERSECTION
 Key: ARROW-16145
 URL: https://issues.apache.org/jira/browse/ARROW-16145
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: David Li


As discovered in ARROW-13530, right now the framework will let you register a 
vector kernel with null_handling = INTERSECTION, but doesn't actually implement 
that (it'll preallocate but won't compute the result). We should either 
implement it, or decide it makes no sense and explicitly reject registering 
kernels with this null handling mode.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16144) Write compressed data streams (particularly over S3)

2022-04-07 Thread Carl Boettiger (Jira)
Carl Boettiger created ARROW-16144:
--

 Summary: Write compressed data streams (particularly over S3)
 Key: ARROW-16144
 URL: https://issues.apache.org/jira/browse/ARROW-16144
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Affects Versions: 7.0.0
Reporter: Carl Boettiger


The Python bindings have `CompressedOutputStream`, but I don't see how we can do this on the R side (e.g. with `write_csv_arrow()`). It would be wonderful if we could both read and write compressed streams, particularly for CSV and particularly for remote filesystems, where this can provide considerable performance improvements.
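
For reference, a minimal sketch of the Python-side pattern this asks to mirror (a local path is used for brevity; the same wrapping works with a stream opened from a remote filesystem):

{code:python}
import pyarrow as pa
import pyarrow.csv as csv

table = pa.table({"x": [1, 2, 3]})

# Wrap the destination in a gzip-compressing stream and write CSV through it.
with pa.CompressedOutputStream("data.csv.gz", "gzip") as out:
    csv.write_csv(table, out)
{code}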

(For comparison, readr will write a compressed stream automatically based on the extension of the given filename, e.g. `readr::write_csv(data, "file.csv.gz")` or `readr::write_csv(data, "file.csv.xz")`.)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16143) Request to upgrade the version of java dependency "jackson"

2022-04-07 Thread Hui Yu (Jira)
Hui Yu created ARROW-16143:
--

 Summary: Request to upgrade the version of java dependency 
"jackson"
 Key: ARROW-16143
 URL: https://issues.apache.org/jira/browse/ARROW-16143
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 7.0.0
Reporter: Hui Yu
 Fix For: 7.0.1, 8.0.0, 9.0.0


CVE-2020-36518 (https://github.com/advisories/GHSA-57j2-w4cx-62h2) reports a security vulnerability in *jackson-databind*.

The version of jackson on the Arrow master branch is currently *2.11.4*, which is affected.

Can you upgrade this dependency?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16142) [C++] Temporal floor/ceil/round returns incorrect results for date32 and time32 inputs

2022-04-07 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16142:
--

 Summary: [C++] Temporal floor/ceil/round returns incorrect results 
for date32 and time32 inputs
 Key: ARROW-16142
 URL: https://issues.apache.org/jira/browse/ARROW-16142
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Rok Mihevc


Temporal rounding and flooring seem to interpret 32-bit input arrays as 64-bit arrays. The following test:
{code:c++}
TEST_F(ScalarTemporalTest, TestCeilFloorRoundTemporalDate) {
  RoundTemporalOptions round_to_2_hours = RoundTemporalOptions(2, CalendarUnit::HOUR);
  const char* date32s = R"([0, 11016, -25932, null])";
  const char* date64s = R"([0, 95178240, -224052480, null])";
  auto dates32 = ArrayFromJSON(date32(), date32s);
  auto dates64 = ArrayFromJSON(date64(), date64s);
  CheckScalarUnary("ceil_temporal", dates64, dates64, &round_to_2_hours);
  CheckScalarUnary("floor_temporal", dates64, dates64, &round_to_2_hours);
  CheckScalarUnary("round_temporal", dates64, dates64, &round_to_2_hours);

  CheckScalarUnary("ceil_temporal", dates32, dates32, &round_to_2_hours);
  CheckScalarUnary("floor_temporal", dates32, dates32, &round_to_2_hours);
  CheckScalarUnary("round_temporal", dates32, dates32, &round_to_2_hours);

  const char* times_s = R"([0, 7200, null])";
  const char* times_ms = R"([0, 720, null])";
  const char* times_us = R"([0, 72, null])";
  const char* times_ns = R"([0, 72000, null])";

  auto arr_s = ArrayFromJSON(time32(TimeUnit::SECOND), times_s);
  auto arr_ms = ArrayFromJSON(time32(TimeUnit::MILLI), times_ms);
  auto arr_us = ArrayFromJSON(time64(TimeUnit::MICRO), times_us);
  auto arr_ns = ArrayFromJSON(time64(TimeUnit::NANO), times_ns);

  CheckScalarUnary("ceil_temporal", arr_s, arr_s, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_ms, arr_ms, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_us, arr_us, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_ns, arr_ns, &round_to_2_hours);
}
{code}

Returns:
{code:bash}
Got:
  [
[
  1970-01-01,
  1970-01-01,
  2000-02-29,
  null
]
  ]
Expected:
  [
[
  1970-01-01
],
[
  2000-02-29,
  1899-01-01,
  null
]
  ]
{code}
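
The same kernels are exposed through pyarrow's compute module, so a minimal sketch like the following should hit the same date32 path (assuming the {{*_temporal}} bindings shipped since 7.0; the values are the ones from the test above):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# date32 stores days since the epoch, so flooring to a multiple of two
# hours should leave each value unchanged.
dates = pa.array([0, 11016, -25932, None], type=pa.date32())
print(pc.floor_temporal(dates, multiple=2, unit="hour"))
{code}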



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16141) [R] Update rhub/fedora-clang-devel for upstreamed changes

2022-04-07 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-16141:


 Summary: [R] Update rhub/fedora-clang-devel for upstreamed changes
 Key: ARROW-16141
 URL: https://issues.apache.org/jira/browse/ARROW-16141
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Dewey Dunnington


In ARROW-15857 we fixed the nightly failures on rhub/fedora-clang-devel with a kludge that modifies the default makefile, but we also upstreamed the fixes (https://github.com/rstudio/sass/pull/104 and https://github.com/r-hub/rhub-linux-builders/pull/60). Both upstream fixes are now released, so we can remove the kludge from our modification of the docker image.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16140) [Python] zoneinfo timezones failing during type inference

2022-04-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-16140:
-

 Summary: [Python] zoneinfo timezones failing during type inference
 Key: ARROW-16140
 URL: https://issues.apache.org/jira/browse/ARROW-16140
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The conversion itself works fine (e.g. when specifying {{type=pa.timestamp("us", tz="America/New_York")}} in the example below), but inferring the type and timezone from the first value fails if it has a zoneinfo timezone:

{code}
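# (session assumes: import datetime, zoneinfo; import pyarrow as pa)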
In [53]: tz = zoneinfo.ZoneInfo(key='America/New_York')

In [54]: dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo = tz)

In [55]: pa.array([dt])

ArrowInvalid: Object returned by tzinfo.utcoffset(None) is not an instance of datetime.timedelta
{code}
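
For contrast, a minimal sketch of the explicit-type path described above, which converts fine:

{code:python}
import datetime
import zoneinfo

import pyarrow as pa

tz = zoneinfo.ZoneInfo(key="America/New_York")
dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo=tz)

# Specifying the type up front skips inference from the first value,
# and the conversion succeeds.
arr = pa.array([dt], type=pa.timestamp("us", tz="America/New_York"))
{code}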

cc [~alenkaf]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16139) [Python] Crash in tests/test_dataset.py::test_write_dataset_s3

2022-04-07 Thread Alessandro Molina (Jira)
Alessandro Molina created ARROW-16139:
-

 Summary: [Python] Crash in 
tests/test_dataset.py::test_write_dataset_s3 
 Key: ARROW-16139
 URL: https://issues.apache.org/jira/browse/ARROW-16139
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 7.0.0
Reporter: Alessandro Molina
 Fix For: 8.0.0


{code:java}
Fatal Python error: Segmentation fault

Thread 0x000117170e00 (most recent call first):
  File "/usr/local/lib/python3.9/site-packages/pyarrow/dataset.py", line 927 in write_dataset
  File "/usr/local/lib/python3.9/site-packages/pyarrow/tests/test_dataset.py", line 4265 in test_write_dataset_s3
  File "/usr/local/lib/python3.9/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/python.py", line 1761 in runtest
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 338 in from_call
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 322 in _main
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/usr/local/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>
ci/scripts/python_test.sh: line 55: 20279 Segmentation fault: 11  pytest -r s -v ${PYTEST_ARGS} --pyargs pyarrow
tests/test_dataset.py::test_write_dataset_s3
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16138) [C++] Improve performance of ExecuteScalarExpression

2022-04-07 Thread Weston Pace (Jira)
Weston Pace created ARROW-16138:
---

 Summary: [C++] Improve performance of ExecuteScalarExpression
 Key: ARROW-16138
 URL: https://issues.apache.org/jira/browse/ARROW-16138
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


One of the things we want to be able to do in the streaming execution engine is process data in small, L2-cache-sized batches. Based on the literature we might like to use batches somewhere in the range of 1k to 16k rows. In ARROW-16014 we created a benchmark to measure the performance of ExecuteScalarExpression as the size of our batches got smaller. There are two things we observed:

 * Something is causing thread contention. We should be able to get pretty close to perfect linear speedup when we are evaluating scalar expressions and the batch size fits entirely into L2. We are not seeing that.
 * The overhead of ExecuteScalarExpression is too high when processing small batches. Even when the expression is doing real work (e.g. copies, comparisons), the execution time starts to be dominated by overhead once we get down to 10k-row batches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)