[jira] [Created] (ARROW-16149) [Python][FlightRPC] Expose UCX transport to Python
David Li created ARROW-16149:

Summary: [Python][FlightRPC] Expose UCX transport to Python
Key: ARROW-16149
URL: https://issues.apache.org/jira/browse/ARROW-16149
Project: Apache Arrow
Issue Type: Improvement
Components: FlightRPC, Python
Reporter: David Li

The UCX transport lives in a separate shared library, which may complicate distribution (though for 8.0.0 we probably don't care about that yet).
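For a rough sense of what the exposed transport might look like from Python, here is a hypothetical sketch. `pyarrow.flight.connect()` and `list_flights()` are existing APIs, but accepting `ucx://` locations from Python is precisely what this ticket requests and does not exist yet; the address is made up:

{code}
import pyarrow.flight as flight

# Hypothetical once ARROW-16149 lands: the C++ UCX transport's "ucx://"
# URI scheme would be accepted by the same connect() entry point used
# for gRPC locations today.
client = flight.connect("ucx://127.0.0.1:31337")  # address is an assumption

for flight_info in client.list_flights():
    print(flight_info.descriptor)
{code}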
[jira] [Created] (ARROW-16148) [C++] TPC-H generator cleanup
Weston Pace created ARROW-16148:

Summary: [C++] TPC-H generator cleanup
Key: ARROW-16148
URL: https://issues.apache.org/jira/browse/ARROW-16148
Project: Apache Arrow
Issue Type: Bug
Reporter: Weston Pace

An umbrella issue for a number of issues I've run into with our TPC-H generator.

h2. We emit fixed_size_binary fields with nuls padding the strings.

Ideally we would either emit these as utf8 strings like the others, or we would have a toggle to emit them as such (though see below about needing to strip the nuls). When I try to run these, I get a number of segfaults or hangs when running a number of the TPC-H queries.

Additionally, even after converting these to utf8/string types, I also need to strip out the nuls in order to actually query against them:

{code}
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
library(dplyr, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)

tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
tab
#> Table
#> 25 rows x 4 columns
#> $N_NATIONKEY <int32>
#> $N_NAME <fixed_size_binary[25]>
#> $N_REGIONKEY <int32>
#> $N_COMMENT <string>

# This will not work (though it is how the TPC-H queries are structured)
tab %>%
  filter(N_NAME == "JAPAN") %>%
  collect()
#> # A tibble: 0 × 4
#> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary[25]>,
#> #   N_REGIONKEY <int>, N_COMMENT <chr>

# Instead, we need to create the nul-padded string to do the comparison
japan_raw <- as.raw(c(
  0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
))

# confirming this is the same thing as in the data
japan_raw == as.vector(tab$N_NAME)[[13]]
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

tab %>%
  filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>%
  collect()
#> # A tibble: 1 × 4
#>   N_NATIONKEY
#>         <int>
#> 1          12
#> # … with 3 more variables: N_NAME <fixed_size_binary[25]>, N_REGIONKEY <int>,
#> #   N_COMMENT <chr>
{code}

Here is the code I've been using to cast + strip these out after the fact:

{code}
library(arrow, warn.conflicts = FALSE)
options(arrow.skip_nul = TRUE)
options(arrow.use_altrep = FALSE)

tables <- arrowbench:::tpch_tables

for (table_name in tables) {
  message("Working on ", table_name)
  tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), as_data_frame = FALSE)

  for (col in tab$schema$fields) {
    if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
      # cast to string, pull into R (arrow.skip_nul above strips the nuls),
      # and rebuild the column
      tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
    }
  }

  tab <- write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
}
{code}
[jira] [Created] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile
Rok Mihevc created ARROW-16147:

Summary: [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile
Key: ARROW-16147
URL: https://issues.apache.org/jira/browse/ARROW-16147
Project: Apache Arrow
Issue Type: Bug
Reporter: Rok Mihevc
[jira] [Created] (ARROW-16146) [C++] arrow-gcsfs-test is timing out
David Li created ARROW-16146:

Summary: [C++] arrow-gcsfs-test is timing out
Key: ARROW-16146
URL: https://issues.apache.org/jira/browse/ARROW-16146
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: David Li

{noformat}
The following tests FAILED:
	101 - arrow-gcsfs-test (Timeout)
{noformat}

Appears to have started with [an unrelated minor PR|https://github.com/apache/arrow/commit/e047c9a6c9df565b86143036cc6bab26d3a59306]. Observed on master and across several PRs.
[jira] [Created] (ARROW-16145) [C++] Vector kernels should implement or reject null_handling = INTERSECTION
David Li created ARROW-16145:

Summary: [C++] Vector kernels should implement or reject null_handling = INTERSECTION
Key: ARROW-16145
URL: https://issues.apache.org/jira/browse/ARROW-16145
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: David Li

As discovered in ARROW-13530, right now the framework will let you register a vector kernel with null_handling = INTERSECTION, but doesn't actually implement that mode (it'll preallocate the validity bitmap but won't compute the result). We should either implement it, or decide it makes no sense and explicitly reject registering kernels with this null handling mode.
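For context, "intersection" null handling is the semantics that scalar kernels already provide by default: an output slot is valid only where every input slot is valid, i.e. the output validity bitmap is the AND of the input bitmaps. A quick pyarrow illustration of those semantics (this goes through the scalar-kernel path; the ticket is about the C++ vector-kernel registration path):

{code}
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, None, 3])
b = pa.array([10, 20, None])

# Scalar kernels like "add" use INTERSECTION null handling: the result is
# null wherever either input is null (validity = a.valid AND b.valid).
print(pc.add(a, b))  # -> [11, null, null]
{code}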
[jira] [Created] (ARROW-16144) Write compressed data streams (particularly over S3)
Carl Boettiger created ARROW-16144:

Summary: Write compressed data streams (particularly over S3)
Key: ARROW-16144
URL: https://issues.apache.org/jira/browse/ARROW-16144
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 7.0.0
Reporter: Carl Boettiger

The python bindings have `CompressedOutputStream`, but I don't see how we can do this on the R side (e.g. with `write_csv_arrow()`). It would be wonderful if we could both read and write compressed streams, particularly for CSV and particularly for remote filesystems, where this can provide considerable performance improvements.

(For comparison, readr will write a compressed stream automatically based on the extension of the given filename, e.g. `readr::write_csv(data, "file.csv.gz")` or `readr::write_csv(data, "file.csv.xz")`.)
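For reference, this is what the existing Python API referenced above looks like; the ticket asks for an R equivalent. A minimal local-file sketch (the S3 case would wrap a stream opened through a filesystem instead of a path):

{code}
import pyarrow as pa
import pyarrow.csv as csv

table = pa.table({"x": [1, 2, 3]})

# The Python bindings can wrap an output stream (or path) in compression:
with pa.CompressedOutputStream("data.csv.gz", compression="gzip") as out:
    csv.write_csv(table, out)
{code}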
[jira] [Created] (ARROW-16143) Request to upgrade the version of java dependency "jackson"
Hui Yu created ARROW-16143:

Summary: Request to upgrade the version of java dependency "jackson"
Key: ARROW-16143
URL: https://issues.apache.org/jira/browse/ARROW-16143
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 7.0.0
Reporter: Hui Yu
Fix For: 7.0.1, 8.0.0, 9.0.0

CVE-2020-36518 (https://github.com/advisories/GHSA-57j2-w4cx-62h2) reports a security vulnerability in *jackson-databind*. The jackson version on the master branch of Arrow is currently *2.11.4*, which is not safe. Can you upgrade the version of this dependency?
[jira] [Created] (ARROW-16142) [C++] Temporal floor/ceil/round returns incorrect results for date32 and time32 inputs
Rok Mihevc created ARROW-16142:

Summary: [C++] Temporal floor/ceil/round returns incorrect results for date32 and time32 inputs
Key: ARROW-16142
URL: https://issues.apache.org/jira/browse/ARROW-16142
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Rok Mihevc

Temporal rounding and flooring seem to interpret 32-bit input arrays as 64-bit arrays. The following test:

{code:c++}
TEST_F(ScalarTemporalTest, TestCeilFloorRoundTemporalDate) {
  RoundTemporalOptions round_to_2_hours = RoundTemporalOptions(2, CalendarUnit::HOUR);

  const char* date32s = R"([0, 11016, -25932, null])";
  const char* date64s = R"([0, 95178240, -224052480, null])";
  auto dates32 = ArrayFromJSON(date32(), date32s);
  auto dates64 = ArrayFromJSON(date64(), date64s);

  CheckScalarUnary("ceil_temporal", dates64, dates64, &round_to_2_hours);
  CheckScalarUnary("floor_temporal", dates64, dates64, &round_to_2_hours);
  CheckScalarUnary("round_temporal", dates64, dates64, &round_to_2_hours);

  CheckScalarUnary("ceil_temporal", dates32, dates32, &round_to_2_hours);
  CheckScalarUnary("floor_temporal", dates32, dates32, &round_to_2_hours);
  CheckScalarUnary("round_temporal", dates32, dates32, &round_to_2_hours);

  const char* times_s = R"([0, 7200, null])";
  const char* times_ms = R"([0, 720, null])";
  const char* times_us = R"([0, 72, null])";
  const char* times_ns = R"([0, 72000, null])";
  auto arr_s = ArrayFromJSON(time32(TimeUnit::SECOND), times_s);
  auto arr_ms = ArrayFromJSON(time32(TimeUnit::MILLI), times_ms);
  auto arr_us = ArrayFromJSON(time64(TimeUnit::MICRO), times_us);
  auto arr_ns = ArrayFromJSON(time64(TimeUnit::NANO), times_ns);

  CheckScalarUnary("ceil_temporal", arr_s, arr_s, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_ms, arr_ms, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_us, arr_us, &round_to_2_hours);
  CheckScalarUnary("ceil_temporal", arr_ns, arr_ns, &round_to_2_hours);
}
{code}

Returns:

{code:bash}
Got:
  [
    1970-01-01,
    1970-01-01,
    2000-02-29,
    null
  ]
Expected:
  [
    1970-01-01,
    2000-02-29,
    1899-01-01,
    null
  ]
{code}
[jira] [Created] (ARROW-16141) [R] Update rhub/fedora-clang-devel for upstreamed changes
Dewey Dunnington created ARROW-16141:

Summary: [R] Update rhub/fedora-clang-devel for upstreamed changes
Key: ARROW-16141
URL: https://issues.apache.org/jira/browse/ARROW-16141
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Dewey Dunnington

In ARROW-15857 we fixed the nightly failures on rhub/fedora-clang-devel with a kludge that modified the default Makefile, but we also upstreamed the fixes (https://github.com/rstudio/sass/pull/104 and https://github.com/r-hub/rhub-linux-builders/pull/60). Both upstream fixes are now released, so we can remove the kludge from our modification of the docker image.
[jira] [Created] (ARROW-16140) [Python] zoneinfo timezones failing during type inference
Joris Van den Bossche created ARROW-16140:

Summary: [Python] zoneinfo timezones failing during type inference
Key: ARROW-16140
URL: https://issues.apache.org/jira/browse/ARROW-16140
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

The conversion itself works fine (eg when specifying {{type=pa.timestamp("us", tz="America/New_York")}} in the below example), but inferring the type and timezone from the first value fails if it has a zoneinfo timezone:

{code}
In [53]: tz = zoneinfo.ZoneInfo(key='America/New_York')

In [54]: dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo=tz)

In [55]: pa.array([dt])
ArrowInvalid: Object returned by tzinfo.utcoffset(None) is not an instance of datetime.timedelta
{code}

cc [~alenkaf]
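A minimal sketch of the workaround the report mentions: passing the type explicitly sidesteps the failing inference step, so the conversion succeeds.

{code}
import datetime
import zoneinfo

import pyarrow as pa

tz = zoneinfo.ZoneInfo(key="America/New_York")
dt = datetime.datetime(2013, 11, 3, 10, 3, 14, tzinfo=tz)

# Explicitly specifying the type avoids inferring it from the first value:
arr = pa.array([dt], type=pa.timestamp("us", tz="America/New_York"))
print(arr)
{code}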
[jira] [Created] (ARROW-16139) [Python] Crash in tests/test_dataset.py::test_write_dataset_s3
Alessandro Molina created ARROW-16139:

Summary: [Python] Crash in tests/test_dataset.py::test_write_dataset_s3
Key: ARROW-16139
URL: https://issues.apache.org/jira/browse/ARROW-16139
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: Alessandro Molina
Fix For: 8.0.0

{code:java}
Fatal Python error: Segmentation fault

Thread 0x000117170e00 (most recent call first):
  File "/usr/local/lib/python3.9/site-packages/pyarrow/dataset.py", line 927 in write_dataset
  File "/usr/local/lib/python3.9/site-packages/pyarrow/tests/test_dataset.py", line 4265 in test_write_dataset_s3
  File "/usr/local/lib/python3.9/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/python.py", line 1761 in runtest
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 259 in <lambda>
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 338 in from_call
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 258 in call_runtest_hook
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 219 in call_and_report
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 130 in runtestprotocol
  File "/usr/local/lib/python3.9/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 322 in _main
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/usr/local/lib/python3.9/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/usr/local/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/usr/local/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/usr/local/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/usr/local/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/usr/local/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>

ci/scripts/python_test.sh: line 55: 20279 Segmentation fault: 11  pytest -r s -v ${PYTEST_ARGS} --pyargs pyarrow
tests/test_dataset.py::test_write_dataset_s3
{code}
[jira] [Created] (ARROW-16138) [C++] Improve performance of ExecuteScalarExpression
Weston Pace created ARROW-16138:

Summary: [C++] Improve performance of ExecuteScalarExpression
Key: ARROW-16138
URL: https://issues.apache.org/jira/browse/ARROW-16138
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace

One of the things we want to be able to do in the streaming execution engine is process data in small, L2-sized batches. Based on the literature, we would likely want batches somewhere in the range of 1k to 16k rows. In ARROW-16014 we created a benchmark to measure the performance of ExecuteScalarExpression as the size of our batches got smaller. There are two things we observed:

* Something is causing thread contention. We should be able to get pretty close to perfect linear speedup when we are evaluating scalar expressions and the batch fits entirely into L2. We are not seeing that.
* The overhead of ExecuteScalarExpression is too high when processing small batches. Even when the expression is doing real work (e.g. copies, comparisons) the execution time starts to be dominated by overhead when we have 10k-row batches (a rough illustration follows below).
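The benchmark itself is C++ (see ARROW-16014), but the per-batch overhead effect is easy to demonstrate even from Python by timing the same total amount of work split into ever-smaller batches. A rough, hypothetical illustration (the array size, batch sizes, and use of pc.add are mine, not from the ticket):

{code}
import time

import pyarrow as pa
import pyarrow.compute as pc

N = 1 << 22  # ~4M rows total
arr = pa.array(range(N))

for batch_rows in (N, 65536, 16384, 4096, 1024):
    slices = [arr.slice(i, batch_rows) for i in range(0, N, batch_rows)]
    start = time.perf_counter()
    for s in slices:
        pc.add(s, 1)  # same total work; per-call overhead grows as batches shrink
    elapsed = time.perf_counter() - start
    print(f"batch={batch_rows:>8} rows: {elapsed:.3f}s")
{code}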