[GitHub] [arrow-julia] sl-solution opened a new issue #280: Allow missing type without converting to vector

2022-01-26 Thread GitBox


sl-solution opened a new issue #280:
URL: https://github.com/apache/arrow-julia/issues/280


Not sure if it makes sense, but would it be possible to allow the missing type 
without copying the underlying Arrow vector? As far as I understand, allowing 
missing only changes the `Type` of the Arrow vector (e.g. in `Primitive`), not the 
underlying data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-01-26 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-15474:
---

 Summary: [Python] Possibility of a table.drop_duplicates() 
function?
 Key: ARROW-15474
 URL: https://issues.apache.org/jira/browse/ARROW-15474
 Project: Apache Arrow
  Issue Type: Wish
Affects Versions: 6.0.1
Reporter: Lance Dacey
 Fix For: 8.0.0


I noticed that there is a group_by() and sort_by() function in the 7.0.0 
branch. Is it possible to include a drop_duplicates() function as well? 

||id||updated_at||
|1|2022-01-01 04:23:57|
|2|2022-01-01 07:19:21|
|2|2022-01-10 22:14:01|

Something like this, which would return a table without the second row in the 
example above, would be great. 

I am usually reading an append-only dataset and then need to report on the latest 
version of each row. To drop duplicates, I temporarily convert the 
append-only table to a pandas DataFrame, then convert it back to a table 
and save a separate "latest-version" dataset.

{code:python}
# drop_duplicates() is the method proposed in this issue; it does not exist yet.
table.sort_by(
    sorting=[("id", "ascending"), ("updated_at", "ascending")]
).drop_duplicates(subset=["id"], keep="last")
{code}
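
The requested "sort, then keep the last row per id" behaviour can be sketched in plain Python on a list of row dictionaries. This is only an illustration of the semantics, not pyarrow code: the helper name `drop_duplicates_keep_last` and its parameters are hypothetical, and timestamps are assumed to be strings in a format that sorts chronologically.

```python
from operator import itemgetter


def drop_duplicates_keep_last(rows, key, order_by):
    """For each distinct `key` value, keep the row with the largest `order_by`."""
    latest = {}
    # Rows are visited in ascending order, so a later row for the same key
    # overwrites the earlier one and only the newest version survives.
    for row in sorted(rows, key=itemgetter(key, order_by)):
        latest[row[key]] = row
    return list(latest.values())


# The example table from the issue, as plain dictionaries.
rows = [
    {"id": 1, "updated_at": "2022-01-01 04:23:57"},
    {"id": 2, "updated_at": "2022-01-01 07:19:21"},
    {"id": 2, "updated_at": "2022-01-10 22:14:01"},
]

latest_rows = drop_duplicates_keep_last(rows, key="id", order_by="updated_at")
```

With the example data, `latest_rows` holds the first and third rows, i.e. the table without the second row.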








--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15473) [C++][FlightRPC] Expose a way to terminate DoExchange stream client side

2022-01-26 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15473:
--

 Summary: [C++][FlightRPC] Expose a way to terminate DoExchange 
stream client side
 Key: ARROW-15473
 URL: https://issues.apache.org/jira/browse/ARROW-15473
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, FlightRPC
Reporter: Rok Mihevc


We want a mechanism to close DoExchange streams from the client side for 
long-running connections. This would be handy for testing and for cases where, 
e.g., a user wants to disconnect.





[jira] [Created] (ARROW-15472) [Website] Add Flight SQL blog post

2022-01-26 Thread David Li (Jira)
David Li created ARROW-15472:


 Summary: [Website] Add Flight SQL blog post
 Key: ARROW-15472
 URL: https://issues.apache.org/jira/browse/ARROW-15472
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: David Li


To go along with/right after the 7.0.0 release announcement.





[jira] [Created] (ARROW-15471) [R] ExtensionType support in R

2022-01-26 Thread Dewey Dunnington (Jira)
Dewey Dunnington created ARROW-15471:


 Summary: [R] ExtensionType support in R
 Key: ARROW-15471
 URL: https://issues.apache.org/jira/browse/ARROW-15471
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Dewey Dunnington


In Python there is support for extension types that consists of a registration 
step that defines functions to handle metadata serialization and 
deserialization. In R, any extension name or metadata at the top level is 
currently obliterated on import. To implement geometry reading and writing to 
Parquet, IPC, and/or Feather, we will need to at the very least have the 
extension name and metadata preserved (in R), and at best provide a 
registration step to customize the behaviour of the resulting Array/DataType.

Reprex for R:

{code:R}
# remotes::install_github("paleolimbot/narrow")
library(narrow)

carray <- as_narrow_array(1:5)

carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
carray$schema$metadata[["something else"]] <- "more bananas"

array <- from_narrow_array(carray, arrow::Array)
carray2 <- as_narrow_array(array)

carray2$schema$metadata[["ARROW:extension:name"]]
#> NULL
carray2$schema$metadata[["ARROW:extension:metadata"]]
#> NULL
carray2$schema$metadata[["something else"]]
#> NULL
{code}


There is some discussion of that as a solution to ARROW-14378, including an 
example of how pandas implements the 'interval' extension type (example 
contributed by [~jorisvandenbossche]).

For the Interval example, there are some different parts living in different 
places:

- The Arrow Extension Type definition for pandas' interval type: 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
- The __arrow_array__ implementation (doing the conversion to arrow): 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
- The __from_arrow__ implementation (conversion arrow -> pandas): 
https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255
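
The registration step described above can be sketched independently of any Arrow binding. The following is a minimal illustration in plain Python, not the pyarrow or R API: the registry, `register_extension_type`, `resolve_extension`, and the `example.geometry` name with its `crs` metadata are all hypothetical. Only the two `ARROW:extension:*` metadata keys come from the reprex.

```python
import json

# Hypothetical registry: extension name -> function that parses its metadata.
_EXTENSION_DESERIALIZERS = {}


def register_extension_type(name, deserialize):
    """Register a metadata deserializer for an extension name."""
    _EXTENSION_DESERIALIZERS[name] = deserialize


def resolve_extension(field_metadata):
    """Return (name, parsed_metadata), or None for a non-extension field."""
    name = field_metadata.get("ARROW:extension:name")
    if name is None:
        return None
    raw = field_metadata.get("ARROW:extension:metadata", "")
    # Unregistered extensions keep their raw metadata, so the name and
    # metadata are preserved even without a registration step.
    deserialize = _EXTENSION_DESERIALIZERS.get(name, lambda value: value)
    return name, deserialize(raw)


register_extension_type("example.geometry", json.loads)

resolved = resolve_extension({
    "ARROW:extension:name": "example.geometry",
    "ARROW:extension:metadata": '{"crs": "EPSG:4326"}',
})
```

The fallback branch corresponds to the "at the very least" case in the issue (name and metadata preserved), while the registered branch corresponds to the "at best" case (customized behaviour).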





[jira] [Created] (ARROW-15470) [C++] Allows user to specify string to be used for missing data when writing CSV dataset

2022-01-26 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15470:


 Summary: [C++] Allows user to specify string to be used for 
missing data when writing CSV dataset
 Key: ARROW-15470
 URL: https://issues.apache.org/jira/browse/ARROW-15470
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


The ability to select the string to be used for missing data was implemented 
for the CSV Writer in ARROW-14903 but would it be possible to also allow this 
when writing CSV datasets?





[jira] [Created] (ARROW-15469) Unable to build pyarrow wheels with manylinux2014 for ppc64le arch

2022-01-26 Thread Marvin Giessing (Jira)
Marvin Giessing created ARROW-15469:
---

 Summary: Unable to build pyarrow wheels with manylinux2014 for 
ppc64le arch
 Key: ARROW-15469
 URL: https://issues.apache.org/jira/browse/ARROW-15469
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Marvin Giessing


Hi, I'm trying to build wheels for ppc64le with manylinux2014 following the 
[documentation|https://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos],
 but when I execute the cmake command I get this error:

 

```

[...]
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_PARQUET=ON \
-DARROW_PYTHON=ON \
-DARROW_BUILD_TESTS=ON \
-DPython3_EXECUTABLE=/opt/python/cp37-cp37m/bin/python3 \
..
 
[...]
-- Creating bundled static library target arrow_bundled_dependencies at /repos/arrow/cpp/build/release/libarrow_bundled_dependencies.a

CMake Error at /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Python3 (missing: Development NumPy Development.Module
  Development.Embed) (found version "3.7.12")
Call Stack (most recent call first):
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPython/Support.cmake:3166 (find_package_handle_standard_args)
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPython3.cmake:490 (include)
  cmake_modules/FindPython3Alt.cmake:46 (find_package)
  src/arrow/python/CMakeLists.txt:22 (find_package)

```

 

Does anyone know what is going wrong here? I installed numpy via the requirements 
files.
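
One way to narrow down which of the missing components (Development, NumPy) CMake cannot see is to ask the interpreter itself. The sketch below is a diagnostic aid, not part of the Arrow build; `python_build_report` is a hypothetical helper, and it is assumed to be run with the same interpreter that is passed via `-DPython3_EXECUTABLE`.

```python
import importlib.util
import os
import sysconfig


def python_build_report():
    """Collect the paths that a build system would need from this interpreter."""
    include_dir = sysconfig.get_paths()["include"]
    report = {
        "include_dir": include_dir,
        # Python.h must exist for the development headers to be usable.
        "has_python_h": os.path.isfile(os.path.join(include_dir, "Python.h")),
        # Directory and file name of the Python library, if configured.
        "libdir": sysconfig.get_config_var("LIBDIR"),
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        "numpy_include": None,
    }
    # Only import numpy if it is installed for this interpreter.
    if importlib.util.find_spec("numpy") is not None:
        import numpy
        report["numpy_include"] = numpy.get_include()
    return report


report = python_build_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `numpy_include` is `None`, numpy is not installed for that particular interpreter, even if it is installed for another Python in the image.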





[jira] [Created] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch

2022-01-26 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-15468:
--

 Summary: [R] [CI] A crossbow job that tests against DuckDB's dev 
branch
 Key: ARROW-15468
 URL: https://issues.apache.org/jira/browse/ARROW-15468
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jonathan Keane


It would be good to test against DuckDB's dev branch to warn us if there are 
impending changes that break something.

Currently, some of our jobs already do this: 
https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51

We should clean this up so that _generally_ builds use the released DuckDB, but 
we can toggle dev DuckDB (and optionally run a separate build that uses the dev 
DuckDB).





[jira] [Created] (ARROW-15467) [Go][Parquet] pqarrow decimal Test fails on s390x

2022-01-26 Thread Matthew Topol (Jira)
Matthew Topol created ARROW-15467:
-

 Summary: [Go][Parquet] pqarrow decimal Test fails on s390x
 Key: ARROW-15467
 URL: https://issues.apache.org/jira/browse/ARROW-15467
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Reporter: Matthew Topol
Assignee: Matthew Topol
 Fix For: 8.0.0


Faulty random decimal generation on big-endian platforms causes tests to fail.
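
The issue does not describe the exact bug, so the following is only a generic illustration of why byte order matters for such a generator, not the Go test code. The helper `interpret` is hypothetical; it reads the same 16 bytes as a signed 128-bit integer in both byte orders.

```python
def interpret(raw):
    """Read the same bytes as a signed integer in both byte orders."""
    return {
        "little": int.from_bytes(raw, "little", signed=True),
        "big": int.from_bytes(raw, "big", signed=True),
    }


# One non-zero byte followed by fifteen zero bytes (16 bytes = 128 bits).
values = interpret(b"\x01" + b"\x00" * 15)
```

The same bytes give 1 when read as little-endian and 2**120 when read as big-endian, so code that fills a byte buffer and then reinterprets it in native order produces different decimals on s390x than on x86.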





[jira] [Created] (ARROW-15466) [Go] Please tag versions for Go modules to recognize

2022-01-26 Thread Jonathan A Sternberg (Jira)
Jonathan A Sternberg created ARROW-15466:


 Summary: [Go] Please tag versions for Go modules to recognize
 Key: ARROW-15466
 URL: https://issues.apache.org/jira/browse/ARROW-15466
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jonathan A Sternberg


Please tag v7 of arrow for Go with the method that Go expects for modules to 
specify their versions. At the current moment, if you want to upgrade to v7, 
you have to give a specific hash or a specific tag as part of the `go get` 
command instead of doing `go get github.com/apache/arrow/go/arrow/v7@latest`. 
This is because there is no `go/v7.0.0` tag pointing at the commit.

There is a `go/v6.0.1`. This request is to tag the versions in v7 with the same 
tag format alongside the `apache-arrow-7.0.0` tag.

See this page for an example of the tag being recognized by Go modules 
properly: [https://pkg.go.dev/github.com/apache/arrow/go/v6]. If I replace that 
with `v7`, it does not currently recognize a stable version: 
[https://pkg.go.dev/github.com/apache/arrow/go/v7].
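
For a Go module that lives in a subdirectory of a repository, the version tag is prefixed with that subdirectory, which is why `go/v6.0.1` works. A release script could derive the tag name along these lines; `go_module_tag` is a hypothetical helper for illustration, not part of the Arrow release tooling.

```python
def go_module_tag(module_subdir, version):
    """Build the Git tag Go expects for a module in `module_subdir`."""
    tag = "v" + version.lstrip("v")
    subdir = module_subdir.strip("/")
    # Modules at the repository root use the bare version tag.
    return f"{subdir}/{tag}" if subdir else tag


tag_for_v7 = go_module_tag("go", "7.0.0")
```

For the 7.0.0 release this yields `go/v7.0.0`, the tag requested above.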





[jira] [Created] (ARROW-15465) [Python][CI] Dataset tests when Parquet is disabled

2022-01-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15465:
--

 Summary: [Python][CI] Dataset tests when Parquet is disabled
 Key: ARROW-15465
 URL: https://issues.apache.org/jira/browse/ARROW-15465
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Antoine Pitrou
 Fix For: 8.0.0


Example build at 
https://app.travis-ci.com/github/apache/arrow/jobs/557089817#L7819






[jira] [Created] (ARROW-15464) [Python] CSV cancellation test flaky on macOS ARM64

2022-01-26 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15464:
--

 Summary: [Python] CSV cancellation test flaky on macOS ARM64
 Key: ARROW-15464
 URL: https://issues.apache.org/jira/browse/ARROW-15464
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Antoine Pitrou
 Fix For: 8.0.0


See for example this build where the test was un-skipped on Apple M1 hardware:
https://github.com/ursacomputing/crossbow/runs/4943189166?check_suite_focus=true

{code}
test-arm64-env/lib/python3.8/site-packages/pyarrow/tests/test_csv.py ... [ 21%]
arrow/ci/scripts/python_wheel_unix_test.sh: line 84: 73197 Killed: 9
   python -m pytest -r s --pyargs pyarrow
...
Error: Process completed with exit code 137.
{code}
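
The exit code 137 in the log is consistent with the `Killed: 9` line: shells report 128 plus the signal number when a process is terminated by a signal. A small sketch of that decoding, with hypothetical helper names:

```python
import signal


def decode_exit_status(status):
    """Split a shell exit status into ("signal", N) or ("exit", code)."""
    if status > 128:
        return ("signal", status - 128)
    return ("exit", status)


def signal_name(number):
    """Return the platform's name for a signal number, if it has one."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"


kind, number = decode_exit_status(137)
print(kind, number, signal_name(number))
```

So the test process did not fail an assertion; it was killed from outside with signal 9.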





[jira] [Created] (ARROW-15463) [GLib] Add arrow::compute::Utf8NormalizeOptions bindings

2022-01-26 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-15463:
-

 Summary: [GLib] Add arrow::compute::Utf8NormalizeOptions bindings
 Key: ARROW-15463
 URL: https://issues.apache.org/jira/browse/ARROW-15463
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib, Ruby
Reporter: Keisuke Okada








[jira] [Created] (ARROW-15462) [GLib] Add GArrow{Month,DayTime,MonthDayNano}Scalar,Array,Arraybuilder

2022-01-26 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-15462:
-

 Summary: [GLib] Add 
GArrow{Month,DayTime,MonthDayNano}Scalar,Array,Arraybuilder
 Key: ARROW-15462
 URL: https://issues.apache.org/jira/browse/ARROW-15462
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: GLib
Affects Versions: 8.0.0
Reporter: Keisuke Okada








[jira] [Created] (ARROW-15461) [C++] arrow-utility-test fails with clang-12 (TestCopyAndReverseBitmapPreAllocated)

2022-01-26 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-15461:


 Summary: [C++] arrow-utility-test fails with clang-12 
(TestCopyAndReverseBitmapPreAllocated)
 Key: ARROW-15461
 URL: https://issues.apache.org/jira/browse/ARROW-15461
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai


Unit test {{BitUtilTests.TestCopyAndReverseBitmapPreAllocated}} fails when Arrow 
is built in release mode with clang-12, on both x86 and Arm.

From my debugging, it is related to the {{GetReversedBlock}} function [1], when 
right-shifting a uint8 value by 8 bits.
I think it's a compiler bug. In the test code [2], clang-12 returns 1, which 
is wrong; clang-11 and clang-13 both return 2, the correct answer. It looks like 
clang-12 over-optimized the code. There should be no UB in the code (the uint8 is 
promoted to int before the shift).

A workaround is to treat shifting by 8 bits as a special case. Or we can simply 
ignore this error if the compiler bug is confirmed (I didn't find a clang bug 
report).

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.cc#L101
[2] https://godbolt.org/z/TzYWfcP1E





[jira] [Created] (ARROW-15460) [R] Add as.data.frame.Dataset method

2022-01-26 Thread Dragoș Moldovan-Grünfeld (Jira)
Dragoș Moldovan-Grünfeld created ARROW-15460:


 Summary: [R] Add as.data.frame.Dataset method
 Key: ARROW-15460
 URL: https://issues.apache.org/jira/browse/ARROW-15460
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Dragoș Moldovan-Grünfeld


Started with a question from Jim Hester on Twitter:

bq. Is there a way to take an arrow::Dataset and collect all the data into a 
data.frame without using `dplyr::collect()`?

bq. I have a code path I just want to return a regular data.frame, but I don't 
really want to add a soft dplyr dependency just for this.

Twitter thread: https://twitter.com/jimhester_/status/1484624519612579841?s=21

This might also be useful for pillar/tibble. Maybe add a 
{{max_memory_argument}} to avoid allocating too much memory (see suggestion 
from Kirill Müller).


