[jira] [Created] (ARROW-15670) [C++/Python/Packaging] Update conda pinnings and enable GCS on Windows
Uwe Korn created ARROW-15670:
-----------------------------

Summary: [C++/Python/Packaging] Update conda pinnings and enable GCS on Windows
Key: ARROW-15670
URL: https://issues.apache.org/jira/browse/ARROW-15670
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15445) [C++/Python] pyarrow build incorrectly detects x86 as system processor during cross-compile
Uwe Korn created ARROW-15445:
-----------------------------

Summary: [C++/Python] pyarrow build incorrectly detects x86 as system processor during cross-compile
Key: ARROW-15445
URL: https://issues.apache.org/jira/browse/ARROW-15445
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Uwe Korn

When cross-compiling {{pyarrow}} for aarch64 or ppc64le we run into the following issue:

{code:java}
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Failed
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Failed
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Failed
-- Arrow build warning level: PRODUCTION
CMake Error at cmake_modules/SetupCxxFlags.cmake:456 (message):
  SSE4.2 required but compiler doesn't support it.
Call Stack (most recent call first):
  CMakeLists.txt:121 (include)
-- Configuring incomplete, errors occurred!
{code}

The error is valid insofar as we are building for a target system that doesn't support SSE at all; the bug is that the x86 checks run in the first place.
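The shape of the needed fix can be sketched in plain Python (names here are illustrative, not Arrow's actual CMake logic): gate the x86 SIMD compile probes on the *target* processor rather than on the build host.

```python
# Hypothetical sketch: only probe for x86 SIMD features (SSE4.2, AVX2,
# AVX512) when the target processor is actually x86, instead of whenever
# the build host happens to be x86_64.
X86_PROCESSORS = {"x86", "x86_64", "amd64", "i386", "i686"}

def should_probe_x86_simd(target_processor: str) -> bool:
    """Return True when running SSE/AVX compile tests makes sense."""
    return target_processor.lower() in X86_PROCESSORS

# Cross-compiling for aarch64 or ppc64le: skip the SSE4.2 check
# instead of failing the configure step.
assert should_probe_x86_simd("x86_64")
assert not should_probe_x86_simd("aarch64")
assert not should_probe_x86_simd("ppc64le")
```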
[jira] [Created] (ARROW-15444) [C++] Compilation with GCC 7.5 fails in aggregate_basic.cc
Uwe Korn created ARROW-15444:
-----------------------------

Summary: [C++] Compilation with GCC 7.5 fails in aggregate_basic.cc
Key: ARROW-15444
URL: https://issues.apache.org/jira/browse/ARROW-15444
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Uwe Korn

Building with GCC 7.5 currently fails with the following internal error. We need to support this GCC version for CUDA-enabled and PPC64LE builds on conda-forge. See also the updated conda recipe in https://github.com/apache/arrow/pull/11916

{code:java}
2022-01-24T14:18:48.2261185Z [182/405] Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/aggregate_basic.cc.o
2022-01-24T14:18:48.2261792Z FAILED: src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/aggregate_basic.cc.o
2022-01-24T14:18:48.2268608Z /build/arrow-cpp-ext_1643033227908/_build_env/bin/powerpc64le-conda-linux-gnu-c++ -DARROW_EXPORTING -DARROW_HDFS -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_MIMALLOC -DARROW_WITH_BACKTRACE -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DURI_STATIC_BUILD -I/build/arrow-cpp-ext_1643033227908/work/cpp/build/src -I/build/arrow-cpp-ext_1643033227908/work/cpp/src -I/build/arrow-cpp-ext_1643033227908/work/cpp/src/generated -isystem /build/arrow-cpp-ext_1643033227908/work/cpp/thirdparty/flatbuffers/include -isystem /build/arrow-cpp-ext_1643033227908/work/cpp/build/jemalloc_ep-prefix/src -isystem /build/arrow-cpp-ext_1643033227908/work/cpp/build/mimalloc_ep/src/mimalloc_ep/include/mimalloc-1.7 -isystem /build/arrow-cpp-ext_1643033227908/work/cpp/build/xsimd_ep/src/xsimd_ep-install/include -isystem /build/arrow-cpp-ext_1643033227908/work/cpp/thirdparty/hadoop/include -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -mcpu=power8 -mtune=power8 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem /build/arrow-cpp-ext_1643033227908/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/include -fdebug-prefix-map=/build/arrow-cpp-ext_1643033227908/work=/usr/local/src/conda/arrow-cpp-7.0.0.dev553 -fdebug-prefix-map=/build/arrow-cpp-ext_1643033227908/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla=/usr/local/src/conda-prefix -fdiagnostics-color=always -fuse-ld=gold -O3 -DNDEBUG -Wall -fno-semantic-interposition -O3 -DNDEBUG -fPIC -std=c++1z -MD -MT src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/aggregate_basic.cc.o -MF src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/aggregate_basic.cc.o.d -o src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/aggregate_basic.cc.o -c /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_basic.cc
2022-01-24T14:18:48.2273037Z In file included from /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/codegen_internal.h:46:0,
2022-01-24T14:18:48.2273811Z                  from /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/util_internal.h:26,
2022-01-24T14:18:48.2274563Z                  from /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_internal.h:20,
2022-01-24T14:18:48.2275318Z                  from /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_basic_internal.h:24,
2022-01-24T14:18:48.2276088Z                  from /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_basic.cc:19:
2022-01-24T14:18:48.2277993Z /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_internal.h: In instantiation of 'arrow::compute::internal::SumArray(const arrow::ArrayData&, ValueFunc&&):: [with ValueType = double; SumType = double; arrow::compute::SimdLevel::type SimdLevel = (arrow::compute::SimdLevel::type)0; ValueFunc = arrow::compute::internal::SumArray(const arrow::ArrayData&) [with ValueType = double; SumType = double; arrow::compute::SimdLevel::type SimdLevel = (arrow::compute::SimdLevel::type)0]::]':
2022-01-24T14:18:48.2281061Z /build/arrow-cpp-ext_1643033227908/work/cpp/src/arrow/compute/kernels/aggregate_internal.h:181:5: required from 'struct arrow::compute::internal::SumArray(const arrow::ArrayData&, ValueFunc&&) [with ValueType = double; SumType = double; arrow::compute::SimdLevel::type SimdLevel = (arrow::compute::SimdLevel::type)0; ValueFunc = arrow::compute::internal::SumArray(const arrow::ArrayData&) [with ValueType = double; SumType =
{code}
[jira] [Created] (ARROW-13140) [C++/Python] Upgrade libthrift pin in the nightlies
Uwe Korn created ARROW-13140:
-----------------------------

Summary: [C++/Python] Upgrade libthrift pin in the nightlies
Key: ARROW-13140
URL: https://issues.apache.org/jira/browse/ARROW-13140
Project: Apache Arrow
Issue Type: Task
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-12649) [Python/Packaging] Move conda-aarch64 to Azure with cross-compilation
Uwe Korn created ARROW-12649:
-----------------------------

Summary: [Python/Packaging] Move conda-aarch64 to Azure with cross-compilation
Key: ARROW-12649
URL: https://issues.apache.org/jira/browse/ARROW-12649
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 5.0.0
[jira] [Created] (ARROW-12420) [C++/Dataset] Reading null columns as dictionary no longer possible
Uwe Korn created ARROW-12420:
-----------------------------

Summary: [C++/Dataset] Reading null columns as dictionary no longer possible
Key: ARROW-12420
URL: https://issues.apache.org/jira/browse/ARROW-12420
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 4.0.0
Reporter: Uwe Korn
Fix For: 4.0.0

Reading a dataset with a dictionary column where some of the files don't contain any data for that column (and thus are typed as null) broke with https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release, though, so I would consider this a regression. It can be reproduced with the following Python snippet:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({"a": [None, None]})
pq.write_table(table, "test.parquet")

schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
fsds = ds.FileSystemDataset.from_paths(
    paths=["test.parquet"],
    schema=schema,
    format=pa.dataset.ParquetFileFormat(),
    filesystem=pa.fs.LocalFileSystem(),
)
fsds.to_table()
{code}

The exception on master is currently:

{code}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
      6     filesystem=pa.fs.LocalFileSystem(),
      7 )
----> 8 fsds.to_table()

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
    456         table : Table instance
    457         """
--> 458         return self._scanner(**kwargs).to_table()
    459
    460     def head(self, int num_rows, **kwargs):

~/Development/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
   2887         result = self.scanner.ToTable()
   2888
-> 2889         return pyarrow_wrap_table(GetResultValue(result))
   2890
   2891     def take(self, object indices):

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
    139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
    140         nogil except -1:
--> 141     return check_status(status)
    142
    143

~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
    116         raise ArrowKeyError(message)
    117     elif status.IsNotImplemented():
--> 118         raise ArrowNotImplementedError(message)
    119     elif status.IsTypeError():
    120         raise ArrowTypeError(message)

ArrowNotImplementedError: Unsupported cast from null to dictionary (no available cast function for target type)
{code}
[jira] [Created] (ARROW-12230) [C++/Python/Packaging] Move conda aarch64 builds to Azure Pipelines
Uwe Korn created ARROW-12230:
-----------------------------

Summary: [C++/Python/Packaging] Move conda aarch64 builds to Azure Pipelines
Key: ARROW-12230
URL: https://issues.apache.org/jira/browse/ARROW-12230
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Packaging, Python
Reporter: Uwe Korn

We should move the nightly conda builds for aarch64 to Azure Pipelines, as they currently fail on Drone due to the hard 1h timeout. On Azure Pipelines they should work automatically thanks to conda-forge's cross-compilation setup. The necessary trick here is that the {{.ci_support}} files contain a {{target_platform}} line.
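The {{target_platform}} trick can be illustrated with a small Python sketch (the helper names are hypothetical; real {{.ci_support}} files are YAML, and this minimal parser only handles the "key: / - value" list form they use):

```python
from typing import Optional

def target_platform(ci_support_text: str) -> Optional[str]:
    """Extract the target_platform entry from a .ci_support file's text."""
    lines = ci_support_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "target_platform:" and i + 1 < len(lines):
            nxt = lines[i + 1].strip()
            if nxt.startswith("- "):
                return nxt[2:]
    return None

def is_cross_compile(build_platform: str, ci_support_text: str) -> bool:
    """A build is a cross-compilation when target != build platform."""
    tgt = target_platform(ci_support_text)
    return tgt is not None and tgt != build_platform

example = "target_platform:\n- linux-aarch64\n"
assert target_platform(example) == "linux-aarch64"
assert is_cross_compile("linux-64", example)       # x86_64 host, aarch64 target
assert not is_cross_compile("linux-aarch64", example)
```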
[jira] [Created] (ARROW-11724) [C++] Namespace collisions with protobuf 3.15
Uwe Korn created ARROW-11724:
-----------------------------

Summary: [C++] Namespace collisions with protobuf 3.15
Key: ARROW-11724
URL: https://issues.apache.org/jira/browse/ARROW-11724
Project: Apache Arrow
Issue Type: Improvement
Components: C++, FlightRPC
Affects Versions: 3.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 4.0.0

We define {{pb}} as a namespace alias in the Flight sources. This conflicts with protobuf 3.15, which introduces {{pb}} as its own global namespace alias.
[jira] [Created] (ARROW-11372) Support RC verification on macOS-ARM64
Uwe Korn created ARROW-11372:
-----------------------------

Summary: Support RC verification on macOS-ARM64
Key: ARROW-11372
URL: https://issues.apache.org/jira/browse/ARROW-11372
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 3.0.0

The verification scripts contain assumptions that only hold on an x86 system.
[jira] [Created] (ARROW-11198) [Packaging][Python] Ensure setuptools version during build supports markdown
Uwe Korn created ARROW-11198:
-----------------------------

Summary: [Packaging][Python] Ensure setuptools version during build supports markdown
Key: ARROW-11198
URL: https://issues.apache.org/jira/browse/ARROW-11198
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 3.0.0

We use a {{text/markdown}} long description and thus should always build/upload with at least setuptools 38.6.
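A minimal sketch of such a guard (the function name is illustrative; setuptools 38.6 is the first release whose metadata supports a {{text/markdown}} long description on PyPI):

```python
# Guard that the setuptools used for a build/upload is new enough
# (>= 38.6) to carry a text/markdown long_description.
MIN_SETUPTOOLS_FOR_MARKDOWN = (38, 6)

def supports_markdown_long_description(version: str) -> bool:
    """Compare the first two version components against the 38.6 minimum."""
    parts = []
    for piece in version.split(".")[:2]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts) >= MIN_SETUPTOOLS_FOR_MARKDOWN

assert supports_markdown_long_description("38.6.0")
assert supports_markdown_long_description("57.4.0")
assert not supports_markdown_long_description("36.2.1")
```

Such a check could run early in the build script and fail fast with an upgrade hint instead of producing a package whose description renders as plain text.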
[jira] [Created] (ARROW-11127) [C++] Unused cpu_info on non-x86 architecture
Uwe Korn created ARROW-11127:
-----------------------------

Summary: [C++] Unused cpu_info on non-x86 architecture
Key: ARROW-11127
URL: https://issues.apache.org/jira/browse/ARROW-11127
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 2.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-10881) [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
Uwe Korn created ARROW-10881:
-----------------------------

Summary: [C++] EXC_BAD_ACCESS in BaseSetBitRunReader::NextRun
Key: ARROW-10881
URL: https://issues.apache.org/jira/browse/ARROW-10881
Project: Apache Arrow
Issue Type: Task
Components: C++
Affects Versions: 2.0.0
Reporter: Uwe Korn

{{./release/parquet-encoding-benchmark}} fails with

{code}
BM_PlainDecodingFloat/65536   4206 ns   4206 ns   167354 bytes_per_second=58.0474G/s
error: libparquet.300.dylib debug map object file '/Users/uwe/Development/arrow/cpp/build/src/parquet/CMakeFiles/parquet_objlib.dir/encoding.cc.o' has changed (actual time is 2020-12-10 22:57:29.0, debug map time is 2020-12-10 21:02:52.0) since this executable was linked, file will be ignored
Process 11120 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00010047fe04 libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 192
libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun:
->  0x10047fe04 <+192>: ldur   x11, [x9, #-0x8]
    0x10047fe08 <+196>: str    x9, [x19]
    0x10047fe0c <+200>: str    x11, [x19, #0x18]
    0x10047fe10 <+204>: rbit   x10, x11
Target 0: (parquet-encoding-benchmark) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00010047fe04 libparquet.300.dylib`arrow::internal::BaseSetBitRunReader::NextRun() + 192
    frame #1: 0x00010047f808 libparquet.300.dylib`parquet::(anonymous namespace)::PlainEncoder >::PutSpaced(bool const*, int, unsigned char const*, long long) + 336
    frame #2: 0x00018970 parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(benchmark::State&) at encoding_benchmark.cc:249:14 [opt]
    frame #3: 0x0001881c parquet-encoding-benchmark`parquet::BM_PlainEncodingSpacedBoolean(state=0x00016fdfd4b8) at encoding_benchmark.cc:257 [opt]
    frame #4: 0x0001001614f4 libbenchmark.0.dylib`benchmark::internal::BenchmarkInstance::Run(unsigned long long, int, benchmark::internal::ThreadTimer*, benchmark::internal::ThreadManager*) const + 68
    frame #5: 0x000100173ae8 libbenchmark.0.dylib`benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::BenchmarkInstance const*, unsigned long long, int, benchmark::internal::ThreadManager*) + 80
    frame #6: 0x0001001723c8 libbenchmark.0.dylib`benchmark::internal::RunBenchmark(benchmark::internal::BenchmarkInstance const&, std::__1::vector >*) + 1284
    frame #7: 0x00010015ee7c libbenchmark.0.dylib`benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) + 1824
    frame #8: 0x00010014beec libbenchmark_main.0.dylib`main + 76
    frame #9: 0x00019e270f54 libdyld.dylib`start + 4
{code}
[jira] [Created] (ARROW-10873) [C++] Apple Silicon is reported as arm64 in CMake
Uwe Korn created ARROW-10873:
-----------------------------

Summary: [C++] Apple Silicon is reported as arm64 in CMake
Key: ARROW-10873
URL: https://issues.apache.org/jira/browse/ARROW-10873
Project: Apache Arrow
Issue Type: Task
Components: C++
Affects Versions: 2.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 2.0.1, 3.0.0

Currently we try to build with AVX2 on this platform, which raises a lot of errors.
[jira] [Created] (ARROW-10861) [Python] Update minimal NumPy version to 1.16.6
Uwe Korn created ARROW-10861:
-----------------------------

Summary: [Python] Update minimal NumPy version to 1.16.6
Key: ARROW-10861
URL: https://issues.apache.org/jira/browse/ARROW-10861
Project: Apache Arrow
Issue Type: Task
Components: Python
Affects Versions: 2.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 2.0.1, 3.0.0

As part of the mitigation of https://github.com/numpy/numpy/issues/17913
[jira] [Created] (ARROW-10853) [Java] Undeprecate sqlToArrow helpers
Uwe Korn created ARROW-10853:
-----------------------------

Summary: [Java] Undeprecate sqlToArrow helpers
Key: ARROW-10853
URL: https://issues.apache.org/jira/browse/ARROW-10853
Project: Apache Arrow
Issue Type: Bug
Components: Java
Affects Versions: 2.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 3.0.0

These helper functions are really useful when called from Python, as they deal with a lot of Java "internals" that we don't want to handle from the Python side. We would rather keep using these functions. Note that some of them are broken due to recent refactoring and only return 1024 rows (the default iterator size) without the ability to change that.
[jira] [Created] (ARROW-10833) [Python] Avoid usage of NumPy's PyArray_DescrCheck macro
Uwe Korn created ARROW-10833:
-----------------------------

Summary: [Python] Avoid usage of NumPy's PyArray_DescrCheck macro
Key: ARROW-10833
URL: https://issues.apache.org/jira/browse/ARROW-10833
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Uwe Korn
Fix For: 3.0.0, 2.0.1

The macro is faulty in old NumPy versions, which will lead to a lot of issues with the upcoming NumPy 1.20 release.
[jira] [Created] (ARROW-10711) [CI] Remove set-env from auto-tune to work with new GHA settings
Uwe Korn created ARROW-10711:
-----------------------------

Summary: [CI] Remove set-env from auto-tune to work with new GHA settings
Key: ARROW-10711
URL: https://issues.apache.org/jira/browse/ARROW-10711
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration, Developer Tools
Reporter: Uwe Korn
Assignee: Uwe Korn

See https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
[jira] [Created] (ARROW-10616) [Developer] Expand PR labeler to R and Python
Uwe Korn created ARROW-10616:
-----------------------------

Summary: [Developer] Expand PR labeler to R and Python
Key: ARROW-10616
URL: https://issues.apache.org/jira/browse/ARROW-10616
Project: Apache Arrow
Issue Type: Bug
Components: Developer Tools
Reporter: Uwe Korn
Assignee: Uwe Korn

This would help me to browse through past PRs.
[jira] [Created] (ARROW-10509) [C++] Define operator<<(ostream, ParquetException) for clang+Windows
Uwe Korn created ARROW-10509:
-----------------------------

Summary: [C++] Define operator<<(ostream, ParquetException) for clang+Windows
Key: ARROW-10509
URL: https://issues.apache.org/jira/browse/ARROW-10509
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 3.0.0
[jira] [Created] (ARROW-10502) [C++/Python] CUDA detection messes up nightly conda-win builds
Uwe Korn created ARROW-10502:
-----------------------------

Summary: [C++/Python] CUDA detection messes up nightly conda-win builds
Key: ARROW-10502
URL: https://issues.apache.org/jira/browse/ARROW-10502
Project: Apache Arrow
Issue Type: Bug
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-10346) [Python] Default S3 region is eu-central-1 even with LANG=C
Uwe Korn created ARROW-10346:
-----------------------------

Summary: [Python] Default S3 region is eu-central-1 even with LANG=C
Key: ARROW-10346
URL: https://issues.apache.org/jira/browse/ARROW-10346
Project: Apache Arrow
Issue Type: Bug
Reporter: Uwe Korn

Verifying the macOS wheels using {{LANG=C dev/release/verify-release-candidate.sh wheels 2.0.0 2}} fails for me with

{code}
@pytest.mark.s3
def test_s3_real_aws():
    # Exercise connection code with an AWS-backed S3 bucket.
    # This is a minimal integration check for ARROW-9261 and similar issues.
    from pyarrow.fs import S3FileSystem
    fs = S3FileSystem(anonymous=True)
>   assert fs.region == 'us-east-1'  # default region
E   AssertionError: assert 'eu-central-1' == 'us-east-1'
E     - us-east-1
E     + eu-central-1
{code}
[jira] [Created] (ARROW-10302) [Python] Don't double-package plasma-store-server
Uwe Korn created ARROW-10302:
-----------------------------

Summary: [Python] Don't double-package plasma-store-server
Key: ARROW-10302
URL: https://issues.apache.org/jira/browse/ARROW-10302
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 3.0.0

This binary is part of both the {{arrow-cpp}} and {{pyarrow}} conda packages. We shouldn't ship the version in {{pyarrow}}, as it is just a copy to a different location.
[jira] [Created] (ARROW-10253) [Python] Don't bundle plasma-store-server in pyarrow conda package
Uwe Korn created ARROW-10253:
-----------------------------

Summary: [Python] Don't bundle plasma-store-server in pyarrow conda package
Key: ARROW-10253
URL: https://issues.apache.org/jira/browse/ARROW-10253
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn

We currently have it in both the {{arrow-cpp}} and the {{pyarrow}} conda packages; we should only ship it in {{arrow-cpp}}, as that package is always present and is also the source of the binary.
[jira] [Created] (ARROW-10252) [Python] Add option to skip inclusion of Arrow headers in Python installation
Uwe Korn created ARROW-10252:
-----------------------------

Summary: [Python] Add option to skip inclusion of Arrow headers in Python installation
Key: ARROW-10252
URL: https://issues.apache.org/jira/browse/ARROW-10252
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn

We don't want to have them as part of the conda package, as the single source should be {{arrow-cpp}}.
[jira] [Created] (ARROW-10104) [Python] Separate tests into their own conda package
Uwe Korn created ARROW-10104:
-----------------------------

Summary: [Python] Separate tests into their own conda package
Key: ARROW-10104
URL: https://issues.apache.org/jira/browse/ARROW-10104
Project: Apache Arrow
Issue Type: Bug
Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn

We currently ship the tests with the source code. This is nice for testing the integrity of an installation, but it is not needed at runtime. In the case of conda, the overhead of turning them into a separate package is small.
[jira] [Created] (ARROW-10081) [C++/Python] Fix bash syntax in drone.io conda builds
Uwe Korn created ARROW-10081:
-----------------------------

Summary: [C++/Python] Fix bash syntax in drone.io conda builds
Key: ARROW-10081
URL: https://issues.apache.org/jira/browse/ARROW-10081
Project: Apache Arrow
Issue Type: Bug
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-10049) [C++/Python] Sync conda recipe with conda-forge
Uwe Korn created ARROW-10049:
-----------------------------

Summary: [C++/Python] Sync conda recipe with conda-forge
Key: ARROW-10049
URL: https://issues.apache.org/jira/browse/ARROW-10049
Project: Apache Arrow
Issue Type: Bug
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-10011) [C++] Make FindRE2.cmake re-entrant
Uwe Korn created ARROW-10011:
-----------------------------

Summary: [C++] Make FindRE2.cmake re-entrant
Key: ARROW-10011
URL: https://issues.apache.org/jira/browse/ARROW-10011
Project: Apache Arrow
Issue Type: Bug
Components: C++, FlightRPC
Affects Versions: 1.0.1, 1.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn

Repeated calls to FindRE2.cmake try to recreate the existing target {{RE2::re2}}, which is prohibited by CMake and fails with the following error:

{code}
CMake Warning (dev) at C:/Miniconda37-x64/envs/arrow/Library/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:272 (message):
  The package name passed to `find_package_handle_standard_args` (RE2) does
  not match the name of the calling package (re2).  This can lead to
  problems in calling code that expects `find_package` result variables
  (e.g., `_FOUND`) to follow a certain pattern.
Call Stack (most recent call first):
  cmake_modules/FindRE2.cmake:63 (find_package_handle_standard_args)
  C:/Miniconda37-x64/envs/arrow/Library/lib/cmake/grpc/gRPCConfig.cmake:21 (find_package)
  cmake_modules/ThirdpartyToolchain.cmake:2472 (find_package)
  CMakeLists.txt:495 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

CMake Error at cmake_modules/FindRE2.cmake:66 (add_library):
  add_library cannot create imported target "RE2::re2" because another
  target with the same name already exists.
Call Stack (most recent call first):
  C:/Miniconda37-x64/envs/arrow/Library/lib/cmake/grpc/gRPCConfig.cmake:21 (find_package)
  cmake_modules/ThirdpartyToolchain.cmake:2472 (find_package)
  CMakeLists.txt:495 (include)
{code}

Note that this issue currently only occurs on case-insensitive file systems when ARROW_FLIGHT=ON is set.
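The usual CMake remedy for this class of error is to guard target creation with `if(TARGET RE2::re2)` so repeated `find_package` calls become no-ops. A hypothetical Python analogy of that idempotency guard (the names are illustrative, not Arrow's code):

```python
# Create an "imported target" once; repeated registration attempts are
# harmless no-ops instead of hard errors -- the essence of re-entrancy.
_targets = set()

def add_imported_target(name: str) -> bool:
    """Return True if the target was created, False if it already existed."""
    if name in _targets:  # the re-entrancy guard
        return False
    _targets.add(name)
    return True

assert add_imported_target("RE2::re2")      # first find_package: created
assert not add_imported_target("RE2::re2")  # second find_package: no-op
```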
[jira] [Created] (ARROW-9933) [Developer] Add drone as a CI provider for crossbow
Uwe Korn created ARROW-9933:
----------------------------

Summary: [Developer] Add drone as a CI provider for crossbow
Key: ARROW-9933
URL: https://issues.apache.org/jira/browse/ARROW-9933
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-9929) [Developer] Autotune cmake-format
Uwe Korn created ARROW-9929:
----------------------------

Summary: [Developer] Autotune cmake-format
Key: ARROW-9929
URL: https://issues.apache.org/jira/browse/ARROW-9929
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-9882) [C++/Python] Update conda-forge-pinning to 3 for OSX conda packages
Uwe Korn created ARROW-9882:
----------------------------

Summary: [C++/Python] Update conda-forge-pinning to 3 for OSX conda packages
Key: ARROW-9882
URL: https://issues.apache.org/jira/browse/ARROW-9882
Project: Apache Arrow
Issue Type: Bug
Components: C++, Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
[jira] [Created] (ARROW-9879) [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars
Uwe Korn created ARROW-9879:
----------------------------

Summary: [Python] ChunkedArray.__getitem__ doesn't work with numpy scalars
Key: ARROW-9879
URL: https://issues.apache.org/jira/browse/ARROW-9879
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.0, 1.0.1
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 2.0.0

{code}
import pyarrow as pa
import numpy as np

pa.chunked_array(pa.array([1, 2]))[np.int32(0)]
{code}

fails with the error {{TypeError: key must either be a slice or integer}}.
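A pure-Python sketch of the usual fix pattern for this kind of bug (names are illustrative, not pyarrow's actual internals): normalize the key with `operator.index()`, which accepts anything implementing `__index__`, including NumPy integer scalars, instead of an `isinstance(key, int)` check.

```python
import operator

def normalize_key(key, length):
    """Accept slices and any integer-like key (e.g. np.int32)."""
    if isinstance(key, slice):
        return key
    try:
        index = operator.index(key)  # works for int, np.int32, np.int64, ...
    except TypeError:
        raise TypeError("key must either be a slice or integer")
    return index if index >= 0 else index + length

class FakeNumpyInt:
    """Stand-in for np.int32 so this sketch has no NumPy dependency."""
    def __init__(self, value):
        self.value = value
    def __index__(self):
        return self.value

assert normalize_key(FakeNumpyInt(0), 2) == 0
assert normalize_key(-1, 2) == 1  # negative indices wrap around
```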
[jira] [Created] (ARROW-9589) [C++/R] arrow_exports.h contains structs declared as class
Uwe Korn created ARROW-9589:
----------------------------

Summary: [C++/R] arrow_exports.h contains structs declared as class
Key: ARROW-9589
URL: https://issues.apache.org/jira/browse/ARROW-9589
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 1.0.0
Reporter: Uwe Korn
Assignee: Uwe Korn
Fix For: 2.0.0

This is an issue in an MSVC-based toolchain.
[jira] [Created] (ARROW-9588) [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException not correctly triggered
Uwe Korn created ARROW-9588: --- Summary: [C++] clang/win: Copy constructor of ParquetInvalidOrCorruptedFileException not correctly triggered Key: ARROW-9588 URL: https://issues.apache.org/jira/browse/ARROW-9588 Project: Apache Arrow Issue Type: Bug Reporter: Uwe Korn The copy constructor of ParquetInvalidOrCorruptedFileException doesn't seem to be taken correctly when building with clang 9.0.1 on Windows in a MSVC toolchain. Adding {{ParquetInvalidOrCorruptedFileException(const ParquetInvalidOrCorruptedFileException&) = default;}} as an explicit copy constructor didn't help. Happy to any ideas here, probably a long shot as there are other clang-msvc problems. {code} [49/62] Building CXX object src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj FAILED: src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj C:\Users\Administrator\miniconda3\conda-bld\arrow-cpp-ext_1595962790058\_build_env\Library\bin\clang++.exe -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_S SE4_2 -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DAWS_COMMON_USE_IMPORT_EXPORT -DAWS_EVE NT_STREAM_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 -DAWS_SDK_VERSION_MINOR=7 -DAWS_SDK_VERSION_PATCH=164 -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DNOMINMAX -DPARQUET_EXPORTING -DUSE_IMPORT_EXPORT -DUSE_IMPORT _EXPORT=1 -DUSE_WINDOWS_DLL_SEMANTICS -D_CRT_SECURE_NO_WARNINGS -Dparquet_shared_EXPORTS -Isrc -I../src -I../src/generated -isystem ../thirdparty/flatbuffers/include -isystem C:/Users/Administrator/minico nda3/conda-bld/arrow-cpp-ext_1595962790058/_h_env/Library/include -isystem ../thirdparty/hadoop/include -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -march=k8 -mtune=haswell -ftree-vectorize -fstack-protector-strong -O2 -ffunction-sections -pipe -D_CRT_SECURE_NO_WARNINGS -D_MT -D_DLL -nostdlib -Xclang 
--dependent-lib=msvcrt -fuse-ld=lld -fno-aligned-allocation -Qunused-arguments -fcolor-diagn ostics -O3 -DNDEBUG -Wa,-mbig-obj -Wall -Wno-unknown-warning-option -Wno-pass-failed -msse4.2 -O3 -DNDEBUG -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -std=c++14 -MD -MT src/parquet/CMakeFiles/parquet _shared.dir/Unity/unity_1_cxx.cxx.obj -MF src\parquet\CMakeFiles\parquet_shared.dir\Unity\unity_1_cxx.cxx.obj.d -o src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx.obj -c src/parquet/CMakeF iles/parquet_shared.dir/Unity/unity_1_cxx.cxx In file included from src/parquet/CMakeFiles/parquet_shared.dir/Unity/unity_1_cxx.cxx:3: In file included from C:/Users/Administrator/miniconda3/conda-bld/arrow-cpp-ext_1595962790058/work/cpp/src/parquet/column_scanner.cc:18: In file included from ../src\parquet/column_scanner.h:29: In file included from ../src\parquet/column_reader.h:25: In file included from ../src\parquet/exception.h:26: In file included from ../src\parquet/platform.h:23: In file included from ../src\arrow/buffer.h:28: In file included from ../src\arrow/status.h:25: ../src\arrow/util/string_builder.h:49:10: error: invalid operands to binary expression ('std::ostream' (aka 'basic_ostream >') and 'parquet::ParquetInvalidOrCorruptedFileException' ) stream << head; ~~ ^ ../src\arrow/util/string_builder.h:61:3: note: in instantiation of function template specialization 'arrow::util::StringBuilderRecursive' requested here StringBuilderRecursive(ss.stream(), std::forward(args)...); ^ ../src\arrow/status.h:160:31: note: in instantiation of function template specialization 'arrow::util::StringBuilder' requested here return Status(code, util::StringBuilder(std::forward(args)...)); ^ ../src\arrow/status.h:204:20: note: in instantiation of function template specialization 'arrow::Status::FromArgs' requested here return Status::FromArgs(StatusCode::Invalid, std::forward(args)...); ^ ../src\parquet/exception.h:129:49: note: in instantiation of function template 
specialization 'arrow::Status::Invalid' requested here : ParquetStatusException(::arrow::Status::Invalid(std::forward(args)...)) {} ^ C:/Users/Administrator/miniconda3/conda-bld/arrow-cpp-ext_1595962790058/work/cpp/src/parquet/file_reader.cc:270:13: note: in instantiation of function template specialization 'parquet::ParquetInvalidOrCorruptedFileException::ParquetInvalidOrCorruptedFileException' requested here throw ParquetInvalidOrCorruptedFileException("Parquet file size is 0 bytes"); ^ C:\BuildTools\VC\Tools\MSVC\14.16.27023\include\ostream:480:36: note: candidate function not viable: no known conversion from 'parquet::ParquetInvalidOrCorruptedFileException' to 'const void *' for 1st argument; take the
[jira] [Created] (ARROW-9560) [Packaging] conda recipes failing due to missing conda-forge.yml
Uwe Korn created ARROW-9560: --- Summary: [Packaging] conda recipes failing due to missing conda-forge.yml Key: ARROW-9560 URL: https://issues.apache.org/jira/browse/ARROW-9560 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9535) [Python] Remove symlink fixes from conda recipe
Uwe Korn created ARROW-9535: --- Summary: [Python] Remove symlink fixes from conda recipe Key: ARROW-9535 URL: https://issues.apache.org/jira/browse/ARROW-9535 Project: Apache Arrow Issue Type: Bug Components: Packaging, Python Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9504) [C++/Python] Segmentation fault on ChunkedArray.take
Uwe Korn created ARROW-9504: --- Summary: [C++/Python] Segmentation fault on ChunkedArray.take Key: ARROW-9504 URL: https://issues.apache.org/jira/browse/ARROW-9504 Project: Apache Arrow Issue Type: Bug Reporter: Uwe Korn Fix For: 1.0.0 This leads to a segmentation fault with the latest conda nightlies on Python 3.8 / macOS: {code} import pyarrow as pa import numpy as np arr = pa.chunked_array([ [ "m", "J", "q", "k", "t" ], [ "m", "J", "q", "k", "t" ] ]) indices = np.array([0, 5, 1, 6, 2, 7, 3, 8, 4, 9]) arr.take(indices) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9489) [C++] Add fill_null kernel implementation for (array[string], scalar[string])
Uwe Korn created ARROW-9489: --- Summary: [C++] Add fill_null kernel implementation for (array[string], scalar[string]) Key: ARROW-9489 URL: https://issues.apache.org/jira/browse/ARROW-9489 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe Korn Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9460) [C++] BinaryContainsExact doesn't cope with double characters in the pattern
Uwe Korn created ARROW-9460: --- Summary: [C++] BinaryContainsExact doesn't cope with double characters in the pattern Key: ARROW-9460 URL: https://issues.apache.org/jira/browse/ARROW-9460 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9433) [C++/Python] Add option to Take kernel to interpret negative indices as NULL
Uwe Korn created ARROW-9433: --- Summary: [C++/Python] Add option to Take kernel to interpret negative indices as NULL Key: ARROW-9433 URL: https://issues.apache.org/jira/browse/ARROW-9433 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Uwe Korn Fix For: 2.0.0 Currently negative integers are explicitly forbidden in the {{Take}} kernel. It would be nice to have the option to treat negative integers as NULL instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9432) [C++/Python] Add option to Take kernel to interpret negative indices as indexing from the right
Uwe Korn created ARROW-9432: --- Summary: [C++/Python] Add option to Take kernel to interpret negative indices as indexing from the right Key: ARROW-9432 URL: https://issues.apache.org/jira/browse/ARROW-9432 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Uwe Korn Fix For: 2.0.0 Currently negative integers are explicitly forbidden in the {{Take}} kernel. It would be nice to have the option to treat negative integers as "indices from the right" instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
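The two Take options proposed in ARROW-9433 and ARROW-9432 can be sketched in plain Python. This is only an illustration of the intended semantics (the parameter name {{negative_indices}} is made up here, not the actual Arrow kernel API):

```python
def take(values, indices, negative_indices="raise"):
    """Select values[i] for each i in indices.

    negative_indices controls how i < 0 is handled:
      - "raise":      reject, mirroring the current Take behavior
      - "null":       emit a null (ARROW-9433's proposal)
      - "from_right": index from the end of the array (ARROW-9432's proposal)
    """
    out = []
    for i in indices:
        if i < 0:
            if negative_indices == "null":
                out.append(None)
                continue
            if negative_indices == "from_right":
                i += len(values)
            if i < 0:
                raise IndexError("negative index out of bounds: %d" % i)
        out.append(values[i])
    return out
```

For example, `take(["a", "b", "c"], [0, -1], negative_indices="from_right")` selects the first and last elements, while the `"null"` mode maps the same `-1` to a null value instead.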
[jira] [Created] (ARROW-9431) [C++/Python] Kernel for SetItem(IntegerArray, values)
Uwe Korn created ARROW-9431: --- Summary: [C++/Python] Kernel for SetItem(IntegerArray, values) Key: ARROW-9431 URL: https://issues.apache.org/jira/browse/ARROW-9431 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Affects Versions: 2.0.0 Reporter: Uwe Korn We should have a kernel that allows overriding the values of an array using an integer array as the indexer and a scalar or array of equal length as the values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9430) [C++/Python] Kernel for SetItem(BooleanArray, values)
Uwe Korn created ARROW-9430: --- Summary: [C++/Python] Kernel for SetItem(BooleanArray, values) Key: ARROW-9430 URL: https://issues.apache.org/jira/browse/ARROW-9430 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Uwe Korn Fix For: 2.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
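The two SetItem forms proposed here and in ARROW-9431 can be sketched in plain Python. The function names are illustrative only; the real kernels would operate on immutable Arrow arrays and return new arrays rather than mutating in place:

```python
def set_item_by_index(values, indices, new_values):
    """Return a copy of values with values[i] replaced for each i in indices.

    new_values may be a scalar (broadcast to all indices) or a
    sequence of the same length as indices.
    """
    out = list(values)
    if not isinstance(new_values, (list, tuple)):
        new_values = [new_values] * len(indices)
    if len(new_values) != len(indices):
        raise ValueError("length mismatch between indices and new values")
    for i, v in zip(indices, new_values):
        out[i] = v
    return out


def set_item_by_mask(values, mask, new_values):
    """Return a copy of values replacing positions where mask is True."""
    indices = [i for i, m in enumerate(mask) if m]
    return set_item_by_index(values, indices, new_values)
```

The boolean-mask form reduces to the integer-index form by first converting the mask to the positions of its true entries.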
[jira] [Created] (ARROW-9429) [Python] ChunkedArray.to_numpy
Uwe Korn created ARROW-9429: --- Summary: [Python] ChunkedArray.to_numpy Key: ARROW-9429 URL: https://issues.apache.org/jira/browse/ARROW-9429 Project: Apache Arrow Issue Type: New Feature Reporter: Uwe Korn Fix For: 2.0.0 Currently one needs to construct a {{pandas.Series}} and call {{values}} on it to get a numpy array from a {{ChunkedArray}}. We should provide a simpler {{to_numpy}} function that avoids the overhead of constructing the intermediate {{pandas.Series}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9407) [Python] Accept pd.NA as missing value in array constructor
Uwe Korn created ARROW-9407: --- Summary: [Python] Accept pd.NA as missing value in array constructor Key: ARROW-9407 URL: https://issues.apache.org/jira/browse/ARROW-9407 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Uwe Korn Fix For: 2.0.0 Currently we don't support using {{pandas.NA}} at all: {code} In [1]: import pyarrow as pa In [2]: import pandas as pd In [3]: pa.array([pd.NA, "A"]) --- ArrowInvalid Traceback (most recent call last) in > 1 pa.array([pd.NA, "A"]) ~/miniconda3/envs/fletcher/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib.array() ~/miniconda3/envs/fletcher/lib/python3.8/site-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array() ~/miniconda3/envs/fletcher/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Could not convert with type NAType: did not recognize Python value type when inferring an Arrow data type {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
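Until such support lands, one workaround is to map the missing-value sentinel to {{None}} before handing the sequence to {{pa.array}}. A minimal sketch, using a stand-in NA singleton so the example is self-contained (with pandas installed you would pass {{pd.NA}} as the sentinel):

```python
class _NAType:
    """Stand-in for pandas.NA: a singleton marking a missing value."""
    def __repr__(self):
        return "<NA>"

NA = _NAType()

def replace_na(seq, sentinel=NA):
    """Replace every occurrence of the sentinel with None.

    Identity comparison ('is') matches how pd.NA is a singleton.
    """
    return [None if v is sentinel else v for v in seq]

# pa.array(replace_na([NA, "A"])) would then infer a string array,
# since pyarrow already understands None as a missing value.
```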
[jira] [Created] (ARROW-9401) [C++/Python] Support necessary functionality to have an Arrow-string type in pandas
Uwe Korn created ARROW-9401: --- Summary: [C++/Python] Support necessary functionality to have an Arrow-string type in pandas Key: ARROW-9401 URL: https://issues.apache.org/jira/browse/ARROW-9401 Project: Apache Arrow Issue Type: Wish Components: C++, Python Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 2.0.0 This should serve as an umbrella issue for the needed functionality to have an Apache Arrow-backed string type in {{pandas}}. In addition to the string kernels, we probably need to implement some more support functionality to efficiently serve the {{pandas}} interfaces. Some of these functions are already present in {{fletcher}}, but a native string type in {{pandas}} should not have a hard dependency on {{numba}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9160) [C++] Implement string/binary contains for exact matches
Uwe Korn created ARROW-9160: --- Summary: [C++] Implement string/binary contains for exact matches Key: ARROW-9160 URL: https://issues.apache.org/jira/browse/ARROW-9160 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 1.0.0 Implement {{contains}} for exact substring matches. Using the Knuth–Morris–Pratt algorithm, we should be able to do this in linear runtime with a tiny bit of preprocessing at invocation time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
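The Knuth–Morris–Pratt approach mentioned above can be sketched in plain Python (the actual kernel is C++): build the failure table for the pattern once (the "tiny bit of preprocessing"), then scan the haystack in a single linear pass without ever re-reading a character. Note this also handles repeated characters in the pattern, the case that later broke the first implementation (ARROW-9460).

```python
def contains_exact(haystack: str, pattern: str) -> bool:
    """Return True if pattern occurs in haystack, via KMP in O(n + m)."""
    if not pattern:
        return True
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # single forward pass over the haystack
    k = 0
    for ch in haystack:
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return True
    return False
```

For example, the pattern "aab" in "aaab" requires falling back via the failure table after the third 'a', which a naive restart-from-scratch scan handles incorrectly if it advances the haystack pointer past the partial match.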
[jira] [Resolved] (ARROW-9074) [GLib] Add missing arrow-json check
[ https://issues.apache.org/jira/browse/ARROW-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-9074. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7381 [https://github.com/apache/arrow/pull/7381] > [GLib] Add missing arrow-json check > --- > > Key: ARROW-9074 > URL: https://issues.apache.org/jira/browse/ARROW-9074 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9073) [C++] RapidJSON include directory detection doesn't work with RapidJSONConfig.cmake
[ https://issues.apache.org/jira/browse/ARROW-9073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-9073. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7380 [https://github.com/apache/arrow/pull/7380] > [C++] RapidJSON include directory detection doesn't work with > RapidJSONConfig.cmake > --- > > Key: ARROW-9073 > URL: https://issues.apache.org/jira/browse/ARROW-9073 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7893) [Developer][GLib] Document GLib development workflow when using conda environment on GTK-based Linux systems
[ https://issues.apache.org/jira/browse/ARROW-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128898#comment-17128898 ] Uwe Korn commented on ARROW-7893: - [~kou] Can you give me a pointer at which stage the library is loaded, i.e. where {{LD_LIBRARY_PATH}} does come into action? Then I can have a look at the conda packaging. > [Developer][GLib] Document GLib development workflow when using conda > environment on GTK-based Linux systems > > > Key: ARROW-7893 > URL: https://issues.apache.org/jira/browse/ARROW-7893 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, GLib >Reporter: Wes McKinney >Assignee: Kouhei Sutou >Priority: Major > > I periodically deal with annoying errors like: > {code} > checking for GLIB - version >= 2.32.4... > *** 'pkg-config --modversion glib-2.0' returned 2.58.3, but GLIB (2.56.4) > *** was found! If pkg-config was correct, then it is best > *** to remove the old version of GLib. You may also be able to fix the error > *** by modifying your LD_LIBRARY_PATH enviroment variable, or by editing > *** /etc/ld.so.conf. Make sure you have run ldconfig if that is > *** required on your system. > *** If pkg-config was wrong, set the environment variable PKG_CONFIG_PATH > *** to point to the correct configuration files > no > configure: error: GLib isn't available > make: *** No targets specified and no makefile found. Stop. > make: *** No rule to make target 'install'. Stop. 
> Traceback (most recent call last): > 2: from /home/wesm/code/arrow/c_glib/test/run-test.rb:30:in `' > 1: from /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in > `require' > /usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require': > cannot load such file -- gi (LoadError) > {code} > The problem is that I have one version of glib on my Linux system while > another in the activated conda environment, it seems that there is a conflict > even though {{$PKG_CONFIG_PATH}} is set to ignore system directories > https://gist.github.com/wesm/e62bf4517468be78200e8dd6db0fc544 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9066) [Python] Raise correct error in isnull()
Uwe Korn created ARROW-9066: --- Summary: [Python] Raise correct error in isnull() Key: ARROW-9066 URL: https://issues.apache.org/jira/browse/ARROW-9066 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127531#comment-17127531 ] Uwe Korn commented on ARROW-8961: - We should definitely run benchmarks, as the utf8proc issue tracker mentions that {{icu}} seems to be significantly faster than {{utf8proc}}. Still, {{icu}} is much fatter than {{utf8proc}} and we probably need exactly the functionality that is part of {{utf8proc}}, not more. > [C++] Vendor utf8proc library > - > > Key: ARROW-8961 > URL: https://issues.apache.org/jira/browse/ARROW-8961 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is a minimal MIT-licensed library for UTF-8 data processing originally > developed for use in Julia > https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available
[ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125743#comment-17125743 ] Uwe Korn commented on ARROW-2079: - For the datasets we write in {{kartothek}}, we only write {{_common_metadata}} (I think Apache Drill does the same). This is useful for having the schema for the whole dataset, but writing the {{_metadata}} file with all information would be too expensive and in the {{kartothek}} case even useless. > [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't > available > --- > > Key: ARROW-2079 > URL: https://issues.apache.org/jira/browse/ARROW-2079 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Minor > Labels: dataset, dataset-parquet-read, parquet > > Currently pyarrow's parquet writer only writes `_common_metadata` and not > `_metadata`. From what I understand these are intended to contain the dataset > schema but not any row group information. > > A few (possibly naive) questions: > > 1. In the `__init__` for `ParquetDataset`, the following lines exist: > {code:java} > if self.metadata_path is not None: > with self.fs.open(self.metadata_path) as f: > self.common_metadata = ParquetFile(f).metadata > else: > self.common_metadata = None > {code} > I believe this should use `common_metadata_path` instead of `metadata_path`, > as the latter is never written by `pyarrow`, and is given by the `_metadata` > file instead of `_common_metadata` (as seemingly intended?). > > 2. In `validate_schemas` I believe an option should exist for using the > schema from `_common_metadata` instead of `_metadata`, as pyarrow currently > only writes the former, and as far as I can tell `_common_metadata` does > include all the schema information needed. 
> > Perhaps the logic in `validate_schemas` could be ported over to: > > {code:java} > if self.schema is not None: > pass # schema explicitly provided > elif self.metadata is not None: > self.schema = self.metadata.schema > elif self.common_metadata is not None: > self.schema = self.common_metadata.schema > else: > self.schema = self.pieces[0].get_metadata(open_file).schema{code} > If these changes are valid, I'd be happy to submit a PR. It's not 100% clear > to me the difference between `_common_metadata` and `_metadata`, but I > believe the schema in both should be the same. Figured I'd open this for > discussion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9026) [C++/Python] Force package removal from arrow-nightlies conda repository
Uwe Korn created ARROW-9026: --- Summary: [C++/Python] Force package removal from arrow-nightlies conda repository Key: ARROW-9026 URL: https://issues.apache.org/jira/browse/ARROW-9026 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9024) [C++/Python] Install anaconda-client in conda-clean job
Uwe Korn created ARROW-9024: --- Summary: [C++/Python] Install anaconda-client in conda-clean job Key: ARROW-9024 URL: https://issues.apache.org/jira/browse/ARROW-9024 Project: Apache Arrow Issue Type: Bug Components: Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9023) [C++] Use mimalloc conda package
Uwe Korn created ARROW-9023: --- Summary: [C++] Use mimalloc conda package Key: ARROW-9023 URL: https://issues.apache.org/jira/browse/ARROW-9023 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4144) [Java] Arrow-to-JDBC
[ https://issues.apache.org/jira/browse/ARROW-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123484#comment-17123484 ] Uwe Korn commented on ARROW-4144: - Yes, the use case would be to write large {{pandas.DataFrames}} to a database layer that only has performant JDBC drivers. Personally, all my JDBC sources are read-only and thus I didn't write a WriteToJDBC function, but other people will also use these technologies with more access rights. I have used the "pyarrow -> Arrow Java -> JDBC" path successfully with Apache Drill and Denodo. I also heard that some people use it together with Amazon Athena, and there a performant INSERT might be interesting [https://docs.aws.amazon.com/athena/latest/ug/insert-into.html] as the JDBC driver currently seems to be the most performant option. > [Java] Arrow-to-JDBC > > > Key: ARROW-4144 > URL: https://issues.apache.org/jira/browse/ARROW-4144 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Michael Pigott >Assignee: Chen >Priority: Major > > ARROW-1780 reads a query from a JDBC data source and converts the ResultSet > to an Arrow VectorSchemaRoot. However, there is no built-in adapter for > writing an Arrow VectorSchemaRoot back to the database. > ARROW-3966 adds JDBC field metadata: > * The Catalog Name > * The Table Name > * The Field Name > * The Field Type > We can use this information to ask for the field information from the > database via the > [DatabaseMetaData|https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html] > object. We can then create INSERT or UPDATE statements based on the [list > of primary > keys|https://docs.oracle.com/javase/7/docs/api/java/sql/DatabaseMetaData.html#getPrimaryKeys(java.lang.String,%20java.lang.String,%20java.lang.String)] > in the table: > * If the value in the VectorSchemaRoot corresponding to the primary key is > NULL, insert that record into the database. 
> * If the value in the VectorSchemaRoot corresponding to the primary key is > not NULL, update the existing record in the database. > We can also perform the same data conversion in reverse based on the field > types queried from the database. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8941) [C++/Python] arrow-nightlies conda repository is full
[ https://issues.apache.org/jira/browse/ARROW-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn reassigned ARROW-8941: --- Assignee: Uwe Korn > [C++/Python] arrow-nightlies conda repository is full > - > > Key: ARROW-8941 > URL: https://issues.apache.org/jira/browse/ARROW-8941 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging, Python >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > You currently have 3 public packages and 0 packages that require to be > authenticated. > Using 10.0 GB of 3.0 GB storage > > We need a script to delete old packages, e.g. once a week? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8984) [R] Revise install guides now that Windows conda package exists
[ https://issues.apache.org/jira/browse/ARROW-8984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8984. - Resolution: Fixed Issue resolved by pull request 7303 [https://github.com/apache/arrow/pull/7303] > [R] Revise install guides now that Windows conda package exists > --- > > Key: ARROW-8984 > URL: https://issues.apache.org/jira/browse/ARROW-8984 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117942#comment-17117942 ] Uwe Korn commented on ARROW-8961: - It's already there, named {{libutf8proc}}. > [C++] Vendor utf8proc library > - > > Key: ARROW-8961 > URL: https://issues.apache.org/jira/browse/ARROW-8961 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is a minimal MIT-licensed library for UTF-8 data processing originally > developed for use in Julia > https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117446#comment-17117446 ] Uwe Korn commented on ARROW-8961: - For conda-forge and other distributions that can handle binary dependencies, we want to use the system one. So we definitely need an ARROW_USE_SYSTEM_UTF8PROC option if we vendor. > [C++] Vendor utf8proc library > - > > Key: ARROW-8961 > URL: https://issues.apache.org/jira/browse/ARROW-8961 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > This is a minimal MIT-licensed library for UTF-8 data processing originally > developed for use in Julia > https://github.com/JuliaStrings/utf8proc -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8962) [C++] Linking failure with clang-4.0
Uwe Korn created ARROW-8962: --- Summary: [C++] Linking failure with clang-4.0 Key: ARROW-8962 URL: https://issues.apache.org/jira/browse/ARROW-8962 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn {code:java} FAILED: release/arrow-file-to-stream : && /Users/uwe/miniconda3/envs/pyarrow-dev/bin/ccache /Users/uwe/miniconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++ -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -Wall -Wno-unknown-warning-option -Wno-pass-failed -msse4.2 -O3 -DNDEBUG -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -Wl,-pie -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs src/arrow/ipc/CMakeFiles/arrow-file-to-stream.dir/file_to_stream.cc.o -o release/arrow-file-to-stream release/libarrow.a /usr/local/opt/openssl@1.1/lib/libssl.dylib /usr/local/opt/openssl@1.1/lib/libcrypto.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlienc-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlidec-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlicommon-static.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liblz4.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libsnappy.1.1.7.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libz.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libzstd.dylib /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liborc.a /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libprotobuf.dylib jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a && : Undefined symbols for architecture x86_64: "arrow::internal::(anonymous namespace)::StringToFloatConverterImpl::main_junk_value_", referenced from: arrow::internal::StringToFloat(char const*, unsigned long, float*) in 
libarrow.a(value_parsing.cc.o) arrow::internal::StringToFloat(char const*, unsigned long, double*) in libarrow.a(value_parsing.cc.o) "arrow::internal::(anonymous namespace)::StringToFloatConverterImpl::fallback_junk_value_", referenced from: arrow::internal::StringToFloat(char const*, unsigned long, float*) in libarrow.a(value_parsing.cc.o) arrow::internal::StringToFloat(char const*, unsigned long, double*) in libarrow.a(value_parsing.cc.o) ld: symbol(s) not found for architecture x86_64 clang-4.0: error: linker command failed with exit code 1 (use -v to see invocation) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8941) [C++/Python] arrow-nightlies conda repository is full
Uwe Korn created ARROW-8941: --- Summary: [C++/Python] arrow-nightlies conda repository is full Key: ARROW-8941 URL: https://issues.apache.org/jira/browse/ARROW-8941 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging, Python Reporter: Uwe Korn You currently have 3 public packages and 0 packages that require to be authenticated. Using 10.0 GB of 3.0 GB storage We need a script to delete old packages, e.g. once a week? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108087#comment-17108087 ] Uwe Korn edited comment on ARROW-8810 at 5/15/20, 8:47 AM: --- Generally, you should see Parquet files as immutable. If you want to change their contents, it is almost always simpler and faster to just rewrite them completely or (much better) just write a second file and treat a directory of Parquet files as a single dataset. This comes down to two major properties: * Values in a Parquet file are encoded and compressed. Thus they don't adhere to a fixed size per row/value; in some cases a column chunk of a million values may be stored in just 64 bytes. * The metadata that contains all essential information, e.g. where row groups start and what schema the data has, is stored at the end of the file (i.e. the footer). Especially the last four bytes are needed as they indicate the start position of the footer. Technically, you could still write code that appends to an existing Parquet file, but this has the following drawbacks: * Writing wouldn't be faster than writing to a second, separate file. It would probably be even slower as we need to deserialize the existing metadata and serialize it again with only slight modifications. * Reading wouldn't be faster than reading from a second file, even when doing it sequentially. * While appending to a Parquet file, the file would be unreadable. * If your process crashes during the write, all existing data in the Parquet file will be lost. * It will give users the impression that you could efficiently insert row-by-row into a file. With a columnar data format that can only leverage its techniques on large chunks of rows, this would generate a massive overhead. Still, if one were to implement this, it would work as follows: # Read in the footer/metadata of the existing file. # Seek to the start position of the existing footer and overwrite it with the new data. 
# Merge (or rather concat) the existing metadata with the newly computed metadata and write it at the end of the file. If you would take a look at how a completely fresh Parquet file would be written, this is identical except that we wouldn't need to read in and overwrite any existing metadata. With newer Arrow releases, there will be better support for Parquet datasets in R, I'll leave this to [~npr] or [~jorisvandenbossche] to link to the right docs. was (Author: xhochy): Generally, you should see Parquet files as immutable. If you want to change its contents, it is almost always simpler and faster to just rewrite them completely or (much better) just write a second file and treat a directory of Parquet files as a single dataset. This comes down to two major properties: * Values in a Parquet file are encoded and compressed. Thus they don't adhere to a fixed size per row/value but in some cases a column chunk of a million values may be stored in just 64 bytes. * The metadata that contains all essential information, e.g. where row groups start, what schema the data is, is stored at the end of the file (i.e. the footer). Especially the last four bytes are needed as they indicate the start position of the footer. Technically, you could still write code that appends to an existing Parquet file but this has the drawbacks that: * Writing wouldn't be faster than writing to a second, separate file. It would probably be even slower as we need to deserialize the existing metadata and serialize it again only with slight modifications. * Reading wouldn't be faster than reading from a second file, even when doing it sequentially. * While append to a Parquet file, the file would be unreadable. * If your process crashes during write, all existing data in the Parquet file will be lost. * It will give the users the impression that you could efficiently insert row-by-row to a file. 
With a columnar data format that can only leverage its techniques on large chunks of rows, this would generate a massive overhead. Still if one would try to implement this, it would work as follows: # Read in the footer/metadata of the existing file. # Seek to the start position of the existing footer and overwrite it with the new data. # Merge (or rather concat) the existing metadata with the newly computed metadata and write it at the end of the file. If you would take a look at how a completely fresh Parquet file would be written, this is identical except that we wouldn't need to read in and overwrite any existing metadata. > Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new
[jira] [Commented] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108087#comment-17108087 ] Uwe Korn commented on ARROW-8810: - Generally, you should see Parquet files as immutable. If you want to change their contents, it is almost always simpler and faster to just rewrite them completely or (much better) just write a second file and treat a directory of Parquet files as a single dataset. This comes down to two major properties: * Values in a Parquet file are encoded and compressed. Thus they don't adhere to a fixed size per row/value; in some cases a column chunk of a million values may be stored in just 64 bytes. * The metadata that contains all essential information, e.g. where row groups start and what schema the data has, is stored at the end of the file (i.e. the footer). Especially the last four bytes are needed as they indicate the start position of the footer. Technically, you could still write code that appends to an existing Parquet file, but this has the following drawbacks: * Writing wouldn't be faster than writing to a second, separate file. It would probably be even slower as we need to deserialize the existing metadata and serialize it again with only slight modifications. * Reading wouldn't be faster than reading from a second file, even when doing it sequentially. * While appending to a Parquet file, the file would be unreadable. * If your process crashes during the write, all existing data in the Parquet file will be lost. * It will give users the impression that you could efficiently insert row-by-row into a file. With a columnar data format that can only leverage its techniques on large chunks of rows, this would generate a massive overhead. Still, if one were to implement this, it would work as follows: # Read in the footer/metadata of the existing file. # Seek to the start position of the existing footer and overwrite it with the new data. 
# Merge (or rather concat) the existing metadata with the newly computed metadata and write it at the end of the file. If you would take a look at how a completely fresh Parquet file would be written, this is identical except that we wouldn't need to read in and overwrite any existing metadata. > Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
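The trailer layout described in the comment above can be sketched in a few lines. This is a stdlib-only illustration with a hypothetical helper name (`parquet_footer_offset`) and a synthetic byte string standing in for a real file; an actual footer is Thrift-encoded FileMetaData, which this sketch does not decode.

```python
import struct

def parquet_footer_offset(buf: bytes) -> int:
    # A Parquet file ends with the serialized footer metadata, followed by
    # a 4-byte little-endian footer length and the magic bytes "PAR1".
    # Appending in place would mean overwriting exactly this trailer.
    if buf[-4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return len(buf) - 8 - footer_len

# Synthetic stand-in for a file: header magic, 20 bytes of "data",
# a 10-byte fake footer, the footer length, and the trailing magic.
fake = b"PAR1" + b"D" * 20 + b"F" * 10 + struct.pack("<I", 10) + b"PAR1"
print(parquet_footer_offset(fake))  # 24, where the fake footer starts
```

This also makes the read path concrete: a reader must first seek to the end of the file to locate the footer, which is why a half-appended file is unreadable.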
[jira] [Commented] (ARROW-8638) Arrow Cython API Usage Gives an error when calling CTable API Endpoints
[ https://issues.apache.org/jira/browse/ARROW-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096493#comment-17096493 ] Uwe Korn commented on ARROW-8638: - You either need to extend the environment variable `LD_LIBRARY_PATH` to point to the directory where `libarrow.so.16` resides or (a bit more complicated in setup.py but the preferred approach) set the RPATH on the generated `example.so` Python module to also include the directory where `libarrow.so.16` resides; see turbodbc for an example: https://github.com/blue-yonder/turbodbc/blob/8e2db0d0a26b620ad3e687e56a88fdab3117e09c/setup.py#L186-L189 > Arrow Cython API Usage Gives an error when calling CTable API Endpoints > --- > > Key: ARROW-8638 > URL: https://issues.apache.org/jira/browse/ARROW-8638 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.16.0 > Environment: Ubuntu 20.04 with Python 3.8.2 > RHEL7 with Python 3.6.8 >Reporter: Vibhatha Lakmal Abeykoon >Priority: Blocker > Fix For: 0.16.0 > > > I am working on using both Arrow C++ API and Cython API to support an > application that I am developing. But here, I will add the issue I > experienced when I am trying to follow the example, > [https://arrow.apache.org/docs/python/extending.html] > I am testing on Ubuntu 20.04 LTS > Python version 3.8.2 > These are the steps I followed. > # Create Virtualenv > python3 -m venv ENVARROW > > 2. Activate ENV > source ENVARROW/bin/activate > > 3. pip3 install pyarrow==0.16.0 cython numpy > > 4. 
Code block and Tools, > > +*example.pyx*+ > > > {code:java} > from pyarrow.lib cimport * > def get_array_length(obj): > # Just an example function accessing both the pyarrow Cython API > # and the Arrow C++ API > cdef shared_ptr[CArray] arr = pyarrow_unwrap_array(obj) > if arr.get() == NULL: > raise TypeError("not an array") > return arr.get().length() > def get_table_info(obj): > cdef shared_ptr[CTable] table = pyarrow_unwrap_table(obj) > if table.get() == NULL: > raise TypeError("not an table") > > return table.get().num_columns() > {code} > > > +*setup.py*+ > > > {code:java} > from distutils.core import setup > from Cython.Build import cythonize > import os > import numpy as np > import pyarrow as pa > ext_modules = cythonize("example.pyx") > for ext in ext_modules: > # The Numpy C headers are currently required > ext.include_dirs.append(np.get_include()) > ext.include_dirs.append(pa.get_include()) > ext.libraries.extend(pa.get_libraries()) > ext.library_dirs.extend(pa.get_library_dirs()) > if os.name == 'posix': > ext.extra_compile_args.append('-std=c++11') > # Try uncommenting the following line on Linux > # if you get weird linker errors or runtime crashes > #ext.define_macros.append(("_GLIBCXX_USE_CXX11_ABI", "0")) > setup(ext_modules=ext_modules) > {code} > > > +*arrow_array.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > arr = pa.array([1,2,3,4,5]) > len = example.get_array_length(arr) > print("Array length {} ".format(len)) > {code} > > +*arrow_table.py*+ > > {code:java} > import example > import pyarrow as pa > import numpy as np > from pyarrow import csv > fn = 'data.csv' > table = csv.read_csv(fn) > print(table) > cols = example.get_table_info(table) > print(cols) > > {code} > +*data.csv*+ > {code:java} > 1,2,3,4,5 > 6,7,8,9,10 > 11,12,13,14,15 > {code} > > +*Makefile*+ > > {code:java} > install: > python3 setup.py build_ext --inplace > clean: > rm -R *.so build *.cpp > {code} > > **When I try to run either of the 
python example scripts arrow_table.py or > arrow_array.py, > I get the following error. > > {code:java} > File "arrow_array.py", line 1, in > import example > ImportError: libarrow.so.16: cannot open shared object file: No such file or > directory > {code} > > > *Note: I also checked this on RHEL7 with Python 3.6.8, I got a similar > response.* > > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
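The RPATH fix suggested in the comment above can be sketched as extra linker arguments added in setup.py. `rpath_link_args` is a hypothetical helper name and the directory is a placeholder; the flag syntax is the GNU ld one used in the linked turbodbc setup.py.

```python
def rpath_link_args(library_dirs):
    # Embed each library directory into the extension module's RPATH so
    # the dynamic loader can find libarrow.so.16 at import time without
    # requiring LD_LIBRARY_PATH to be set in the environment.
    return ["-Wl,-rpath,{}".format(d) for d in library_dirs]

# In setup.py one would extend the Extension object, roughly:
#   ext.extra_link_args.extend(rpath_link_args(pa.get_library_dirs()))
print(rpath_link_args(["/opt/arrow/lib"]))  # ['-Wl,-rpath,/opt/arrow/lib']
```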
[jira] [Updated] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
[ https://issues.apache.org/jira/browse/ARROW-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn updated ARROW-8571: Description: conda-forge did the switch, so we should follow this. > [C++] Switch AppVeyor image to VS 2017 > -- > > Key: ARROW-8571 > URL: https://issues.apache.org/jira/browse/ARROW-8571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > > conda-forge did the switch, so we should follow this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8571) [C++] Switch AppVeyor image to VS 2017
Uwe Korn created ARROW-8571: --- Summary: [C++] Switch AppVeyor image to VS 2017 Key: ARROW-8571 URL: https://issues.apache.org/jira/browse/ARROW-8571 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8395) [Python] conda install pyarrow defaults to 0.11.1 not 0.16.0
[ https://issues.apache.org/jira/browse/ARROW-8395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17081654#comment-17081654 ] Uwe Korn commented on ARROW-8395: - What does a clean conda environment mean? A clean conda environment would have no packages in it but I expect that you mean that you have a full anaconda environment here. In that case, this won't work as you cannot mix packages between anaconda/defaults and conda-forge. Can you use {{conda create -n test pyarrow}} instead? > [Python] conda install pyarrow defaults to 0.11.1 not 0.16.0 > > > Key: ARROW-8395 > URL: https://issues.apache.org/jira/browse/ARROW-8395 > Project: Apache Arrow > Issue Type: Improvement > Components: Python > Environment: ubuntu 16, ubuntu 18, anaconda 2020.02 x64 >Reporter: dwang >Priority: Major > Labels: conda, conda-forge, install, pyarrow, python,, version > > When install pyarrow in clean linux conda environment (2020.02): > {code:java} > conda install -c conda-forge pyarrow > The following packages will be downloaded:package| > build > ---|- > arrow-cpp-0.11.1 |py37h0e61e49_1004 6.3 MB conda-forge > boost-cpp-1.68.0 |h11c811c_100020.5 MB conda-forge > conda-4.8.3| py37hc8dfbb8_1 3.0 MB conda-forge > libprotobuf-3.6.1 |hdbcaa40_1001 4.0 MB conda-forge > parquet-cpp-1.5.1 |3 3 KB conda-forge > pyarrow-0.11.1 |py37hbbcf98d_1002 2.0 MB conda-forge > python_abi-3.7 | 1_cp37m 4 KB conda-forge > thrift-cpp-0.12.0 |h0a07b25_1002 2.4 MB conda-forge > >Total:38.2 MB > {code} > The default version is pyarrow-0.11.1, while conda repo actually has the > latest version 0.16.0 ( [https://anaconda.org/conda-forge/pyarrow] ). 
> > Specify the version does not help: > conda install -c conda-forge pyarrow=0.16.0 > > > Workaround: > I have to manually download below packages from conda then install them > locally: > arrow-cpp-0.16.0-py37hb0edad2_0.tar.bz2 > aws-sdk-cpp-1.7.164-h1f8afcc_0.tar.bz2 > boost-cpp-1.70.0-h8e57a91_2.tar.bz2 > brotli-1.0.7-he1b5a44_1000.tar.bz2 > c-ares-1.15.0-h516909a_1001.tar.bz2 > gflags-2.2.2-he1b5a44_1002.tar.bz2 > glog-0.4.0-he1b5a44_1.tar.bz2 > grpc-cpp-1.25.0-h213be95_2.tar.bz2 > libprotobuf-3.11.3-h8b12597_0.tar.bz2 > lz4-c-1.8.3-he1b5a44_1001.tar.bz2 > parquet-cpp-1.5.1-1.tar.bz2 > pyarrow-0.16.0-py37h8b68381_1.tar.bz2 > re2-2020.01.01-he1b5a44_0.tar.bz2 > snappy-1.1.8-he1b5a44_1.tar.bz2 > thrift-cpp-0.12.0-hf3afdfd_1004.tar.bz2 > zstd-1.4.4-h3b9ef0a_1.tar.bz2 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes
[ https://issues.apache.org/jira/browse/ARROW-8359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077154#comment-17077154 ] Uwe Korn commented on ARROW-8359: - [~kszucs] These builds are running on travis.*com* and drone.io, do we already have support for them in crossbow? > [C++/Python] Enable aarch64/ppc64le build in conda recipes > -- > > Key: ARROW-8359 > URL: https://issues.apache.org/jira/browse/ARROW-8359 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging, Python >Reporter: Uwe Korn >Priority: Major > Fix For: 0.17.0 > > > These two new arches were added in the conda recipes, we should also build > them as nightlies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8350) [Python] Implement to_numpy on ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8350. - Resolution: Invalid We already support the {{__array__}} protocol and get the right output there, so this is not needed. > [Python] Implement to_numpy on ChunkedArray > --- > > Key: ARROW-8350 > URL: https://issues.apache.org/jira/browse/ARROW-8350 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe Korn >Priority: Major > > We support {{to_numpy}} on Array instances but not on {{ChunkedArray}} > instances. It would be quite useful to have it also there to support > returning e.g. non-nanosecond datetime instances. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8359) [C++/Python] Enable aarch64/ppc64le build in conda recipes
Uwe Korn created ARROW-8359: --- Summary: [C++/Python] Enable aarch64/ppc64le build in conda recipes Key: ARROW-8359 URL: https://issues.apache.org/jira/browse/ARROW-8359 Project: Apache Arrow Issue Type: Improvement Components: C++, Packaging, Python Reporter: Uwe Korn Fix For: 0.17.0 These two new arches were added in the conda recipes, we should also build them as nightlies. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8149) [C++/Python] Enable CUDA Support in conda recipes
[ https://issues.apache.org/jira/browse/ARROW-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076921#comment-17076921 ] Uwe Korn commented on ARROW-8149: - Yes, PRs are open but there is still discussion. > [C++/Python] Enable CUDA Support in conda recipes > - > > Key: ARROW-8149 > URL: https://issues.apache.org/jira/browse/ARROW-8149 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Packaging >Reporter: Uwe Korn >Priority: Major > Fix For: 0.17.0 > > > See the changes in > [https://github.com/conda-forge/arrow-cpp-feedstock/pull/123], we need to > copy this into the Arrow repository and also test CUDA in these recipes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8350) [Python] Implement to_numpy on ChunkedArray
Uwe Korn created ARROW-8350: --- Summary: [Python] Implement to_numpy on ChunkedArray Key: ARROW-8350 URL: https://issues.apache.org/jira/browse/ARROW-8350 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn We support {{to_numpy}} on Array instances but not on {{ChunkedArray}} instances. It would be quite useful to have it also there to support returning e.g. non-nanosecond datetime instances. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8288) [Python] Expose with_ modifiers on DataType
Uwe Korn created ARROW-8288: --- Summary: [Python] Expose with_ modifiers on DataType Key: ARROW-8288 URL: https://issues.apache.org/jira/browse/ARROW-8288 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 0.17.0 We have several {{WithX}} functions defined on {{DataType}} in C++ but only {{WithMetadata}} is yet exposed in Python. We should expose the rest of them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8285) [Python][Dataset] ScalarExpression doesn't accept numpy scalars
Uwe Korn created ARROW-8285: --- Summary: [Python][Dataset] ScalarExpression doesn't accept numpy scalars Key: ARROW-8285 URL: https://issues.apache.org/jira/browse/ARROW-8285 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn {{pyarrow.dataset.ScalarExpression}} doesn't accept numpy scalars. This would be useful as values coming out of {{pandas}} or {{numpy}} are such. Example: {code:java} import pyarrow.dataset as ds import numpy as np ds.ScalarExpression(np.int64(2)){code} {code:java} --- TypeError Traceback (most recent call last) in > 1 ds.ScalarExpression(np.int64(2)) ~/miniconda3/envs/kartothek/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ScalarExpression.__init__() TypeError: Not yet supported scalar value: 2 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8284) [C++][Dataset] Schema evolution for timestamp columns
Uwe Korn created ARROW-8284: --- Summary: [C++][Dataset] Schema evolution for timestamp columns Key: ARROW-8284 URL: https://issues.apache.org/jira/browse/ARROW-8284 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8283) [C++/Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset
Uwe Korn created ARROW-8283: --- Summary: [C++/Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset Key: ARROW-8283 URL: https://issues.apache.org/jira/browse/ARROW-8283 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Uwe Korn When passing a list of files to the constructor of {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are silently dropped immediately (i.e. no fragments are created for them). Instead, I would expect that fragments are created for them and an error is thrown when one tries to read the fragment with the non-existent file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8282) [C++/Python][Dataset] Support schema evolution for integer columns
Uwe Korn created ARROW-8282: --- Summary: [C++/Python][Dataset] Support schema evolution for integer columns Key: ARROW-8282 URL: https://issues.apache.org/jira/browse/ARROW-8282 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Uwe Korn When reading in a dataset where the schema specifies that column X is of type {{int64}} but the partition actually contains the data stored in that columns as {{int32}}, an upcast should be done. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8281) [R] Name collision of arrow.dll on Windows
Uwe Korn created ARROW-8281: --- Summary: [R] Name collision of arrow.dll on Windows Key: ARROW-8281 URL: https://issues.apache.org/jira/browse/ARROW-8281 Project: Apache Arrow Issue Type: Improvement Components: Packaging, R Affects Versions: 0.16.0 Reporter: Uwe Korn Currently we build the R extension for Windows only for CRAN with static linkage. For conda-forge, though, we want to build it with dynamic linkage to {{arrow-cpp}}. Here we run into the issue that the R package as well as the C++ package produces an {{arrow.dll}}. As there is no RPATH equivalent on Windows, the dynamic loader cannot find the right relationship between the two and fails to load the library. From my point of view, the simplest approach here would be to name the R {{arrow.dll}} differently, e.g. {{rarrow.dll}}. Would this be possible? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge
[ https://issues.apache.org/jira/browse/ARROW-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8148. - Resolution: Fixed > [Packaging][C++] Add google-cloud-cpp to conda-forge > > > Key: ARROW-8148 > URL: https://issues.apache.org/jira/browse/ARROW-8148 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging >Reporter: Wes McKinney >Assignee: Uwe Korn >Priority: Major > > This is a requirement for ARROW-1231 to be able to move forward -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files
[ https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067856#comment-17067856 ] Uwe Korn commented on ARROW-5176: - Would be very happy with that! > [Python] Automate formatting of python files > > > Key: ARROW-5176 > URL: https://issues.apache.org/jira/browse/ARROW-5176 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Ben Kietzman >Priority: Minor > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > > [Black](https://github.com/ambv/black) is a tool for automatically formatting > python code in ways which flake8 and our other linters approve of. Adding it > to the project will allow more reliably formatted python code and fill a > similar role to {{clang-format}} for c++ and {{cmake-format}} for cmake -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8223) [Python] Schema.from_pandas breaks with pandas nullable integer dtype
[ https://issues.apache.org/jira/browse/ARROW-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-8223. - Fix Version/s: 0.17.0 Assignee: Uwe Korn Resolution: Duplicate I fixed this recently in master. [~wesm] I maintain it, it simply works and thus doesn't need that much love except for the recent {{ExtensionArray}} fix. > [Python] Schema.from_pandas breaks with pandas nullable integer dtype > - > > Key: ARROW-8223 > URL: https://issues.apache.org/jira/browse/ARROW-8223 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0, 0.16.0, 0.15.1 > Environment: pyarrow 0.16 >Reporter: Ged Steponavicius >Assignee: Uwe Korn >Priority: Minor > Labels: easyfix > Fix For: 0.17.0 > > > > {code:java} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame([{'int_col':1}, > {'int_col':2}]) > df['int_col'] = df['int_col'].astype(pd.Int64Dtype()) > schema = pa.Schema.from_pandas(df) > {code} > produces ArrowTypeError: Did not pass numpy.dtype object > > However, this works fine > {code:java} > schema = pa.Table.from_pandas(df).schema{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge
[ https://issues.apache.org/jira/browse/ARROW-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17067503#comment-17067503 ] Uwe Korn commented on ARROW-8148: - PR: [https://github.com/conda-forge/staged-recipes/pull/11134] > [Packaging][C++] Add google-cloud-cpp to conda-forge > > > Key: ARROW-8148 > URL: https://issues.apache.org/jira/browse/ARROW-8148 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging >Reporter: Wes McKinney >Assignee: Uwe Korn >Priority: Major > > This is a requirement for ARROW-1231 to be able to move forward -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge
[ https://issues.apache.org/jira/browse/ARROW-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066707#comment-17066707 ] Uwe Korn commented on ARROW-8148: - This is more than a single package; we need at least [https://github.com/google/crc32c], [https://github.com/googleapis/cpp-cmakefiles], [https://github.com/googleapis/google-cloud-cpp-common] and [https://github.com/googleapis/google-cloud-cpp]. Along the way I discovered that we are only building static gRPC libs in conda-forge whereas we only want shared libraries there: [https://github.com/conda-forge/grpc-cpp-feedstock/pull/53] > [Packaging][C++] Add google-cloud-cpp to conda-forge > > > Key: ARROW-8148 > URL: https://issues.apache.org/jira/browse/ARROW-8148 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging >Reporter: Wes McKinney >Assignee: Uwe Korn >Priority: Major > > This is a requirement for ARROW-1231 to be able to move forward -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-8148) [Packaging][C++] Add google-cloud-cpp to conda-forge
[ https://issues.apache.org/jira/browse/ARROW-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn reassigned ARROW-8148: --- Assignee: Uwe Korn > [Packaging][C++] Add google-cloud-cpp to conda-forge > > > Key: ARROW-8148 > URL: https://issues.apache.org/jira/browse/ARROW-8148 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Packaging >Reporter: Wes McKinney >Assignee: Uwe Korn >Priority: Major > > This is a requirement for ARROW-1231 to be able to move forward -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7816) [Integration] Turbodbc fails to compile in the nightly tests
[ https://issues.apache.org/jira/browse/ARROW-7816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-7816. - Assignee: Kouhei Sutou Resolution: Fixed This has been resolved in the meantime. > [Integration] Turbodbc fails to compile in the nightly tests > > > Key: ARROW-7816 > URL: https://issues.apache.org/jira/browse/ARROW-7816 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Krisztian Szucs >Assignee: Kouhei Sutou >Priority: Major > > Failing builds: > - > https://circleci.com/gh/ursa-labs/crossbow/8035?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link > - > https://circleci.com/gh/ursa-labs/crossbow/8035?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7871) [Python] Expose more compute kernels
[ https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065811#comment-17065811 ] Uwe Korn commented on ARROW-7871: - I would vouch for exposing more kernels there instead of removing them. The main intent of having this module is to have all kernels in a clearly defined namespace which is not the top-level {{pyarrow}} one. You cannot use the {{pyarrow.compute}} module in the Array methods as this would introduce a cyclic dependency, but you can directly call the C++ methods there. > [Python] Expose more compute kernels > > > Key: ARROW-7871 > URL: https://issues.apache.org/jira/browse/ARROW-7871 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Krisztian Szucs >Priority: Major > > Currently only the sum kernel is exposed. > Or consider to deprecate/remove the pyarrow.compute module, and bind the > compute kernels as methods instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-5074) [C++/Python] When installing into a SYSTEM prefix, RPATHs are not correctly set
[ https://issues.apache.org/jira/browse/ARROW-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-5074. - Resolution: Cannot Reproduce In my local build this seems to work fine. > [C++/Python] When installing into a SYSTEM prefix, RPATHs are not correctly > set > --- > > Key: ARROW-5074 > URL: https://issues.apache.org/jira/browse/ARROW-5074 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Packaging, Python >Reporter: Uwe Korn >Priority: Major > > When installing the Arrow libraries into a system with a prefix (mostly a > conda env), the RPATHs are not correctly set by CMake (there is no RPATH). > Thus we need to use {{LD_LIBRARY_PATH}} in consumers. When packages are built > using {{conda-build}}, this takes cares of that in its post-processing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3391) [Python] Support \0 characters in binary Parquet predicate values
[ https://issues.apache.org/jira/browse/ARROW-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065798#comment-17065798 ] Uwe Korn commented on ARROW-3391: - Have a look at the failing tests in [https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L1646-L1654] My problem is that I have a binary column with UUIDs (low entropy), there can be a zero-byte at any position inside the ID. When I now filter on this ID, e.g. "a\0dfsgjzdsaf" there were some steps that converted the value to C-style strings and thus in turn to a simple "a" instead of the whole identifier. > [Python] Support \0 characters in binary Parquet predicate values > - > > Key: ARROW-3391 > URL: https://issues.apache.org/jira/browse/ARROW-3391 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe Korn >Priority: Major > Labels: dataset, dataset-parquet-read, parquet > > As we convert the predicate values of a Parquet filter in some intermediate > steps to C-style strings, we currently disallow the use of binary and string > predicate values that contain {{\0}} bytes as they would otherwise result in > wrong results. -- This message was sent by Atlassian Jira (v8.3.4#803005)
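The truncation described in the comment above can be reproduced with a plain C-style string round-trip. This is a stdlib-only sketch; `ctypes.c_char_p` merely stands in for whatever intermediate NUL-terminated conversion the Parquet filter code performed.

```python
import ctypes

value = b"a\x00dfsgjzdsaf"
# Reading the bytes back as a NUL-terminated C string stops at the first
# zero byte, so the filter value degrades from the full ID to just b"a".
truncated = ctypes.c_char_p(value).value
print(len(value), truncated)  # 12 b'a'
```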
[jira] [Commented] (ARROW-3054) [Packaging] Tooling to enable nightly conda packages to be updated to some anaconda.org channel
[ https://issues.apache.org/jira/browse/ARROW-3054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065781#comment-17065781 ] Uwe Korn commented on ARROW-3054: - [~kszucs] Can you link the correct ticket here and close this? > [Packaging] Tooling to enable nightly conda packages to be updated to some > anaconda.org channel > --- > > Key: ARROW-3054 > URL: https://issues.apache.org/jira/browse/ARROW-3054 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Affects Versions: 0.10.0 >Reporter: Phillip Cloud >Assignee: Krisztian Szucs >Priority: Major > Labels: conda > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8175) [Python] Setup type checking with mypy
Uwe Korn created ARROW-8175: --- Summary: [Python] Setup type checking with mypy Key: ARROW-8175 URL: https://issues.apache.org/jira/browse/ARROW-8175 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Python Reporter: Uwe Korn Assignee: Uwe Korn Get mypy checks running, activate things like {{check_untyped_defs}} later. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8174) [Python] Refactor context_choices in test_cuda_numba_interop to be a module level fixture
Uwe Korn created ARROW-8174: --- Summary: [Python] Refactor context_choices in test_cuda_numba_interop to be a module level fixture Key: ARROW-8174 URL: https://issues.apache.org/jira/browse/ARROW-8174 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn Instead of being a global variable that is set/unset in setup_module/teardown_module -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8159) [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype
Uwe Korn created ARROW-8159: --- Summary: [Python] pyarrow.Schema.from_pandas doesn't support ExtensionDtype Key: ARROW-8159 URL: https://issues.apache.org/jira/browse/ARROW-8159 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Uwe Korn Assignee: Uwe Korn Fix For: 0.17.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8149) [C++/Python] Enable CUDA Support in conda recipes
Uwe Korn created ARROW-8149: --- Summary: [C++/Python] Enable CUDA Support in conda recipes Key: ARROW-8149 URL: https://issues.apache.org/jira/browse/ARROW-8149 Project: Apache Arrow Issue Type: New Feature Components: C++, Packaging Reporter: Uwe Korn Fix For: 0.17.0 See the changes in [https://github.com/conda-forge/arrow-cpp-feedstock/pull/123], we need to copy this into the Arrow repository and also test CUDA in these recipes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-5265) [Python/CI] Add integration test with kartothek
[ https://issues.apache.org/jira/browse/ARROW-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn reassigned ARROW-5265: --- Assignee: Uwe Korn > [Python/CI] Add integration test with kartothek > --- > > Key: ARROW-5265 > URL: https://issues.apache.org/jira/browse/ARROW-5265 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, Python >Reporter: Uwe Korn >Assignee: Uwe Korn >Priority: Major > Labels: parquet > > https://github.com/JDASoftwareGroup/kartothek is a heavy user of Apache Arrow > and thus a good indicator whether we have introduced some breakages in > {{pyarrow}}. Thus we should run regular integration tests against it as we do > with other libraries. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8008) [C++/Python] Framework Python is preferred even though not the activated one
Uwe Korn created ARROW-8008: --- Summary: [C++/Python] Framework Python is preferred even though not the activated one Key: ARROW-8008 URL: https://issues.apache.org/jira/browse/ARROW-8008 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Uwe Korn Assignee: Uwe Korn Currently the framework Python is preferred on macOS even though development happens in a completely different Python runtime. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8007) [Python] Remove unused and defunct assert_get_object_equal in plasma tests
Uwe Korn created ARROW-8007: --- Summary: [Python] Remove unused and defunct assert_get_object_equal in plasma tests Key: ARROW-8007 URL: https://issues.apache.org/jira/browse/ARROW-8007 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Uwe Korn Assignee: Uwe Korn -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6766) [Python] libarrow_python..dylib does not exist
[ https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-6766. - Resolution: Cannot Reproduce
> [Python] libarrow_python..dylib does not exist
> --
>
> Key: ARROW-6766
> URL: https://issues.apache.org/jira/browse/ARROW-6766
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.0, 0.15.0
> Reporter: Tarek Allam
> Priority: Major
>
> After following the instructions found in the developer guide for Python, I was able to build fine by using:
> {code:bash}
> # Assuming immediately prior one has run:
> # $ git clone g...@github.com:apache/arrow.git
> # $ conda create -y -n pyarrow-dev -c conda-forge \
> #     --file arrow/ci/conda_env_unix.yml \
> #     --file arrow/ci/conda_env_cpp.yml \
> #     --file arrow/ci/conda_env_python.yml \
> #     compilers \
> #     python=3.7
> # $ conda activate pyarrow-dev
> # $ brew update && brew bundle --file=arrow/cpp/Brewfile
> export ARROW_HOME=$(pwd)/arrow/dist
> export LD_LIBRARY_PATH=$(pwd)/arrow/dist/lib:$LD_LIBRARY_PATH
> export CC=`which clang`
> export CXX=`which clang++`
> mkdir arrow/cpp/build
> pushd arrow/cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>       -DCMAKE_INSTALL_LIBDIR=lib \
>       -DARROW_FLIGHT=OFF \
>       -DARROW_GANDIVA=OFF \
>       -DARROW_ORC=ON \
>       -DARROW_PARQUET=ON \
>       -DARROW_PYTHON=ON \
>       -DARROW_PLASMA=ON \
>       -DARROW_BUILD_TESTS=ON \
>       ..
> make -j4
> make install
> popd
> {code}
> But when I run:
> {code:bash}
> pushd arrow/python
> export PYARROW_WITH_FLIGHT=0
> export PYARROW_WITH_GANDIVA=0
> export PYARROW_WITH_ORC=1
> export PYARROW_WITH_PARQUET=1
> python setup.py build_ext --inplace
> popd
> {code}
> I get the following errors:
> {code}
> -- Build output directory: /Users/tallamjr/Github/arrow/python/build/temp.macosx-10.9-x86_64-3.7/release
> -- Found the Arrow core library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow.dylib
> -- Found the Arrow Python library: /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python.dylib
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.
> ...
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow..dylib does not exist.
> CMake Error at CMakeLists.txt:230 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:315 (bundle_arrow_lib)
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.
> CMake Error at CMakeLists.txt:226 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:320 (bundle_arrow_lib)
> CMake Error: File /usr/local/anaconda3/envs/pyarrow-dev/lib/libarrow_python..dylib does not exist.
> CMake Error at CMakeLists.txt:230 (configure_file):
>   configure_file Problem configuring file
> Call Stack (most recent call first):
>   CMakeLists.txt:320 (bundle_arrow_lib)
> {code}
> What is quite strange is that the libraries do seem to be there, but they have an additional version component, e.g. `libarrow.15.dylib`:
> {code}
> $ ls -l libarrow_python.15.dylib && echo $PWD
> lrwxr-xr-x 1 tallamjr staff 28 Oct 2 14:02 libarrow_python.15.dylib -> libarrow_python.15.0.0.dylib
> /Users/tallamjr/github/arrow/dist/lib
> {code}
> I am not exactly sure what the issue is, but it appears that the version is not captured in a variable used by CMake. I have run the same setup on `master` (`7d18c1c`) and on `apache-arrow-0.14.0` (`a591d76`), which both produce the same errors.
> Apologies if this is not quite the format for JIRA issues here, or perhaps if it's not the correct platform for this; I'm very new to the project and to contributing to Apache in general. Thanks
-- This message was sent by Atlassian Jira (v8.3.4#803005)
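The doubled dot in `libarrow_python..dylib` suggests that the SO-version variable CMake substitutes into the configured library name was empty when `configure_file` ran. As a rough illustration only (the `make_dylib_name` helper is hypothetical, not Arrow's actual build logic), substituting an empty version into a `lib<name>.<version>.dylib` template reproduces exactly the malformed name from the report:

```python
# Sketch of how an empty version variable yields "libarrow_python..dylib".
# The template mirrors the macOS versioned-dylib naming convention;
# make_dylib_name is a hypothetical helper, not part of Arrow's CMake files.

def make_dylib_name(base: str, so_version: str) -> str:
    """Build a macOS shared-library name like libarrow_python.15.dylib."""
    return f"lib{base}.{so_version}.dylib"

# With the SO version detected correctly:
print(make_dylib_name("arrow_python", "15"))  # libarrow_python.15.dylib

# If version detection fails and the variable is empty, the doubled dot
# from the error message appears:
print(make_dylib_name("arrow_python", ""))    # libarrow_python..dylib
```

This matches the reporter's observation: the libraries on disk carry a version component (`libarrow_python.15.dylib`), while CMake looked for a name built from an empty version string.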
[jira] [Commented] (ARROW-6766) [Python] libarrow_python..dylib does not exist
[ https://issues.apache.org/jira/browse/ARROW-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045812#comment-17045812 ] Uwe Korn commented on ARROW-6766: - Thanks for revisiting this! > [Python] libarrow_python..dylib does not exist -- This message was sent by Atlassian Jira (v8.3.4#803005)