[jira] [Created] (ARROW-18284) CMake cannot find package configuration file in CMAKE_MODULE_PATH
ThisName created ARROW-18284: Summary: CMake cannot find package configuration file in CMAKE_MODULE_PATH Key: ARROW-18284 URL: https://issues.apache.org/jira/browse/ARROW-18284 Project: Apache Arrow Issue Type: Bug Reporter: ThisName Hi, I am hitting exactly the same issue as described here: [https://github.com/apache/arrow/pull/14586] This happens to some people when they try to install pyarrow from pip under Windows. Since opening a PR on GitHub alone might not get any attention, I am opening an issue here as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18283) [R] Update Arrow for R cheatsheet to include GCS
Stephanie Hazlitt created ARROW-18283: - Summary: [R] Update Arrow for R cheatsheet to include GCS Key: ARROW-18283 URL: https://issues.apache.org/jira/browse/ARROW-18283 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Reporter: Stephanie Hazlitt The Arrow for R cheatsheet was released in 8.0.0. It could use an update to highlight new features released since then, for example reading and writing to Google Cloud Storage (in addition to S3). https://github.com/apache/arrow/tree/master/r/cheatsheet -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18282) [C++][Python] Support step slicing in list_slice kernel
Miles Granger created ARROW-18282: - Summary: [C++][Python] Support step slicing in list_slice kernel Key: ARROW-18282 URL: https://issues.apache.org/jira/browse/ARROW-18282 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR 14395 | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement the case where {{step != 1}}; this issue is to support step values other than 1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
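A minimal sketch of the intended semantics, assuming {{list_slice}} mirrors Python's extended slicing applied to each list element ({{step}} support is what this issue tracks, so the call below is illustrative):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where list_slice accepts step != 1

arr = pa.array([[1, 2, 3, 4, 5], [6, 7, 8]])

# Expected to match [x[0:5:2] for x in arr.to_pylist()]
result = pc.list_slice(arr, start=0, stop=5, step=2)
print(result.to_pylist())  # [[1, 3, 5], [6, 8]]
{code}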
[jira] [Created] (ARROW-18281) [C++][Python] Support start == stop in list_slice kernel
Miles Granger created ARROW-18281: - Summary: [C++][Python] Support start == stop in list_slice kernel Key: ARROW-18281 URL: https://issues.apache.org/jira/browse/ARROW-18281 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR 14395 | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement the case where {{start == stop}}, which should return empty lists. -- This message was sent by Atlassian Jira (v8.20.10#820010)
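A sketch of the expected behavior, assuming the kernel follows Python slice semantics (where {{x[1:1]}} is empty):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where start == stop is supported

arr = pa.array([[1, 2, 3], [4, 5]])

# start == stop should yield an empty list for every element
print(pc.list_slice(arr, start=1, stop=1).to_pylist())  # [[], []]
{code}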
[jira] [Created] (ARROW-18280) [C++][Python] Support slicing to arbitrary end in list_slice kernel
Miles Granger created ARROW-18280: - Summary: [C++][Python] Support slicing to arbitrary end in list_slice kernel Key: ARROW-18280 URL: https://issues.apache.org/jira/browse/ARROW-18280 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement what to do when {{stop == std::nullopt}}, which should slice to the end of each list element. -- This message was sent by Atlassian Jira (v8.20.10#820010)
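A sketch of the expected behavior, assuming an unset {{stop}} behaves like Python's {{x[start:]}}:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where stop=None (std::nullopt) is supported

arr = pa.array([[1, 2, 3], [4, 5]])

# stop=None should slice to the end of each list, like x[1:] in Python
print(pc.list_slice(arr, start=1, stop=None).to_pylist())  # [[2, 3], [5]]
{code}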
[jira] [Created] (ARROW-18279) [C++][Python] Implement HashAggregate UDF
Vibhatha Lakmal Abeykoon created ARROW-18279: Summary: [C++][Python] Implement HashAggregate UDF Key: ARROW-18279 URL: https://issues.apache.org/jira/browse/ARROW-18279 Project: Apache Arrow Issue Type: Sub-task Components: C++, Python Reporter: Vibhatha Lakmal Abeykoon Assignee: Vibhatha Lakmal Abeykoon Fix For: 11.0.0 Implement hash-aggregate user-defined functions and allow using them with `group_by`/`agg` operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
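A sketch of how such a UDF might be registered and used, assuming the API mirrors the existing scalar-aggregate UDF registration ({{pc.register_aggregate_function}}); the function name {{median_udf}} and the exact call shapes are illustrative:
{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def median_udf(ctx, array):
    # Receives the values of one group and returns a single scalar
    return pa.scalar(np.nanmedian(array.to_numpy(zero_copy_only=False)))

pc.register_aggregate_function(
    median_udf, "median_udf",
    {"summary": "median UDF", "description": "computes the median of each group"},
    {"x": pa.float64()}, pa.float64())

table = pa.table({"key": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})
print(table.group_by("key").aggregate([("x", "median_udf")]))
{code}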
[jira] [Created] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
Rok Mihevc created ARROW-18278: -- Summary: [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error Key: ARROW-18278 URL: https://issues.apache.org/jira/browse/ARROW-18278 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Rok Mihevc When building with maven on M1 [as per docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]:
{code:bash}
mvn clean install
mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
mvn -Darrow.cpp.build.dir=/arrow/java-dist/lib/ -Parrow-jni clean install
{code}
I get the following error:
{code:bash}
[INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root ---
-- Building using CMake version: 3.24.2
-- The C compiler identification is AppleClang 14.0.0.1429
-- The CXX compiler identification is AppleClang 14.0.0.1429
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Java: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found version "11.0.16")
-- Found JNI: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include found components: AWT JVM
CMake Error at dataset/CMakeLists.txt:18 (find_package):
  By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "ArrowDataset", but CMake did not find one.

  Could not find a package configuration file provided by "ArrowDataset"
  with any of the following names:

    ArrowDatasetConfig.cmake
    arrowdataset-config.cmake

  Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set
  "ArrowDataset_DIR" to a directory containing one of the above files. If
  "ArrowDataset" provides a separate development package or SDK, be sure it
  has been installed.

-- Configuring incomplete, errors occurred!
See also "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log".
See also "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log".
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:370)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:351)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:171)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:163)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
{code}
[jira] [Created] (ARROW-18277) Unable to install R's arrow on RStudio
Connor created ARROW-18277: -- Summary: Unable to install R's arrow on RStudio Key: ARROW-18277 URL: https://issues.apache.org/jira/browse/ARROW-18277 Project: Apache Arrow Issue Type: Bug Reporter: Connor Hello! Following the instructions on [https://arrow.apache.org/docs/r/articles/install.html] I am filing this ticket for help installing R's arrow package on RStudio. Output below:
{code:java}
> Sys.setenv(ARROW_R_DEV=TRUE)
> install.packages("arrow")
Installing package into ‘/var/lib/rstudio-server/local/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/src/contrib/arrow_10.0.0.tar.gz'
Content type 'application/x-gzip' length 4843530 bytes (4.6 MB)
==
downloaded 4.6 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
    For build options and troubleshooting, see the install vignette:
    https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2
cmake
trying URL 'https://github.com/Kitware/CMake/releases/download/v3.21.4/cmake-3.21.4-linux-x86_64.tar.gz'
Content type 'application/octet-stream' length 44684259 bytes (42.6 MB)
==
downloaded 42.6 MB

arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/RtmpRnb6XO/file4484b64e7cde3' DEST_DIR='libarrow/arrow-10.0.0' CMAKE='/tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='/usr/bin/gcc -fPIC' CXX='/usr/bin/g++ -fPIC -std=c++17' LDFLAGS='-L/usr/local/lib' ARROW_S3='OFF' ARROW_GCS='OFF'
++ pwd
+ : /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
+ : tools/cpp
+ : /tmp/RtmpRnb6XO/file4484b64e7cde3
+ : libarrow/arrow-10.0.0
+ : /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
++ mkdir -p libarrow/arrow-10.0.0
++ cd libarrow/arrow-10.0.0
++ pwd
+ DEST_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0
++ nproc
+ : 16
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /tmp/RtmpRnb6XO/file4484b64e7cde3
+ pushd /tmp/RtmpRnb6XO/file4484b64e7cde3
/tmp/RtmpRnb6XO/file4484b64e7cde3 /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
+ /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_DEBUG_MODE=OFF -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF -Dxsimd_SOURCE= -Dzstd_SOURCE= -G 'Unix Makefiles' /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
-- Building using CMake version: 3.21.4
-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 10.0.0 (full: '10.0.0')
-- Arrow SO version: 1000 (full: 1000.0.0)
-- clang-tidy 14 not found
-- clang-format 14 not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found Python3: /usr/local/bin/python3.9 (found version "3.9.4") found components: Interpreter
-- Found cpplint executable at /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
-- Using ld linker
-- Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build
{code}
[jira] [Created] (ARROW-18276) Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file
Moritz Meister created ARROW-18276: -- Summary: Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file Key: ARROW-18276 URL: https://issues.apache.org/jira/browse/ARROW-18276 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Environment: pyarrow 10.0.0 fsspec 2022.7.1 pandas 1.3.3 python 3.8.11. Reporter: Moritz Meister Hey! I am trying to read a CSV file from HDFS using pyarrow together with fsspec. I used to do this with pyarrow 9.0.0 and fsspec 2022.7.1; however, after I upgraded to pyarrow 10.0.0 this stopped working. I am not quite sure whether this is an incompatibility introduced in the new pyarrow version or a bug in fsspec, so if I am in the wrong place here, please let me know. Apart from pyarrow 10.0.0 and fsspec 2022.7.1, I am using pandas 1.3.3 and python 3.8.11. Here is the full stack trace:
```python
pd.read_csv("hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585
--> 586     return _read(filepath_or_buffer, kwds)
    587
    588

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483
    484     if chunksize or iterator:

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810
--> 811         self._engine = self._make_engine(self.engine)
    812
    813     def close(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038         )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041
   1042     def _failover_to_python(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    220         Let the readers open IOHandles after they are done with their potential raises.
    221         """
--> 222         self.handles = get_handle(
    223             src,
    224             "r",

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    607
    608     # open URLs
--> 609     ioargs = _get_filepath_or_buffer(
    610         path_or_buf,
    611         encoding=encoding,

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/common.py in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    356
    357     try:
--> 358         file_obj = fsspec.open(
    359             filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
    360         ).open()

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/fsspec/core.py in open(self)
    133         during the life of the file-like it generates.
    134         """
--> 135         return self.__enter__()
    136
    137     def close(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/fsspec/core.py in __enter__(self)
    101         mode = self.mode.replace("t", "").replace("b", "") + "b"
    102
--> 103         f = self.fs.open(self.path, mode=mode)
    104
```
[jira] [Created] (ARROW-18275) Allow custom reader/writer implementation for arrow dataset read/write path
Chang She created ARROW-18275: - Summary: Allow custom reader/writer implementation for arrow dataset read/write path Key: ARROW-18275 URL: https://issues.apache.org/jira/browse/ARROW-18275 Project: Apache Arrow Issue Type: New Feature Components: Python Affects Versions: 10.0.0 Reporter: Chang She We're implementing a "versionable" data format whose read/write path has some metadata handling that we currently can't plug into the native pyarrow write_dataset and pa.dataset.dataset mechanism. What we've done for now is provide our own `lance.write_dataset` and `lance.dataset` interfaces which know about the versioning; if you use the native arrow ones instead, they read/write an unversioned dataset. It would be great if:
1. the arrow interfaces provided a way for custom data formats to supply their own Arrow-compliant reader/writer implementations, so we could delete our custom interface and stick with the native pyarrow interface;
2. the pyarrow interface could support custom kwargs like "version=5" or "as_of=" or "version='latest'".
For reference, this is what our custom C++ dataset implementation looks like: https://github.com/eto-ai/lance/blob/main/cpp/include/lance/arrow/dataset.h -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18274) [Go] Sparse union of structs is buggy
Laurent Querel created ARROW-18274: -- Summary: [Go] Sparse union of structs is buggy Key: ARROW-18274 URL: https://issues.apache.org/jira/browse/ARROW-18274 Project: Apache Arrow Issue Type: Bug Components: Go Affects Versions: 10.0.0 Reporter: Laurent Querel Union of structs is currently buggy in v10. See the following example.
{code:go}
dt1 := arrow.SparseUnionOf([]arrow.Field{
	{Name: "c", Type: &arrow.DictionaryType{
		IndexType: arrow.PrimitiveTypes.Uint16,
		ValueType: arrow.BinaryTypes.String,
		Ordered:   false,
	}},
}, []arrow.UnionTypeCode{0})
dt2 := arrow.SparseUnionOf([]arrow.Field{
	{Name: "a", Type: dt1},
}, []arrow.UnionTypeCode{0})
pool := memory.NewGoAllocator()
builder := array.NewSparseUnionBuilder(pool, dt2)
{code}
The created array is unusable because the memo table of the dictionary builder (field 'c') is nil. When I replace the struct with a second union (so two nested unions), the dictionary builder is properly initialized. First analysis:
- `NewSparseUnionBuilder` creates the builders for each variant and also calls defer builder.Release().
- The struct's Release method calls the Release method of every field even if the internal counter is not 0, so the Release method of the second union is called, followed by the Release method of the dictionary.
This bug doesn't happen with two nested unions, as there the internal counter is properly tested. In the first place I don't understand why the Release method of each variant is called just after the creation of the union builder. I also don't understand why the Release method of the struct calls the Release method of each field independently of the value of the internal counter. Any idea? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18273) For extension types, compute kernels should default to storage types?
Chang She created ARROW-18273: - Summary: For extension types, compute kernels should default to storage types? Key: ARROW-18273 URL: https://issues.apache.org/jira/browse/ARROW-18273 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 10.0.0 Reporter: Chang She Currently, compute kernels don't recognize extension types, so if you define semantic types to indicate things like "this string column is an image label", you then cannot do things like equals on it. For example, take the LabelType from https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py
```
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: class LabelType(pa.PyExtensionType):
   ...:     def __init__(self):
   ...:         pa.PyExtensionType.__init__(self, pa.string())
   ...:
   ...:     def __reduce__(self):
   ...:         return LabelType, ()
   ...:

In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), pa.array(['cat', 'dog', 'person']))], names=['label'])

In [5]: tbl.filter(pc.field('label') == 'cat')
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [5], line 1
----> 1 tbl.filter(pc.field('label') == 'cat')

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)
```
For query systems that push some of the compute down to Arrow (e.g., DuckDB), it also means that it's much harder for users to work with datasets containing extension types, because you don't know which functions will actually work. Instead, if we made the compute kernels default to the storage type, it would make the extension system a lot easier to work with in Arrow. -- This message was sent by Atlassian Jira (v8.20.10#820010)
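Until then, a workaround sketch (not from the report; assumes the single-chunk table from the example above): run the kernel against the extension array's storage and filter with the resulting mask:
```python
# Hypothetical workaround: compute on the storage array, then filter the table
storage = tbl["label"].chunk(0).storage  # the underlying string array
mask = pc.equal(storage, "cat")
print(tbl.filter(mask))
```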
[GitHub] [arrow-julia] alex-s-gardner opened a new issue, #359: Arrow changes data type from input in unexpected ways
alex-s-gardner opened a new issue, #359: URL: https://github.com/apache/arrow-julia/issues/359 In this MWE the output is unrecognizable compared to the input (the path to the Zarr file is public, so it can be run locally):
```
dc = Zarr.zopen("http://its-live-data.s3.amazonaws.com/datacubes/v02/N20E100/ITS_LIVE_vel_EPSG32647_G0120_X65_Y325.zarr")
C = dc["satellite_img1"][:]
input = DataFrame([C, C], :auto)
Arrow.write("test.arrow", input)
output = Arrow.Table("test.arrow")
```
`input.x1` looks like this:
```
1460-element Vector{Zarr.MaxLengthStrings.MaxLengthString{2, UInt32}}:
 "1A"
 ⋮
 "8."
```
while `output.x1` looks like this:
```
1460-element Arrow.List{String, Int32, Vector{UInt8}}:
 "1\0"
 ⋮
 "\0\0"
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string
Zepu Zhang created ARROW-18272: -- Summary: [pyarrow] ParquetFile does not recognize GCS cloud path as a string Key: ARROW-18272 URL: https://issues.apache.org/jira/browse/ARROW-18272 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Reporter: Zepu Zhang I have a Parquet file at path = 'gs://mybucket/abc/d.parquet' `pyarrow.parquet.read_metadata(path)` works fine. `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file 'gs://mybucket/abc/d.parquet'". It looks like ParquetFile is missing the path-resolution logic found in `read_metadata`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
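A workaround sketch, assuming the GCS filesystem can be resolved from the URI with pyarrow.fs (the bucket path is the reporter's example):
{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# Resolve the filesystem and the in-filesystem path from the URI
gcs, path = fs.FileSystem.from_uri("gs://mybucket/abc/d.parquet")
with gcs.open_input_file(path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)
{code}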
[jira] [Created] (ARROW-18271) [C++] Remove GlobalForkSafeMutex
Antoine Pitrou created ARROW-18271: -- Summary: [C++] Remove GlobalForkSafeMutex Key: ARROW-18271 URL: https://issues.apache.org/jira/browse/ARROW-18271 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 11.0.0 Now that we have a proper at-fork facility, the {{GlobalForkSafeMutex}} has probably become pointless and therefore can be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18270) [Python] Remove gcc 4.9 compatibility code
Antoine Pitrou created ARROW-18270: -- Summary: [Python] Remove gcc 4.9 compatibility code Key: ARROW-18270 URL: https://issues.apache.org/jira/browse/ARROW-18270 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Since we now require a C++17-compliant compiler, we don't support gcc 4.9 anymore. The following code can probably be simplified: https://github.com/apache/arrow/blob/619b034bd3e14937fa5d12f8e86fa83e7444b886/python/pyarrow/src/arrow/python/datetime.cc#L41 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18269) Slash character in partition value handling
Vadym Dytyniak created ARROW-18269: -- Summary: Slash character in partition value handling Key: ARROW-18269 URL: https://issues.apache.org/jira/browse/ARROW-18269 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Reporter: Vadym Dytyniak The example below shows that pyarrow does not correctly handle a partition value that contains '/':
{code:java}
import pandas as pd
import pyarrow as pa
from pyarrow import dataset as ds

df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]
df = pa.concat_tables(tables).to_pandas()
print(df.head())
{code}
{code:java}
   value instrument_id
0      1             A
1      2             B
{code}
Expected behaviour: Option 1: The result should be:
{code:java}
   value instrument_id
0      1           A/Z
1      2             B
{code}
Option 2: An error should be raised to disallow '/' in partition values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18268) [Poss]
Lorenzo Isella created ARROW-18268: -- Summary: [Poss] Key: ARROW-18268 URL: https://issues.apache.org/jira/browse/ARROW-18268 Project: Apache Arrow Issue Type: Bug Reporter: Lorenzo Isella -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18267) [R] Possible bug in Handling Blank Conversion to Missing Value
Lorenzo Isella created ARROW-18267: -- Summary: [R] Possible bug in Handling Blank Conversion to Missing Value Key: ARROW-18267 URL: https://issues.apache.org/jira/browse/ARROW-18267 Project: Apache Arrow Issue Type: Bug Reporter: Lorenzo Isella -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18266) [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one
Nicola Crane created ARROW-18266: Summary: [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one Key: ARROW-18266 URL: https://issues.apache.org/jira/browse/ARROW-18266 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane It's not all that clear from our docs that if we want to read in a Parquet file and change the schema, we need to call the {{cast()}} method on the Table, e.g.
{code:r}
# Write out data
data <- tibble::tibble(x = c(letters[1:5], NA), y = 1:6)
data_with_schema <- arrow_table(data, schema = schema(x = string(), y = int64()))
write_parquet(data_with_schema, "data_with_schema.parquet")

# Read in data while specifying a schema
data_in <- read_parquet("data_with_schema.parquet", as_data_frame = FALSE)
data_in$cast(target_schema = schema(x = string(), y = int32()))
{code}
We should document this more clearly. Perhaps we could even update the code here to do some of this automatically if we pass a schema to the {...} argument of {{read_parquet}} _and_ the returned data doesn't match the desired schema? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18265) [C++] Allow FieldPath to work with ListElement
Miles Granger created ARROW-18265: - Summary: [C++] Allow FieldPath to work with ListElement Key: ARROW-18265 URL: https://issues.apache.org/jira/browse/ARROW-18265 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 {{FieldRef::FromDotPath}} can parse a single list element field, i.e. {{path.to.list[0]}}, but it does not work in practice, failing with: _struct_field: cannot subscript field of type list<>_ Supporting a slice or multiple list elements is not within the scope of this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18264) [Python] Add Time64Scalar.value field
created ARROW-18264: Summary: [Python] Add Time64Scalar.value field Key: ARROW-18264 URL: https://issues.apache.org/jira/browse/ARROW-18264 Project: Apache Arrow Issue Type: Improvement Environment: pyarrow==10.0.0 No pandas installed Reporter: At the moment, when pandas is not installed, it is not possible to access the underlying value of a Time64Scalar of "ns" precision without casting it to int64. The following raises an error:
{code:java}
time_ns = pa.array([1, 2, 3], pa.time64("ns"))
scalar = time_ns[0]
scalar.as_py()
{code}
The workaround is to do:
{code:java}
scalar.cast(pa.int64()).as_py()
{code}
It'd be good if a value field was added to Time64Scalar, just like TimestampScalar has:
{code:java}
timestamp_ns = pa.array([1, 2, 3], pa.timestamp("ns", "UTC"))
scalar = timestamp_ns[0]
scalar.value
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18263) [R] Error when trying to write POSIXlt data to CSV
Nicola Crane created ARROW-18263: Summary: [R] Error when trying to write POSIXlt data to CSV Key: ARROW-18263 URL: https://issues.apache.org/jira/browse/ARROW-18263 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane I get an error when trying to write a tibble of POSIXlt data to a file. The error is a bit misleading, as it refers to the column being of length 0.
{code:r}
posixlt_data <- tibble::tibble(x = as.POSIXlt(Sys.time()))
write_csv_arrow(posixlt_data, "posixlt_data.csv")
{code}
{code:r}
Error: Invalid: Unsupported Type:POSIXlt of length 0
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18262) [Archery][CI] New version of pygit2 fails to import and makes archery commands fail
Raúl Cumplido created ARROW-18262: - Summary: [Archery][CI] New version of pygit2 fails to import and makes archery commands fail Key: ARROW-18262 URL: https://issues.apache.org/jira/browse/ARROW-18262 Project: Apache Arrow Issue Type: Bug Components: Archery, Continuous Integration Reporter: Raúl Cumplido Assignee: Raúl Cumplido The newly published pygit2==1.11.0 seems to have some issues, and some of our nightly jobs that require pygit2 are failing. As an example, we have stopped receiving nightly reports. The issue is tracked on pygit2 here: https://github.com/libgit2/pygit2/issues/1176 I can reproduce locally:
{code:java}
-> import pygit2
(Pdb) n
ImportError: libssl-9ad06800.so.1.1.1k: cannot open shared object file: No such file or directory
> /home/raulcd/code/arrow/dev/archery/archery/crossbow/core.py(45)()
{code}
We should probably pin pygit2 to <1.11.0 in the meantime. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18261) Interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh
Đinh Thanh Tùng created ARROW-18261: --- Summary: Interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh Key: ARROW-18261 URL: https://issues.apache.org/jira/browse/ARROW-18261 Project: Apache Arrow Issue Type: New Feature Reporter: Đinh Thanh Tùng See the details of the interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh at [https://thietkenoithatatz.com/thiet-ke/thiet-ke-noi-that-chung-cu-grand-sapphire-gs-can-1-phong-ngu-hien-dai/] -- This message was sent by Atlassian Jira (v8.20.10#820010)