[jira] [Created] (ARROW-18284) CMake cannot find package configuration file in CMAKE_MODULE_PATH
ThisName created ARROW-18284: Summary: CMake cannot find package configuration file in CMAKE_MODULE_PATH Key: ARROW-18284 URL: https://issues.apache.org/jira/browse/ARROW-18284 Project: Apache Arrow Issue Type: Bug Reporter: ThisName Hi, I am hitting exactly the same issue as described here: [https://github.com/apache/arrow/pull/14586] This happens to some people when they try to install pyarrow from pip under Windows. Since opening a PR on GitHub alone might not get any attention, I am opening an issue here as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18283) [R] Update Arrow for R cheatsheet to include GCS
Stephanie Hazlitt created ARROW-18283: - Summary: [R] Update Arrow for R cheatsheet to include GCS Key: ARROW-18283 URL: https://issues.apache.org/jira/browse/ARROW-18283 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Reporter: Stephanie Hazlitt The Arrow for R cheatsheet was released in 8.0.0. It could use an update to highlight new features released since then, for example reading and writing to Google Cloud Storage (in addition to S3). https://github.com/apache/arrow/tree/master/r/cheatsheet -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18282) [C++][Python] Support step slicing in list_slice kernel
Miles Granger created ARROW-18282: - Summary: [C++][Python] Support step slicing in list_slice kernel Key: ARROW-18282 URL: https://issues.apache.org/jira/browse/ARROW-18282 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR 14395 | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement the case where {{step != 1}}; this issue is to support step values other than 1. -- This message was sent by Atlassian Jira (v8.20.10#820010)
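A minimal sketch of the intended semantics, assuming {{list_slice}} mirrors Python's extended slicing applied to each list element ({{step}} support is what this issue tracks, so the call below is illustrative):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where list_slice accepts step != 1

arr = pa.array([[1, 2, 3, 4, 5], [6, 7, 8]])

# Expected to match [x[0:5:2] for x in arr.to_pylist()]
result = pc.list_slice(arr, start=0, stop=5, step=2)
print(result.to_pylist())  # [[1, 3, 5], [6, 8]]
{code}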
[jira] [Created] (ARROW-18281) [C++][Python] Support start == stop in list_slice kernel
Miles Granger created ARROW-18281: - Summary: [C++][Python] Support start == stop in list_slice kernel Key: ARROW-18281 URL: https://issues.apache.org/jira/browse/ARROW-18281 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR 14395 | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement the case where {{start == stop}}, which should return empty lists. -- This message was sent by Atlassian Jira (v8.20.10#820010)
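A sketch of the expected behavior, assuming the kernel follows Python slice semantics (where {{x[1:1]}} is empty):
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where start == stop is supported

arr = pa.array([[1, 2, 3], [4, 5]])

# start == stop should yield an empty list for every element
print(pc.list_slice(arr, start=1, stop=1).to_pylist())  # [[], []]
{code}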
[jira] [Created] (ARROW-18280) [C++][Python] Support slicing to arbitrary end in list_slice kernel
Miles Granger created ARROW-18280: - Summary: [C++][Python] Support slicing to arbitrary end in list_slice kernel Key: ARROW-18280 URL: https://issues.apache.org/jira/browse/ARROW-18280 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 [GitHub PR | https://github.com/apache/arrow/pull/14395] adds the {{list_slice}} kernel, but does not implement what to do when {{stop == std::nullopt}}, which should slice to the end of each list element. -- This message was sent by Atlassian Jira (v8.20.10#820010)
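A sketch of the expected behavior, assuming an unset {{stop}} behaves like Python's {{x[start:]}}:
{code:python}
import pyarrow as pa
import pyarrow.compute as pc  # assumes a build where stop=None (std::nullopt) is supported

arr = pa.array([[1, 2, 3], [4, 5]])

# stop=None should slice to the end of each list, like x[1:] in Python
print(pc.list_slice(arr, start=1, stop=None).to_pylist())  # [[2, 3], [5]]
{code}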
[jira] [Created] (ARROW-18279) [C++][Python] Implement HashAggregate UDF
Vibhatha Lakmal Abeykoon created ARROW-18279: Summary: [C++][Python] Implement HashAggregate UDF Key: ARROW-18279 URL: https://issues.apache.org/jira/browse/ARROW-18279 Project: Apache Arrow Issue Type: Sub-task Components: C++, Python Reporter: Vibhatha Lakmal Abeykoon Assignee: Vibhatha Lakmal Abeykoon Fix For: 11.0.0 Implement hash-aggregate user-defined functions and allow using them with `group_by`/`agg` operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
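A sketch of how such a UDF might be registered and used, assuming the API mirrors the existing scalar-aggregate UDF registration ({{pc.register_aggregate_function}}); the function name {{median_udf}} and the exact call shapes are illustrative:
{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def median_udf(ctx, array):
    # Receives the values of one group and returns a single scalar
    return pa.scalar(np.nanmedian(array.to_numpy(zero_copy_only=False)))

pc.register_aggregate_function(
    median_udf, "median_udf",
    {"summary": "median UDF", "description": "computes the median of each group"},
    {"x": pa.float64()}, pa.float64())

table = pa.table({"key": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})
print(table.group_by("key").aggregate([("x", "median_udf")]))
{code}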
[jira] [Created] (ARROW-18278) [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error
Rok Mihevc created ARROW-18278: -- Summary: [Java] Maven generate-libs-jni-macos-linux on M1 fails due to cmake error Key: ARROW-18278 URL: https://issues.apache.org/jira/browse/ARROW-18278 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Rok Mihevc When building with maven on M1 [as per docs|https://arrow.apache.org/docs/dev/developers/java/building.html#id3]:
{code:bash}
mvn clean install
mvn generate-resources -Pgenerate-libs-jni-macos-linux -N
mvn -Darrow.cpp.build.dir=/arrow/java-dist/lib/ -Parrow-jni clean install
{code}
I get the following error:
{code:bash}
[INFO] --- exec-maven-plugin:3.1.0:exec (jni-cmake) @ arrow-java-root ---
-- Building using CMake version: 3.24.2
-- The C compiler identification is AppleClang 14.0.0.1429
-- The CXX compiler identification is AppleClang 14.0.0.1429
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Library/Developer/CommandLineTools/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Library/Developer/CommandLineTools/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Java: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/bin/java (found version "11.0.16")
-- Found JNI: /Library/Java/JavaVirtualMachines/zulu-11.jdk/Contents/Home/include found components: AWT JVM
CMake Error at dataset/CMakeLists.txt:18 (find_package):
  By not providing "FindArrowDataset.cmake" in CMAKE_MODULE_PATH this project
  has asked CMake to find a package configuration file provided by
  "ArrowDataset", but CMake did not find one.

  Could not find a package configuration file provided by "ArrowDataset"
  with any of the following names:

    ArrowDatasetConfig.cmake
    arrowdataset-config.cmake

  Add the installation prefix of "ArrowDataset" to CMAKE_PREFIX_PATH or set
  "ArrowDataset_DIR" to a directory containing one of the above files. If
  "ArrowDataset" provides a separate development package or SDK, be sure it
  has been installed.

-- Configuring incomplete, errors occurred!
See also "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeOutput.log".
See also "/Users/rok/Documents/repos/arrow/java-jni/CMakeFiles/CMakeError.log".
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    at org.apache.commons.exec.DefaultExecutor.executeInternal (DefaultExecutor.java:404)
    at org.apache.commons.exec.DefaultExecutor.execute (DefaultExecutor.java:166)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:1000)
    at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:947)
    at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:471)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:370)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:351)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:171)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:163)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
{code}
[jira] [Created] (ARROW-18277) Unable to install R's arrow on RStudio
Connor created ARROW-18277: -- Summary: Unable to install R's arrow on RStudio Key: ARROW-18277 URL: https://issues.apache.org/jira/browse/ARROW-18277 Project: Apache Arrow Issue Type: Bug Reporter: Connor Hello! Following the instructions on [https://arrow.apache.org/docs/r/articles/install.html] I am filing this ticket for help installing R's arrow package on RStudio. Output below:
{code:java}
> Sys.setenv(ARROW_R_DEV=TRUE)
> install.packages("arrow")
Installing package into ‘/var/lib/rstudio-server/local/site-library’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/src/contrib/arrow_10.0.0.tar.gz'
Content type 'application/x-gzip' length 4843530 bytes (4.6 MB)
==
downloaded 4.6 MB

* installing *source* package ‘arrow’ ...
** package ‘arrow’ successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
    For build options and troubleshooting, see the install vignette:
    https://cran.r-project.org/web/packages/arrow/vignettes/install.html
*** Building with MAKEFLAGS= -j2
cmake
trying URL 'https://github.com/Kitware/CMake/releases/download/v3.21.4/cmake-3.21.4-linux-x86_64.tar.gz'
Content type 'application/octet-stream' length 44684259 bytes (42.6 MB)
==
downloaded 42.6 MB

arrow with SOURCE_DIR='tools/cpp' BUILD_DIR='/tmp/RtmpRnb6XO/file4484b64e7cde3' DEST_DIR='libarrow/arrow-10.0.0' CMAKE='/tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake' EXTRA_CMAKE_FLAGS='' CC='/usr/bin/gcc -fPIC' CXX='/usr/bin/g++ -fPIC -std=c++17' LDFLAGS='-L/usr/local/lib' ARROW_S3='OFF' ARROW_GCS='OFF'
++ pwd
+ : /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
+ : tools/cpp
+ : /tmp/RtmpRnb6XO/file4484b64e7cde3
+ : libarrow/arrow-10.0.0
+ : /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake
++ cd tools/cpp
++ pwd
+ SOURCE_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
++ mkdir -p libarrow/arrow-10.0.0
++ cd libarrow/arrow-10.0.0
++ pwd
+ DEST_DIR=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0
++ nproc
+ : 16
+ '[' '' '!=' '' ']'
+ '[' '' = false ']'
+ ARROW_DEFAULT_PARAM=OFF
+ mkdir -p /tmp/RtmpRnb6XO/file4484b64e7cde3
+ pushd /tmp/RtmpRnb6XO/file4484b64e7cde3
/tmp/RtmpRnb6XO/file4484b64e7cde3 /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow
+ /tmp/RtmpRnb6XO/file4484b4c7f3eba/cmake-3.21.4-linux-x86_64/bin/cmake -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_COMPUTE=ON -DARROW_CSV=ON -DARROW_DATASET=ON -DARROW_DEPENDENCY_SOURCE=AUTO -DAWSSDK_SOURCE= -DARROW_FILESYSTEM=ON -DARROW_GCS=OFF -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=ON -DARROW_JSON=ON -DARROW_PARQUET=ON -DARROW_S3=OFF -DARROW_WITH_BROTLI=OFF -DARROW_WITH_BZ2=OFF -DARROW_WITH_LZ4=ON -DARROW_WITH_RE2=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_UTF8PROC=ON -DARROW_WITH_ZLIB=OFF -DARROW_WITH_ZSTD=OFF -DARROW_VERBOSE_THIRDPARTY_BUILD=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_DEBUG_MODE=OFF -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_INSTALL_PREFIX=/tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/libarrow/arrow-10.0.0 -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON -DCMAKE_UNITY_BUILD=OFF -Dxsimd_SOURCE= -Dzstd_SOURCE= -G 'Unix Makefiles' /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp
-- Building using CMake version: 3.21.4
-- The C compiler identification is GNU 6.3.0
-- The CXX compiler identification is GNU 6.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Arrow version: 10.0.0 (full: '10.0.0')
-- Arrow SO version: 1000 (full: 1000.0.0)
-- clang-tidy 14 not found
-- clang-format 14 not found
-- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
-- infer not found
-- Found Python3: /usr/local/bin/python3.9 (found version "3.9.4") found components: Interpreter
-- Found cpplint executable at /tmp/RtmpppnJl6/R.INSTALL448006afb2688/arrow/tools/cpp/build-support/cpplint.py
-- System processor: x86_64
-- Performing Test CXX_SUPPORTS_SSE4_2
-- Performing Test CXX_SUPPORTS_SSE4_2 - Success
-- Performing Test CXX_SUPPORTS_AVX2
-- Performing Test CXX_SUPPORTS_AVX2 - Success
-- Performing Test CXX_SUPPORTS_AVX512
-- Performing Test CXX_SUPPORTS_AVX512 - Success
-- Arrow build warning level: PRODUCTION
-- Using ld linker
-- Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})
-- Build
{code}
[jira] [Created] (ARROW-18276) Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file
Moritz Meister created ARROW-18276: -- Summary: Reading from hdfs using pyarrow 10.0.0 throws OSError: [Errno 22] Opening HDFS file Key: ARROW-18276 URL: https://issues.apache.org/jira/browse/ARROW-18276 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Environment: pyarrow 10.0.0 fsspec 2022.7.1 pandas 1.3.3 python 3.8.11. Reporter: Moritz Meister Hey! I am trying to read a CSV file from HDFS using pyarrow together with fsspec. I used to do this with pyarrow 9.0.0 and fsspec 2022.7.1; however, after I upgraded to pyarrow 10.0.0 this stopped working. I am not quite sure whether this is an incompatibility introduced in the new pyarrow version or a bug in fsspec, so if I am in the wrong place here, please let me know. Apart from pyarrow 10.0.0 and fsspec 2022.7.1, I am using pandas 1.3.3 and python 3.8.11. Here is the full stack trace:
```python
pd.read_csv("hdfs://10.0.2.15:8020/Projects/testing/testing_Training_Datasets/transactions_view_fraud_batch_fv_1_1/validation/part-0-42b57ad2-57eb-4a63-bfaa-7375e82863e8-c000.csv")
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585
--> 586     return _read(filepath_or_buffer, kwds)
    587
    588

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483
    484     if chunksize or iterator:

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810
--> 811         self._engine = self._make_engine(self.engine)
    812
    813     def close(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038         )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041
   1042     def _failover_to_python(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     49
     50         # open handles
---> 51         self._open_handles(src, kwds)
     52         assert self.handles is not None
     53

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/parsers/base_parser.py in _open_handles(self, src, kwds)
    220         Let the readers open IOHandles after they are done with their potential raises.
    221         """
--> 222         self.handles = get_handle(
    223             src,
    224             "r",

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    607
    608     # open URLs
--> 609     ioargs = _get_filepath_or_buffer(
    610         path_or_buf,
    611         encoding=encoding,

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/pandas/io/common.py in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    356
    357     try:
--> 358         file_obj = fsspec.open(
    359             filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
    360         ).open()

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/fsspec/core.py in open(self)
    133         during the life of the file-like it generates.
    134         """
--> 135         return self.__enter__()
    136
    137     def close(self):

/srv/hops/anaconda/envs/theenv/lib/python3.8/site-packages/fsspec/core.py in __enter__(self)
    101         mode = self.mode.replace("t", "").replace("b", "") + "b"
    102
--> 103         f = self.fs.open(self.path, mode=mode)
    104
```
[jira] [Created] (ARROW-18275) Allow custom reader/writer implementation for arrow dataset read/write path
Chang She created ARROW-18275: - Summary: Allow custom reader/writer implementation for arrow dataset read/write path Key: ARROW-18275 URL: https://issues.apache.org/jira/browse/ARROW-18275 Project: Apache Arrow Issue Type: New Feature Components: Python Affects Versions: 10.0.0 Reporter: Chang She We're implementing a "versionable" data format whose read/write path has some metadata handling that we currently can't plug into the native pyarrow write_dataset and pa.dataset.dataset mechanism. What we've done for now is provide our own `lance.write_dataset` and `lance.dataset` interfaces which know about the versioning; if you use the native arrow ones instead, they read/write an unversioned dataset. It would be great if:
1. the arrow interfaces provided a way for custom data formats to supply their own Arrow-compliant reader/writer implementations, so we could delete our custom interface and stick with the native pyarrow interface;
2. the pyarrow interface could support custom kwargs like "version=5" or "as_of=" or "version='latest'".
For reference, this is what our custom C++ dataset implementation looks like: https://github.com/eto-ai/lance/blob/main/cpp/include/lance/arrow/dataset.h -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18274) [Go] Sparse union of structs is buggy
Laurent Querel created ARROW-18274: -- Summary: [Go] Sparse union of structs is buggy Key: ARROW-18274 URL: https://issues.apache.org/jira/browse/ARROW-18274 Project: Apache Arrow Issue Type: Bug Components: Go Affects Versions: 10.0.0 Reporter: Laurent Querel Union of structs is currently buggy in v10. See the following example.
{code:go}
dt1 := arrow.SparseUnionOf([]arrow.Field{
	{Name: "c", Type: &arrow.DictionaryType{
		IndexType: arrow.PrimitiveTypes.Uint16,
		ValueType: arrow.BinaryTypes.String,
		Ordered:   false,
	}},
}, []arrow.UnionTypeCode{0})
dt2 := arrow.SparseUnionOf([]arrow.Field{
	{Name: "a", Type: dt1},
}, []arrow.UnionTypeCode{0})
pool := memory.NewGoAllocator()
builder := array.NewSparseUnionBuilder(pool, dt2)
{code}
The created array is unusable because the memo table of the dictionary builder (field 'c') is nil. When I replace the struct with a second union (so two nested unions), the dictionary builder is properly initialized. First analysis:
- `NewSparseUnionBuilder` creates the builders for each variant and also calls defer builder.Release().
- The struct's Release method calls the Release method of every field even if the internal counter is not 0, so the Release method of the second union is called, followed by the Release method of the dictionary.
This bug doesn't happen with two nested unions, as there the internal counter is properly tested. In the first place I don't understand why the Release method of each variant is called just after the creation of the union builder. I also don't understand why the Release method of the struct calls the Release method of each field independently of the value of the internal counter. Any idea? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18273) For extension types, compute kernels should default to storage types?
Chang She created ARROW-18273: - Summary: For extension types, compute kernels should default to storage types? Key: ARROW-18273 URL: https://issues.apache.org/jira/browse/ARROW-18273 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 10.0.0 Reporter: Chang She Currently, compute kernels don't recognize extension types, so if you define semantic types to indicate things like "this string column is an image label", you then cannot do things like equals on it. For example, take the LabelType from https://github.com/apache/arrow/blob/c3824db8530075e0f8fd26974c193a310003c17a/python/pyarrow/tests/test_extension_type.py
```
In [1]: import pyarrow as pa

In [2]: import pyarrow.compute as pc

In [3]: class LabelType(pa.PyExtensionType):
   ...:     def __init__(self):
   ...:         pa.PyExtensionType.__init__(self, pa.string())
   ...:
   ...:     def __reduce__(self):
   ...:         return LabelType, ()
   ...:

In [4]: tbl = pa.Table.from_arrays([pa.ExtensionArray.from_storage(LabelType(), pa.array(['cat', 'dog', 'person']))], names=['label'])

In [5]: tbl.filter(pc.field('label') == 'cat')
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
Cell In [5], line 1
----> 1 tbl.filter(pc.field('label') == 'cat')

File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/table.pxi:2953, in pyarrow.lib.Table.filter()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:391, in pyarrow._exec_plan._filter_table()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/_exec_plan.pyx:128, in pyarrow._exec_plan.execplan()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File ~/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'equal' has no kernel matching input types (extension<arrow.py_extension_type<LabelType>>, string)
```
For query systems that push some of the compute down to Arrow (e.g., DuckDB), it also means that it's much harder for users to work with datasets containing extension types, because you don't know which functions will actually work. Instead, if we made the compute kernels default to the storage type, it would make the extension system a lot easier to work with in Arrow. -- This message was sent by Atlassian Jira (v8.20.10#820010)
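Until then, a workaround sketch (not from the report; assumes the single-chunk table from the example above): run the kernel against the extension array's storage and filter with the resulting mask:
```python
# Hypothetical workaround: compute on the storage array, then filter the table
storage = tbl["label"].chunk(0).storage  # the underlying string array
mask = pc.equal(storage, "cat")
print(tbl.filter(mask))
```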
[GitHub] [arrow-julia] alex-s-gardner opened a new issue, #359: Arrow changes data type from input in unexpected ways
alex-s-gardner opened a new issue, #359: URL: https://github.com/apache/arrow-julia/issues/359 In this MWE the output is unrecognizable compared to the input (the path to the Zarr file is public, so it can be run locally):
```
dc = Zarr.zopen("http://its-live-data.s3.amazonaws.com/datacubes/v02/N20E100/ITS_LIVE_vel_EPSG32647_G0120_X65_Y325.zarr")
C = dc["satellite_img1"][:]
input = DataFrame([C, C], :auto)
Arrow.write("test.arrow", input)
output = Arrow.Table("test.arrow")
```
`input.x1` looks like this:
```
1460-element Vector{Zarr.MaxLengthStrings.MaxLengthString{2, UInt32}}:
 "1A"
 ⋮
 "8."
```
while `output.x1` looks like this:
```
1460-element Arrow.List{String, Int32, Vector{UInt8}}:
 "1\0"
 ⋮
 "\0\0"
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-18272) [pyarrow] ParquetFile does not recognize GCS cloud path as a string
Zepu Zhang created ARROW-18272: -- Summary: [pyarrow] ParquetFile does not recognize GCS cloud path as a string Key: ARROW-18272 URL: https://issues.apache.org/jira/browse/ARROW-18272 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Reporter: Zepu Zhang I have a Parquet file at path = 'gs://mybucket/abc/d.parquet' `pyarrow.parquet.read_metadata(path)` works fine. `pyarrow.parquet.ParquetFile(path)` raises "Failed to open local file 'gs://mybucket/abc/d.parquet'". It looks like ParquetFile is missing the path-resolution logic found in `read_metadata`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
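A workaround sketch, assuming the GCS filesystem can be resolved from the URI with pyarrow.fs (the bucket path is the reporter's example):
{code:python}
import pyarrow.parquet as pq
from pyarrow import fs

# Resolve the filesystem and the in-filesystem path from the URI
gcs, path = fs.FileSystem.from_uri("gs://mybucket/abc/d.parquet")
with gcs.open_input_file(path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)
{code}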
[jira] [Created] (ARROW-18271) [C++] Remove GlobalForkSafeMutex
Antoine Pitrou created ARROW-18271: -- Summary: [C++] Remove GlobalForkSafeMutex Key: ARROW-18271 URL: https://issues.apache.org/jira/browse/ARROW-18271 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 11.0.0 Now that we have a proper at-fork facility, the {{GlobalForkSafeMutex}} has probably become pointless and therefore can be removed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18270) [Python] Remove gcc 4.9 compatibility code
Antoine Pitrou created ARROW-18270: -- Summary: [Python] Remove gcc 4.9 compatibility code Key: ARROW-18270 URL: https://issues.apache.org/jira/browse/ARROW-18270 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Since we now require a C++17-compliant compiler, we don't support gcc 4.9 anymore. The following code can probably be simplified: https://github.com/apache/arrow/blob/619b034bd3e14937fa5d12f8e86fa83e7444b886/python/pyarrow/src/arrow/python/datetime.cc#L41 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18269) Slash character in partition value handling
Vadym Dytyniak created ARROW-18269: -- Summary: Slash character in partition value handling Key: ARROW-18269 URL: https://issues.apache.org/jira/browse/ARROW-18269 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 10.0.0 Reporter: Vadym Dytyniak The example below shows that pyarrow does not correctly handle a partition value that contains '/':
{code:java}
import pandas as pd
import pyarrow as pa
from pyarrow import dataset as ds

df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]
df = pa.concat_tables(tables).to_pandas()
print(df.head())
{code}
{code:java}
   value instrument_id
0      1             A
1      2             B
{code}
Expected behaviour: Option 1: The result should be:
{code:java}
   value instrument_id
0      1           A/Z
1      2             B
{code}
Option 2: An error should be raised to disallow '/' in partition values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18268) [Poss]
Lorenzo Isella created ARROW-18268: -- Summary: [Poss] Key: ARROW-18268 URL: https://issues.apache.org/jira/browse/ARROW-18268 Project: Apache Arrow Issue Type: Bug Reporter: Lorenzo Isella -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18267) [R] Possible bug in Handling Blank Conversion to Missing Value
Lorenzo Isella created ARROW-18267: -- Summary: [R] Possible bug in Handling Blank Conversion to Missing Value Key: ARROW-18267 URL: https://issues.apache.org/jira/browse/ARROW-18267 Project: Apache Arrow Issue Type: Bug Reporter: Lorenzo Isella -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18266) [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one
Nicola Crane created ARROW-18266: Summary: [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one Key: ARROW-18266 URL: https://issues.apache.org/jira/browse/ARROW-18266 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane It's not all that clear from our docs that if we want to read in a Parquet file and change the schema, we need to call the {{cast()}} method on the Table, e.g.
{code:r}
# Write out data
data <- tibble::tibble(x = c(letters[1:5], NA), y = 1:6)
data_with_schema <- arrow_table(data, schema = schema(x = string(), y = int64()))
write_parquet(data_with_schema, "data_with_schema.parquet")

# Read in data while specifying a schema
data_in <- read_parquet("data_with_schema.parquet", as_data_frame = FALSE)
data_in$cast(target_schema = schema(x = string(), y = int32()))
{code}
We should document this more clearly. Perhaps we could even update the code here to do some of this automatically if we pass a schema to the {...} argument of {{read_parquet}} _and_ the returned data doesn't match the desired schema? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18265) [C++] Allow FieldPath to work with ListElement
Miles Granger created ARROW-18265: - Summary: [C++] Allow FieldPath to work with ListElement Key: ARROW-18265 URL: https://issues.apache.org/jira/browse/ARROW-18265 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Miles Granger Assignee: Miles Granger Fix For: 11.0.0 {{FieldRef::FromDotPath}} can parse a single list element field, i.e. {{path.to.list[0]}}, but it does not work in practice, failing with: _struct_field: cannot subscript field of type list<>_ Supporting a slice or multiple list elements is not within the scope of this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18264) [Python] Add Time64Scalar.value field
created ARROW-18264: Summary: [Python] Add Time64Scalar.value field Key: ARROW-18264 URL: https://issues.apache.org/jira/browse/ARROW-18264 Project: Apache Arrow Issue Type: Improvement Environment: pyarrow==10.0.0 No pandas installed Reporter: At the moment, when pandas is not installed, it is not possible to access the underlying value of a Time64Scalar of "ns" precision without casting it to int64. The following raises an error:
{code:java}
time_ns = pa.array([1, 2, 3], pa.time64("ns"))
scalar = time_ns[0]
scalar.as_py()
{code}
The workaround is to do:
{code:java}
scalar.cast(pa.int64()).as_py()
{code}
It'd be good if a value field was added to Time64Scalar, just like TimestampScalar has:
{code:java}
timestamp_ns = pa.array([1, 2, 3], pa.timestamp("ns", "UTC"))
scalar = timestamp_ns[0]
scalar.value
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18263) [R] Error when trying to write POSIXlt data to CSV
Nicola Crane created ARROW-18263: Summary: [R] Error when trying to write POSIXlt data to CSV Key: ARROW-18263 URL: https://issues.apache.org/jira/browse/ARROW-18263 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane I get an error when trying to write a tibble of POSIXlt data to a file. The error is a bit misleading, as it refers to the column being of length 0.
{code:r}
posixlt_data <- tibble::tibble(x = as.POSIXlt(Sys.time()))
write_csv_arrow(posixlt_data, "posixlt_data.csv")
{code}
{code:r}
Error: Invalid: Unsupported Type:POSIXlt of length 0
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18262) [Archery][CI] New version of pygit2 fails to import and makes archery commands fail
Raúl Cumplido created ARROW-18262: - Summary: [Archery][CI] New version of pygit2 fails to import and makes archery commands fail Key: ARROW-18262 URL: https://issues.apache.org/jira/browse/ARROW-18262 Project: Apache Arrow Issue Type: Bug Components: Archery, Continuous Integration Reporter: Raúl Cumplido Assignee: Raúl Cumplido The newly published pygit2==1.11.0 seems to have some issues, and some of our nightly jobs that require pygit2 are failing. As an example, we have stopped receiving nightly reports. The issue is tracked on pygit2 here: https://github.com/libgit2/pygit2/issues/1176 I can reproduce locally:
{code:java}
-> import pygit2
(Pdb) n
ImportError: libssl-9ad06800.so.1.1.1k: cannot open shared object file: No such file or directory
> /home/raulcd/code/arrow/dev/archery/archery/crossbow/core.py(45)()
{code}
We should probably pin pygit2 to <1.11.0 in the meantime. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18261) Interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh
Đinh Thanh Tùng created ARROW-18261: --- Summary: Interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh Key: ARROW-18261 URL: https://issues.apache.org/jira/browse/ARROW-18261 Project: Apache Arrow Issue Type: New Feature Reporter: Đinh Thanh Tùng See the details of the interior design sample for a one-bedroom Grand Sapphire-GS apartment for Mr. Minh at [https://thietkenoithatatz.com/thiet-ke/thiet-ke-noi-that-chung-cu-grand-sapphire-gs-can-1-phong-ngu-hien-dai/] -- This message was sent by Atlassian Jira (v8.20.10#820010)