[jira] [Created] (ARROW-10561) [Rust] Simplify `MutableBuffer::write` and `MutableBuffer::write_bytes`
Jorge Leitão created ARROW-10561:

Summary: [Rust] Simplify `MutableBuffer::write` and `MutableBuffer::write_bytes`
Key: ARROW-10561
URL: https://issues.apache.org/jira/browse/ARROW-10561
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10560) [Python] Crash when creating array with string over 2GB
Antoine Pitrou created ARROW-10560:

Summary: [Python] Crash when creating array with string over 2GB
Key: ARROW-10560
URL: https://issues.apache.org/jira/browse/ARROW-10560
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Antoine Pitrou

{code:python}
>>> import pyarrow as pa
>>> data = [b"x" * (1<<32)]
>>> arr = pa.array(data)
Segmentation fault
{code}
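One plausible reading of the 2GB threshold, offered as an assumption rather than a confirmed diagnosis: Arrow's `binary` and `string` layouts store value offsets as signed 32-bit integers, so one array chunk can address at most 2^31 - 1 bytes of data. The sketch below does not use pyarrow; `fits_int32_offsets` is a hypothetical helper that only illustrates the arithmetic.

```python
INT32_MAX = 2**31 - 1  # largest offset a signed 32-bit integer can hold


def fits_int32_offsets(value_lengths):
    """Return True if the cumulative byte offsets stay within int32 range.

    value_lengths is an iterable of per-value byte lengths (hypothetical input;
    this is not how pyarrow itself performs the check).
    """
    total = 0
    for length in value_lengths:
        total += length
        if total > INT32_MAX:
            return False
    return True


# The value from the report is 1 << 32 bytes long, well past the int32 limit.
print(fits_int32_offsets([1 << 32]))
```

Data that fails such a check would need 64-bit offsets (for example `pa.large_binary()`) or would have to be split across several chunks.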
[jira] [Created] (ARROW-10559) [Rust] [DataFusion] Break up logical_plan/mod.rs into smaller modules
Andrew Lamb created ARROW-10559:

Summary: [Rust] [DataFusion] Break up logical_plan/mod.rs into smaller modules
Key: ARROW-10559
URL: https://issues.apache.org/jira/browse/ARROW-10559
Project: Apache Arrow
Issue Type: Improvement
Reporter: Andrew Lamb
Assignee: Andrew Lamb

The module has gotten fairly large, so refactoring it into smaller chunks will improve readability, as suggested by Jorge: https://github.com/apache/arrow/pull/8619#pullrequestreview-527391221
[jira] [Created] (ARROW-10558) [Python] Filesystem S3 tests not independent (native s3 influences s3fs)
Joris Van den Bossche created ARROW-10558:

Summary: [Python] Filesystem S3 tests not independent (native s3 influences s3fs)
Key: ARROW-10558
URL: https://issues.apache.org/jira/browse/ARROW-10558
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

The filesystem tests in {{test_fs.py}} that are parametrized with all the tested filesystems share some "state" between them, at least in the case of S3. When a test first runs with our own S3FileSystem and, e.g., creates a directory, that directory is still present when we test the s3fs-wrapped filesystem. This causes some tests to pass that would fail if run in isolation.
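The leak described here can be reproduced without S3 at all. The following is a minimal sketch under stated assumptions: `FakeBucket` and `unique_prefix` are hypothetical names, not code from {{test_fs.py}}, and the in-memory set merely stands in for a bucket that two filesystem wrappers both point at.

```python
import uuid


class FakeBucket:
    """In-memory stand-in for one S3 bucket shared by several filesystem wrappers."""

    def __init__(self):
        self.dirs = set()

    def create_dir(self, path):
        self.dirs.add(path)

    def exists(self, path):
        return path in self.dirs


def unique_prefix(test_name):
    """Build a directory name that no other test run will reuse."""
    return f"{test_name}-{uuid.uuid4().hex}"


bucket = FakeBucket()

# Fixed path: the run for the first filesystem creates the directory ...
bucket.create_dir("test-directory")
# ... and the run for the second filesystem finds it without creating anything.
leaked = bucket.exists("test-directory")

# Unique paths: each parametrized run works under its own prefix.
first = unique_prefix("test_create_dir")
second = unique_prefix("test_create_dir")
bucket.create_dir(first)
isolated = not bucket.exists(second)

print(leaked, isolated)
```

Per-run prefixes are only one way to get independence; cleaning the bucket in a fixture teardown would be another.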
[jira] [Created] (ARROW-10557) [C++] Add scalar string slicing/substring kernel
Maarten Breddels created ARROW-10557:

Summary: [C++] Add scalar string slicing/substring kernel
Key: ARROW-10557
URL: https://issues.apache.org/jira/browse/ARROW-10557
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels

This should implement slicing of the scalar string values of string arrays, with Python semantics and start, stop, and step arguments. This may seem similar to slicing list or binary arrays, but string length semantics enter into this kernel: the length need not equal the number of bytes, nor the number of codepoints (accents, etc. should be skipped).
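To make the three notions of length concrete, here is a small Python sketch. It is one interpretation of "accents should be skipped", not the proposed kernel: `split_clusters` and `slice_clusters` are hypothetical helpers that attach combining marks to the preceding base character, which is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the accent to the previous base character
        else:
            clusters.append(ch)
    return clusters


def slice_clusters(text, start=None, stop=None, step=None):
    """Slice with Python start/stop/step semantics, counting clusters."""
    return "".join(split_clusters(text)[start:stop:step])


word = "cafe\u0301s"  # "cafés" with the accent as a separate combining codepoint

n_bytes = len(word.encode("utf-8"))
n_codepoints = len(word)
n_clusters = len(split_clusters(word))
print(n_bytes, n_codepoints, n_clusters)
```

With this input, plain `word[3:4]` returns the bare `e` and drops the accent, while `slice_clusters(word, 3, 4)` keeps the accent attached to its base character.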
[jira] [Created] (ARROW-10556) [C++] Caching pre computed data based on FunctionOptions in the kernel state
Maarten Breddels created ARROW-10556:

Summary: [C++] Caching pre computed data based on FunctionOptions in the kernel state
Key: ARROW-10556
URL: https://issues.apache.org/jira/browse/ARROW-10556
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Maarten Breddels

See discussion here: [https://github.com/apache/arrow/pull/8621#issuecomment-724796243]

A kernel might need to pre-compute something based on the function options passed. Since the Kernel-FunctionOptions mapping is not 1-to-1, it does not make sense to store this in the function options object.

Currently, match_substring calculates a `prefix_table` on each Exec call. In trim ([https://github.com/apache/arrow/pull/8621]) we compute a vector on each Exec call. This should be done only once and cached in the kernel state instead.
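The idea can be sketched in a few lines of Python. This is an illustration only, assuming the `prefix_table` is a Knuth-Morris-Pratt failure table; `MatchSubstringState`, `contains`, and `exec_batch` are hypothetical names and do not mirror the C++ kernel API.

```python
def build_prefix_table(pattern):
    """KMP failure table: table[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


class MatchSubstringState:
    """Hypothetical kernel state: built once from the options, reused per batch."""

    builds = 0  # counts how often the table is computed

    def __init__(self, pattern):
        self.pattern = pattern
        self.prefix_table = build_prefix_table(pattern)
        MatchSubstringState.builds += 1


def contains(text, state):
    """KMP search using the table cached in the state."""
    pattern, table = state.pattern, state.prefix_table
    if not pattern:
        return True
    k = 0
    for ch in text:
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
            if k == len(pattern):
                return True
    return False


def exec_batch(values, state):
    """Stand-in for one Exec call: no per-call precomputation happens here."""
    return [contains(v, state) for v in values]


state = MatchSubstringState("abab")
first = exec_batch(["xxababx", "abba"], state)
second = exec_batch(["ab", "aabab"], state)
print(first, second, MatchSubstringState.builds)
```

Both batches reuse the same table, so the build counter stays at one regardless of how many Exec calls follow.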
[jira] [Created] (ARROW-10555) Java API get negative messageLength
Litchy Soong created ARROW-10555:

Summary: Java API get negative messageLength
Key: ARROW-10555
URL: https://issues.apache.org/jira/browse/ARROW-10555
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 1.0.1
Reporter: Litchy Soong

When I call ArrowStreamReader.getVectorSchemaRoot(), I get:

{{: (-520103681 < 0)}}
{{2020-11-09 07:09:07,033 ERROR com.intel.analytics.zoo.serving.PreProcessing - Error stack trace}}
{{java.base/java.nio.Buffer.createCapacityException(Buffer.java:256)}}
{{java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:347)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:692)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:57)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:164)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:170)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:161)}}
{{com.intel.analytics.zoo.shaded.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:63)}}

The messageLength is negative. The code is in MessageSerializer.java:

{{messageLength = MessageSerializer.bytesToInt(buffer.array());}}

and the error is raised in:

{{ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);}}

I tried to write a minimal example that reproduces the error, but could not. So *does anyone have an idea when messageLength can be negative*? This error occurs occasionally in my program.
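One way to investigate is to look at which four bytes decode to the reported value. The sketch below assumes the length prefix is read as a little-endian signed 32-bit integer, as the Arrow IPC format specifies; `bytes_to_int_le` is a hypothetical Python counterpart to `bytesToInt`, not the Java implementation.

```python
import struct

observed_length = -520103681  # value from the error message above

# Re-encode the value as the four little-endian bytes the reader must have seen.
raw = struct.pack("<i", observed_length)
print(raw.hex())  # ffd8ffe0


def bytes_to_int_le(buf):
    """Decode the first four bytes of buf as a little-endian signed int32."""
    return struct.unpack("<i", buf[:4])[0]


# Any four bytes whose last byte has its top bit set decode to a negative length.
print(bytes_to_int_le(b"\xff\xd8\xff\xe0"))  # -520103681
```

The bytes `ff d8 ff e0` happen to match the first four bytes of a JPEG/JFIF file. That is only an observation, not a diagnosis, but it suggests checking whether the reader is occasionally handed a payload that is not an Arrow stream.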
[jira] [Created] (ARROW-10554) [Rust] Default to non-cryptographic hash function?
Ritchie created ARROW-10554:

Summary: [Rust] Default to non-cryptographic hash function?
Key: ARROW-10554
URL: https://issues.apache.org/jira/browse/ARROW-10554
Project: Apache Arrow
Issue Type: Wish
Reporter: Ritchie

This isn't a major issue, or even one at all. I was wondering how you feel about using a non-cryptographic hash function (such as SeaHash or FNV) as the default for Schema (and maybe other places in the crate). Arrow currently uses the standard library's default HashMap and HashSet. These use a cryptographic hasher to guard against DoS attacks, which I believe isn't needed most of the time. This extra security has some performance overhead.
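For readers unfamiliar with the hashers mentioned, here is a minimal sketch of 64-bit FNV-1a, written in Python for brevity. It only illustrates how little work such a function does per byte; it is not a proposal for the crate, which would instead plug a Rust hasher into `HashMap` via `with_hasher`.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1  # keep the running hash within 64 bits


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


print(hex(fnv1a_64(b"a")))
```

The function has no secret key, so an attacker who controls the keys can construct collisions; that is the trade-off the default hasher is designed to avoid.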