[jira] [Updated] (ARROW-10294) [Java] Resolve problems of DecimalVector APIs on ArrowBufs
[ https://issues.apache.org/jira/browse/ARROW-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated ARROW-10294:
------------------------------------
    Fix Version/s:     (was: 2.0.0)
                       3.0.0

> [Java] Resolve problems of DecimalVector APIs on ArrowBufs
> ----------------------------------------------------------
>
> Key: ARROW-10294
> URL: https://issues.apache.org/jira/browse/ARROW-10294
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Unlike other fixed width vectors, DecimalVectors have some APIs that directly
> manipulate an ArrowBuf (e.g. {{void set(int index, int isSet, int start, ArrowBuf buffer)}}).
> After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they
> work properly.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
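The adjustment the issue describes can be motivated with plain Java arithmetic (a sketch only; the actual DecimalVector internals differ): once an ArrowBuf can exceed 2 GiB, computing a byte offset in {{int}} silently overflows, so index math has to widen to {{long}}.

```java
public class OffsetOverflow {
    public static void main(String[] args) {
        // A decimal128 value occupies 16 bytes. Past ~134 million elements,
        // index * byteWidth no longer fits in a 32-bit int.
        int index = 200_000_000;
        int byteWidth = 16;

        int intOffset = index * byteWidth;          // wraps around: -1094967296
        long longOffset = (long) index * byteWidth; // correct: 3200000000

        System.out.println(intOffset);
        System.out.println(longOffset);
    }
}
```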
[jira] [Updated] (ARROW-9475) [Java] Clean up usages of BaseAllocator, use BufferAllocator instead
[ https://issues.apache.org/jira/browse/ARROW-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated ARROW-9475:
-----------------------------------
    Fix Version/s:     (was: 2.0.0)
                       3.0.0

> [Java] Clean up usages of BaseAllocator, use BufferAllocator instead
> --------------------------------------------------------------------
>
> Key: ARROW-9475
> URL: https://issues.apache.org/jira/browse/ARROW-9475
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Affects Versions: 0.17.0
> Reporter: Hongze Zhang
> Assignee: Hongze Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Some classes' methods use BaseAllocator, or cast BufferAllocator to
> BaseAllocator internally, instead of requiring a BufferAllocator directly,
> e.g. code in AllocationManager and BufferLedger.
> This can be improved by exposing the necessary methods on BufferAllocator.
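The cleanup can be sketched with stand-in types (the names below are hypothetical, not the actual Arrow classes): promoting the method callers need onto the interface removes the internal downcast.

```java
// Hypothetical stand-ins for the Arrow allocator types.
interface ExampleBufferAllocator {
    long getAllocatedMemory(); // exposed on the interface, not only the base class
}

class ExampleRootAllocator implements ExampleBufferAllocator {
    public long getAllocatedMemory() { return 0; }
}

public class AllocatorDemo {
    // Before the cleanup, code accepting the interface would cast to the
    // concrete base class: ((ExampleRootAllocator) allocator).getAllocatedMemory()
    static long used(ExampleBufferAllocator allocator) {
        return allocator.getAllocatedMemory(); // after: no cast needed
    }

    public static void main(String[] args) {
        System.out.println(used(new ExampleRootAllocator())); // prints 0
    }
}
```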
[jira] [Resolved] (ARROW-9475) [Java] Clean up usages of BaseAllocator, use BufferAllocator instead
[ https://issues.apache.org/jira/browse/ARROW-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-9475.
------------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 7768
[https://github.com/apache/arrow/pull/7768]
[jira] [Resolved] (ARROW-10294) [Java] Resolve problems of DecimalVector APIs on ArrowBufs
[ https://issues.apache.org/jira/browse/ARROW-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-10294.
-------------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8455
[https://github.com/apache/arrow/pull/8455]
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale updated ARROW-10236:
-----------------------------------
    Affects Version/s: 2.0.0

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -----------------------------------------------------------------------------
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Affects Versions: 2.0.0
> Reporter: Andrew Lamb
> Assignee: Andrew Lamb
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are
> designed to catch errors early, before plan execution.
> Sadly, the set of casts DataFusion considers valid is only a subset of what
> the arrow cast kernel supports. The goal of this ticket is to bring
> DataFusion to parity with the type casting supported by arrow, and allow
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this in part so that when I add support for DictionaryArray casts in
> Arrow, they also become part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as
> well.
> For more detail, see
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from
> [~jorgecarleitao]
[jira] [Resolved] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale resolved ARROW-10236.
------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 8460
[https://github.com/apache/arrow/pull/8460]
[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
[ https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale updated ARROW-10236:
-----------------------------------
    Component/s: Rust
[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)
[ https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215162#comment-17215162 ]

Neville Dipale commented on ARROW-10187:
----------------------------------------

[~andygrove] 64-bit types and offsets would also be a blocker for supporting wasm32. If someone completes ARROW-9453, perhaps we can gauge from that what effort it takes to support 32-bit.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -------------------------------------------------
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
>
> Perhaps these failures are to be expected, and perhaps we can't really support
> 32 bit?
>
> {code:java}
> ---- array::array::tests::test_primitive_array_from_vec stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9
>
> ---- array::array::tests::test_primitive_array_from_vec_option stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9
>
> ---- array::null::tests::test_null_array stdout ----
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9
>
> ---- array::union::tests::test_dense_union_i32 stdout ----
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9
>
> ---- memory::tests::test_allocate stdout ----
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215140#comment-17215140 ]

Frank Du commented on ARROW-10321:
----------------------------------

For the second issue, yes, it would be better fixed on the cmake side; I will propose a patch soon.

For the first issue, it seems the avx512 compiler flag is accepted, but the compiler does not provide the full set of avx512 intrinsic APIs. Currently we simply use check_cxx_compiler_flag for SIMD detection; it would be more accurate to use check_cxx_source_compiles to perform a try-build of some example SIMD code.

> [C++] Building AVX512 code when we should not
> ---------------------------------------------
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Neal Richardson
> Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging
> Arrow for an old macOS SDK version, we found what I believe are 2 different
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it
> was still trying to compile one of the AVX512 files, which failed. I added a
> patch that made that file conditional, but there's probably a proper cmake
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]
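The more accurate probe described above could look something like this in CMake (a sketch, not Arrow's actual cmake_modules logic; the flag and sample intrinsics are illustrative): try-building a snippet that actually uses AVX-512 intrinsics, so a compiler that accepts the flag but ships incomplete intrinsic headers is correctly rejected.

```cmake
include(CheckCXXSourceCompiles)

# check_cxx_compiler_flag only verifies the flag is accepted; try-building a
# snippet that uses the intrinsics catches compilers with incomplete support.
set(CMAKE_REQUIRED_FLAGS "-march=skylake-avx512")
check_cxx_source_compiles("
  #include <immintrin.h>
  int main() {
    __m512i x = _mm512_set1_epi32(1);
    return _mm512_reduce_add_epi32(x) == 16 ? 0 : 1;
  }" CXX_SUPPORTS_AVX512)
unset(CMAKE_REQUIRED_FLAGS)
```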
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215139#comment-17215139 ]

Neal Richardson commented on ARROW-10321:
-----------------------------------------

It may be similar to ARROW-9877, but that patch is included in 2.0.0, so it wasn't sufficient to fix this. In this case we're building against an older macOS SDK version than Homebrew generally supports.
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215136#comment-17215136 ]

Neal Richardson commented on ARROW-10321:
-----------------------------------------

The X's next to the commits go to Travis builds that failed.

Here's the first, where it builds with {{-DARROW_HAVE_RUNTIME_AVX512}} when it should not, and it fails: https://travis-ci.org/github/autobrew/homebrew-core/jobs/736104290#L723

Here's the second, where I set the runtime SIMD level to AVX2 so it didn't have that flag, yet it still tried to compile util/bpacking_avx512.cc: https://travis-ci.org/github/autobrew/homebrew-core/jobs/736173702#L725

Then, as you can see in the discussion on the PR, I applied a patch that wrapped that whole file in {{#if defined(ARROW_HAVE_RUNTIME_AVX512)}}, and the next build compiled successfully.
[jira] [Resolved] (ARROW-10174) [Java] Reading of Dictionary encoded struct vector fails
[ https://issues.apache.org/jira/browse/ARROW-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan resolved ARROW-10174.
------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

Issue resolved by pull request 8363
[https://github.com/apache/arrow/pull/8363]

> [Java] Reading of Dictionary encoded struct vector fails
> ---------------------------------------------------------
>
> Key: ARROW-10174
> URL: https://issues.apache.org/jira/browse/ARROW-10174
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 1.0.1
> Reporter: Benjamin Wilhelm
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.0
>
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> Write an index vector and a dictionary with a dictionary vector of the type
> {{Struct}} using an {{ArrowStreamWriter}}. Reading this again fails with an
> exception.
> Code to reproduce:
> {code:java}
> final RootAllocator allocator = new RootAllocator();
>
> // Create the dictionary
> final StructVector dict = StructVector.empty("Dict", allocator);
> final NullableStructWriter dictWriter = dict.getWriter();
> final IntWriter dictA = dictWriter.integer("a");
> final IntWriter dictB = dictWriter.integer("b");
> for (int i = 0; i < 3; i++) {
>     dictWriter.start();
>     dictA.writeInt(i);
>     dictB.writeInt(i);
>     dictWriter.end();
> }
> dict.setValueCount(3);
> final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, false, null));
>
> // Create the vector
> final Random random = new Random();
> final StructVector vector = StructVector.empty("Dict", allocator);
> final NullableStructWriter vectorWriter = vector.getWriter();
> final IntWriter vectorA = vectorWriter.integer("a");
> final IntWriter vectorB = vectorWriter.integer("b");
> for (int i = 0; i < 10; i++) {
>     int v = random.nextInt(3);
>     vectorWriter.start();
>     vectorA.writeInt(v);
>     vectorB.writeInt(v);
>     vectorWriter.end();
> }
> vector.setValueCount(10);
>
> // Encode the vector using the dictionary
> final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, dictionary);
>
> // Write the vector to out
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final VectorSchemaRoot root = new VectorSchemaRoot(
>     Collections.singletonList(indexVector.getField()),
>     Collections.singletonList(indexVector));
> final ArrowStreamWriter writer = new ArrowStreamWriter(root,
>     new MapDictionaryProvider(dictionary), Channels.newChannel(out));
> writer.start();
> writer.writeBatch();
> writer.end();
>
> // Read the vector from out
> try (final ArrowStreamReader reader = new ArrowStreamReader(
>         new ByteArrayInputStream(out.toByteArray()), allocator)) {
>     reader.loadNextBatch();
>     final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
>     final FieldVector readIndexVector = readRoot.getVector(0);
>
>     // Get the dictionary and decode
>     final Map<Long, Dictionary> readDictionaryMap = reader.getDictionaryVectors();
>     final Dictionary readDictionary =
>         readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
>     final ValueVector readVector = DictionaryEncoder.decode(readIndexVector, readDictionary);
> }
> {code}
> Exception:
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed.
> nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, nullCount=0]]
> buffers: [ArrowBuf[21], address:140118352739688, length:1,
>           ArrowBuf[22], address:140118352739696, length:12,
>           ArrowBuf[23], address:140118352739712, length:1,
>           ArrowBuf[24], address:140118352739720, length:12]
>     at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
>     at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
>     at org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
>     at org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
>     at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
> {code}
> If I see it correctly, the error happens in {{DictionaryUtilities#toMessageFormat}}.
> When a dictionary-encoded vector is encountered, the children of the
> memory-format field are still used (none, because the index type is Int).
> However, the children of the dictionary vector's field should be mapped to
> the message format and set as children.
> I can create a fix and open a pull request.
[jira] [Updated] (ARROW-10323) [Release][wheel] Add missing verification setup step
[ https://issues.apache.org/jira/browse/ARROW-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10323:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Release][wheel] Add missing verification setup step
> ----------------------------------------------------
>
> Key: ARROW-10323
> URL: https://issues.apache.org/jira/browse/ARROW-10323
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Minor
> Labels: pull-request-available
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215117#comment-17215117 ]

Yibo Cai commented on ARROW-10321:
----------------------------------

Maybe a similar problem to ARROW-9877? I didn't find build logs at the autobrew PR link.
[jira] [Created] (ARROW-10323) [Release][wheel] Add missing verification setup step
Kouhei Sutou created ARROW-10323:
------------------------------------

Summary: [Release][wheel] Add missing verification setup step
Key: ARROW-10323
URL: https://issues.apache.org/jira/browse/ARROW-10323
Project: Apache Arrow
Issue Type: Improvement
Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-10322) [C++][Dataset] Minimize Expression to a wrapper around compute::Function
Ben Kietzman created ARROW-10322:
------------------------------------

Summary: [C++][Dataset] Minimize Expression to a wrapper around compute::Function
Key: ARROW-10322
URL: https://issues.apache.org/jira/browse/ARROW-10322
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
Fix For: 3.0.0

The Expression class hierarchy was originally intended to provide generic, structured representations of compute functionality. On the former point, they have been superseded by compute::{Function, Kernel, ...}, which encapsulate validation and execution. In light of this, Expression can be drastically simplified and improved by composition with these classes. Each responsibility that can be deferred implies less boilerplate when exposing a new compute function for use in datasets. Ideally any compute function will be immediately available for use in a filter or projection.

{code}
struct Expression {
  using Literal = std::shared_ptr<Scalar>;

  struct Projection {
    std::vector<std::string> names;
    std::vector<Expression> values;
  };

  struct Call {
    std::shared_ptr<compute::Function> function;
    std::shared_ptr<compute::FunctionOptions> options;
    std::vector<Expression> arguments;
  };

  util::variant<Literal, Projection, Call> value;
};
{code}

A simple discriminated union as above should be sufficient to represent arbitrary filters and projections: any expression which results in type {{bool}} is a valid filter, and any expression which is a {{Projection}} may be used to map one record batch to another. Expression simplification (currently implemented in {{Expression::Assume}}) is an optimization used, for example, in predicate pushdown, and therefore need not exhaustively cover the full space of available compute functions.
[jira] [Updated] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
[ https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-10106:
-----------------------------------
    Labels: flight pull-request-available  (was: flight)

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> -----------------------------------------------------------------------
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
> Issue Type: Improvement
> Components: FlightRPC, Java
> Reporter: James Duong
> Assignee: James Duong
> Priority: Major
> Labels: flight, pull-request-available
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> OutboundStreamListener has a method isReady() that FlightProducers need to
> poll during implementations of getStream() to avoid buffering too much data.
> An enhancement would be to allow setting a callback to run (for example,
> notifying a CountdownLatch) so that FlightProducer implementations don't need
> to busy wait.
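The proposed enhancement can be sketched with plain {{java.util.concurrent}} types (the listener interface below is a hypothetical stand-in; the real OutboundStreamListener API may differ): a handler fires when the stream becomes writable, so the producer blocks on a latch instead of spinning on isReady().

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for OutboundStreamListener.
interface StreamListener {
    void setOnReadyHandler(Runnable handler);
    void signalReady(); // the transport would invoke this when the buffer drains
}

class SimpleListener implements StreamListener {
    private Runnable handler = () -> {};
    public void setOnReadyHandler(Runnable h) { handler = h; }
    public void signalReady() { handler.run(); }
}

public class ReadyCallbackDemo {
    public static void main(String[] args) throws InterruptedException {
        SimpleListener listener = new SimpleListener();
        CountDownLatch ready = new CountDownLatch(1);
        listener.setOnReadyHandler(ready::countDown);

        // Instead of: while (!listener.isReady()) { /* busy wait */ }
        new Thread(listener::signalReady).start();
        ready.await(); // blocks until the callback fires
        System.out.println("stream ready");
    }
}
```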
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215082#comment-17215082 ]

Neal Richardson commented on ARROW-10321:
-----------------------------------------

Sure, though all of the build logs and discussion are available on the PR I linked.
[jira] [Updated] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
[ https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Duong updated ARROW-10106:
--------------------------------
    Labels: flight  (was: )
[jira] [Assigned] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
[ https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Duong reassigned ARROW-10106:
-----------------------------------
    Assignee: James Duong
[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not
[ https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215069#comment-17215069 ]

Antoine Pitrou commented on ARROW-10321:
----------------------------------------

> The check for AVX512 support was returning true when in fact the compiler did
> not support it

You should point to the logs so that we can look at the symptoms.
[jira] [Created] (ARROW-10321) [C++] Building AVX512 code when we should not
Neal Richardson created ARROW-10321:
---------------------------------------

Summary: [C++] Building AVX512 code when we should not
Key: ARROW-10321
URL: https://issues.apache.org/jira/browse/ARROW-10321
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Neal Richardson

On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging Arrow for an old macOS SDK version, we found what I believe are 2 different problems:

1. The check for AVX512 support was returning true when in fact the compiler did not support it
2. Even when we manually set the runtime SIMD level to less-than-AVX512, it was still trying to compile one of the AVX512 files, which failed. I added a patch that made that file conditional, but there's probably a proper cmake way to tell it not to compile that file at all

cc [~yibo] [~apitrou]
[jira] [Updated] (ARROW-9747) [C++][Java][Format] Support Decimal256 Type
[ https://issues.apache.org/jira/browse/ARROW-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-9747:
----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Java][Format] Support Decimal256 Type
> -------------------------------------------
>
> Key: ARROW-9747
> URL: https://issues.apache.org/jira/browse/ARROW-9747
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Format, Java
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
> Labels: pull-request-available
>
> Time Spent: 10m
> Remaining Estimate: 0h
[jira] [Updated] (ARROW-10308) [Python] read_csv from python is slow on some work loads
[ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-10308:
---------------------------------
    Summary: [Python] read_csv from python is slow on some work loads  (was: read_csv from python is slow on some work loads)

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
> Reporter: Dror Speiser
> Priority: Minor
> Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads,
> processing data around 0.5GiB/s. "Real workloads" means many string, float,
> and all-null columns, and large file sizes (5-10GiB), though the file size
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of
> the time is spent on shared pointer lock mechanisms (though I'm not sure if
> this is to be trusted). I've attached the dumps in svg format.
> I've also attached a script and a Dockerfile to run a benchmark, which
> reproduces the speeds I see. Building the docker image and running it on a
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around
> 0.5GiB/s.
> This is all also available here: https://github.com/drorspei/arrow-csv-benchmark
[jira] [Updated] (ARROW-10276) [Python] Armv7 orc and flight not supported for build. Compat error on using with spark
[ https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-10276: - Summary: [Python] Armv7 orc and flight not supported for build. Compat error on using with spark (was: Armv7 orc and flight not supported for build. Compat error on using with spark) > [Python] Armv7 orc and flight not supported for build. Compat error on using > with spark > --- > > Key: ARROW-10276 > URL: https://issues.apache.org/jira/browse/ARROW-10276 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.17.0 >Reporter: utsav >Priority: Major > Attachments: arrow_compat_error, build_pip_wheel.sh, > dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh > > > I'm using an Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have > tried to use it for the Raspberry Pi 3 without luck in previous posts. > I figured out how to successfully build it for armv7 using the script below > but cannot use the orc and flight flags. People had looked into it in ARROW-8420 > but I don't know if they faced these issues. > I tried converting a Spark dataframe to pandas using pyarrow but now it > complains about a compat feature. I have attached images below. > Any help would be appreciated. Thanks. > Spark Version: 2.4.5. > The code is as follows: > ``` > import pandas as pd > df_pd = df.toPandas() > npArr = df_pd.to_numpy() > ``` > The error is as follows: > ``` > /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas > attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is > set to true; however, failed by the reason below: > module 'pyarrow' has no attribute 'compat' > Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' > is set to true. > warnings.warn(msg) > ``` > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10301) [C++] Add "all" boolean reducing kernel
[ https://issues.apache.org/jira/browse/ARROW-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10301: --- Labels: analytics pull-request-available (was: analytics) > [C++] Add "all" boolean reducing kernel > --- > > Key: ARROW-10301 > URL: https://issues.apache.org/jira/browse/ARROW-10301 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Andrew Wieteska >Assignee: Andrew Wieteska >Priority: Major > Labels: analytics, pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > As discussed on GitHub: > [https://github.com/apache/arrow/pull/8294#discussion_r504034461] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string
[ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-10145: -- Fix Version/s: (was: 2.0.0) 3.0.0 > [C++][Dataset] Assert integer overflow in partitioning falls back to string > --- > > Key: ARROW-10145 > URL: https://issues.apache.org/jira/browse/ARROW-10145 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset > Small reproducer: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({'part': [3760212050]*10, 'col': range(10)}) > pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part']) > In [35]: pq.read_table("test_int64_partition/") > ... > ArrowInvalid: error parsing '3760212050' as scalar of type int32 > In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this) > In ../src/arrow/dataset/partition.cc, line 218, code: > (_error_or_value26).status() > In ../src/arrow/dataset/partition.cc, line 229, code: > (_error_or_value27).status() > In ../src/arrow/dataset/discovery.cc, line 256, code: > (_error_or_value17).status() > In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True) > Out[36]: > pyarrow.Table > col: int64 > part: dictionary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string
[ https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-10145. --- Fix Version/s: (was: 3.0.0) 2.0.0 Resolution: Fixed Issue resolved by pull request 8462 [https://github.com/apache/arrow/pull/8462] > [C++][Dataset] Assert integer overflow in partitioning falls back to string > --- > > Key: ARROW-10145 > URL: https://issues.apache.org/jira/browse/ARROW-10145 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 2.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > From > https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset > Small reproducer: > {code} > import pyarrow as pa > import pyarrow.parquet as pq > table = pa.table({'part': [3760212050]*10, 'col': range(10)}) > pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part']) > In [35]: pq.read_table("test_int64_partition/") > ... > ArrowInvalid: error parsing '3760212050' as scalar of type int32 > In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this) > In ../src/arrow/dataset/partition.cc, line 218, code: > (_error_or_value26).status() > In ../src/arrow/dataset/partition.cc, line 229, code: > (_error_or_value27).status() > In ../src/arrow/dataset/discovery.cc, line 256, code: > (_error_or_value17).status() > In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True) > Out[36]: > pyarrow.Table > col: int64 > part: dictionary > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10320) Convert RecordBatchIterator to a Stream
[ https://issues.apache.org/jira/browse/ARROW-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10320: --- Labels: pull-request-available (was: ) > Convert RecordBatchIterator to a Stream > --- > > Key: ARROW-10320 > URL: https://issues.apache.org/jira/browse/ARROW-10320 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Jorge Leitão >Assignee: Jorge Leitão >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > So that the unit of work is a single record batch instead of a part of a > partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10320) Convert RecordBatchIterator to a Stream
Jorge Leitão created ARROW-10320: Summary: Convert RecordBatchIterator to a Stream Key: ARROW-10320 URL: https://issues.apache.org/jira/browse/ARROW-10320 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Leitão Assignee: Jorge Leitão Fix For: 3.0.0 So that the unit of work is a single record batch instead of a part of a partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-9479) [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto Table.empty
[ https://issues.apache.org/jira/browse/ARROW-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette resolved ARROW-9479. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 7771 [https://github.com/apache/arrow/pull/7771] > [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto > Table.empty > -- > > Key: ARROW-9479 > URL: https://issues.apache.org/jira/browse/ARROW-9479 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.17.1 >Reporter: Nicholas Roberts >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Deserializing zero-item tables (as generated by Table.empty or, in this case, > pyarrow.Schema.serialize) with a schema containing a List, FixedSizeList or Map > fails due to an unconditional > {code:java} > new Data(/* preceding parameters */ buffers, [childData]){code} > statement; the childData parameter resolves to [undefined] rather than the > desired []. > See [https://github.com/apache/arrow/pull/7771] for further details. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10319) [Flight][Go] Add Context to Client Auth Handler functions for Flight
[ https://issues.apache.org/jira/browse/ARROW-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10319: --- Labels: pull-request-available (was: ) > [Flight][Go] Add Context to Client Auth Handler functions for Flight > > > Key: ARROW-10319 > URL: https://issues.apache.org/jira/browse/ARROW-10319 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC, Go >Reporter: Matt Topol >Assignee: Matt Topol >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > During my usage I found that if I wanted to reuse an existing flight client > that required authentication, it was difficult to reuse the auth handler, > since there wasn't a way to tell which goroutine / which auth made a > particular request. Passing the context to the client auth handler > allows passing information to the auth handler via the context, which > consumers can then use to share an auth handler, and thus an > entire flight client, across multiple goroutines if desired. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10319) [Flight][Go] Add Context to Client Auth Handler functions for Flight
Matt Topol created ARROW-10319: -- Summary: [Flight][Go] Add Context to Client Auth Handler functions for Flight Key: ARROW-10319 URL: https://issues.apache.org/jira/browse/ARROW-10319 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Go Reporter: Matt Topol Assignee: Matt Topol During my usage I found that if I wanted to reuse an existing flight client that required authentication, it was difficult to reuse the auth handler, since there wasn't a way to tell which goroutine / which auth made a particular request. Passing the context to the client auth handler allows passing information to the auth handler via the context, which consumers can then use to share an auth handler, and thus an entire flight client, across multiple goroutines if desired. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8113) [C++] Implement a lighter-weight variant
[ https://issues.apache.org/jira/browse/ARROW-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8113: -- Labels: pull-request-available (was: ) > [C++] Implement a lighter-weight variant > > > Key: ARROW-8113 > URL: https://issues.apache.org/jira/browse/ARROW-8113 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > {{util::variant}} is an extremely useful structure but its header slows > compilation significantly, so using it in public headers is questionable > https://github.com/apache/arrow/pull/6545#discussion_r388406246 > I'll try writing a lighter weight version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10318) [C++] Use pimpl idiom in CSV parser
Antoine Pitrou created ARROW-10318: -- Summary: [C++] Use pimpl idiom in CSV parser Key: ARROW-10318 URL: https://issues.apache.org/jira/browse/ARROW-10318 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8845) [C++] Selective compression on the wire
[ https://issues.apache.org/jira/browse/ARROW-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8845: -- Component/s: C++ > [C++] Selective compression on the wire > --- > > Key: ARROW-8845 > URL: https://issues.apache.org/jira/browse/ARROW-8845 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: Amol Umbarkar >Priority: Major > > Dask seems to selectively apply compression when it is found to be useful. They > pick a 10kb sample upfront to test compression, and if the > results are good then the whole batch is compressed. This seems to save > de-compression effort on the receiver side. > > Please take a look at > [https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression] > > Thought this could be relevant to Arrow batch transfers as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8845) [C++] Selective compression on the wire
[ https://issues.apache.org/jira/browse/ARROW-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-8845: -- Summary: [C++] Selective compression on the wire (was: [c++] Selective compression on the wire) > [C++] Selective compression on the wire > --- > > Key: ARROW-8845 > URL: https://issues.apache.org/jira/browse/ARROW-8845 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: Amol Umbarkar >Priority: Major > > Dask seems to selectively apply compression when it is found to be useful. They > pick a 10kb sample upfront to test compression, and if the > results are good then the whole batch is compressed. This seems to save > de-compression effort on the receiver side. > > Please take a look at > [https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression] > > Thought this could be relevant to Arrow batch transfers as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
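The Dask-style heuristic described above can be sketched in a few lines. This is an illustration of the idea, not an Arrow API: the sample size, threshold, and codec (zlib) are all assumptions for demonstration.

```python
# Sample-based selective compression: compress a small prefix first, and
# only compress the full payload if the sample shrinks enough.
import os
import zlib

SAMPLE_SIZE = 10_000  # ~10 kB sample, following the Dask write-up
MIN_RATIO = 0.9       # only compress if the sample shrinks below 90%

def maybe_compress(payload: bytes):
    """Return (data, was_compressed) using a sample-based decision."""
    sample = payload[:SAMPLE_SIZE]
    if len(zlib.compress(sample)) < MIN_RATIO * len(sample):
        return zlib.compress(payload), True
    return payload, False  # incompressible: skip the CPU cost on both ends

text, text_flag = maybe_compress(b"abc" * 50_000)        # highly repetitive
noise, noise_flag = maybe_compress(os.urandom(150_000))  # incompressible
```

The payoff is exactly what the report describes: incompressible batches skip both the compression cost on the sender and the decompression cost on the receiver.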
[jira] [Commented] (ARROW-10314) [C++] CSV wrong row number in error message
[ https://issues.apache.org/jira/browse/ARROW-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214835#comment-17214835 ] Antoine Pitrou commented on ARROW-10314: Unfortunately, when reading in multi-threaded mode, several blocks are being parsed at once, and you don't know how many rows the previous blocks contain. We would have to keep the error in memory for later, until the block length is resolved. > [C++] CSV wrong row number in error message > --- > > Key: ARROW-10314 > URL: https://issues.apache.org/jira/browse/ARROW-10314 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.1 >Reporter: Maciej >Priority: Major > > When I try to read CSV file with wrong data, I get message like: > {code:java} > CSV file reader error: Invalid: In CSV column #0: CSV conversion error to > timestamp[s]: invalid value '1' > {code} > Would be very helpful to add information about row with wrong data e.g. > {code:java} > CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV > conversion error to timestamp[s]: invalid value '1' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10315) [C++] CSV skip wrong rows
[ https://issues.apache.org/jira/browse/ARROW-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214832#comment-17214832 ] Antoine Pitrou commented on ARROW-10315: Skipping rows entirely will be difficult. We could add an option to emit nulls in that case, though. What do you think? > [C++] CSV skip wrong rows > - > > Key: ARROW-10315 > URL: https://issues.apache.org/jira/browse/ARROW-10315 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 1.0.1 >Reporter: Maciej >Priority: Major > > It would be helpful to add another option to {color:#267f99}ReadOptions > {color}which will enable skipping rows with wrong data (e.g. data type > mismatch with column type) and continue reading next rows. Wrong rows numbers > may be reported at the end of processing. > This way I can deal with the wrongly formatted data or ignore it if I have a > large load success rate and I don’t care about the exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
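Until something like this lands in the reader itself, the requested behavior can be approximated on the user side. The sketch below (standard library only; the function name and type map are illustrative, not an Arrow API) pre-scans the CSV, records the line numbers of rows that fail conversion, and keeps the rest:

```python
# Pre-validate rows before handing the file to a stricter reader:
# bad rows are skipped and their line numbers reported, as requested above.
import csv
import io

def split_bad_rows(text: str, column_types: dict):
    """Return (good_rows, bad_line_numbers) for a header-less CSV.

    column_types maps a column index to a converter (e.g. int, float)
    whose failure marks the row as bad.
    """
    good, bad = [], []
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        try:
            for idx, conv in column_types.items():
                conv(row[idx])         # e.g. int("oops") raises ValueError
            good.append(row)
        except (ValueError, IndexError):
            bad.append(lineno)         # report instead of failing the read
    return good, bad

rows = "1,a\n2,b\noops,c\n4,d\n"
good, bad = split_bad_rows(rows, {0: int})
```

This addresses both requests at once: bad rows are skipped, and their line numbers are available at the end of processing.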
[jira] [Created] (ARROW-10317) [C++] Consider adding documentation for FunctionOption classes
Antoine Pitrou created ARROW-10317: -- Summary: [C++] Consider adding documentation for FunctionOption classes Key: ARROW-10317 URL: https://issues.apache.org/jira/browse/ARROW-10317 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 3.0.0 This would allow generating improved documentation for bindings (e.g. Python). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10316) [Python] Consider using __wrapped__ for compute function introspection
Antoine Pitrou created ARROW-10316: -- Summary: [Python] Consider using __wrapped__ for compute function introspection Key: ARROW-10316 URL: https://issues.apache.org/jira/browse/ARROW-10316 Project: Apache Arrow Issue Type: Task Components: Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou As suggested by [~bkietz] here: https://github.com/apache/arrow/pull/8457#discussion_r504966207 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10316) [Python] Consider using __wrapped__ for compute function introspection
[ https://issues.apache.org/jira/browse/ARROW-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-10316: --- Fix Version/s: 3.0.0 > [Python] Consider using __wrapped__ for compute function introspection > -- > > Key: ARROW-10316 > URL: https://issues.apache.org/jira/browse/ARROW-10316 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Fix For: 3.0.0 > > > As suggested by [~bkietz] here: > https://github.com/apache/arrow/pull/8457#discussion_r504966207 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4804) [Rust] Read temporal values from CSV
[ https://issues.apache.org/jira/browse/ARROW-4804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-4804: -- Labels: beginner (was: ) > [Rust] Read temporal values from CSV > > > Key: ARROW-4804 > URL: https://issues.apache.org/jira/browse/ARROW-4804 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Affects Versions: 0.12.0 >Reporter: Neville Dipale >Priority: Major > Labels: beginner > > CSV reader should support reading temporal values. > Should support timestamp, date and time, with sane defaults provided for > schema inference. > To keep inference performant, the user should provide a Vec of which > columns to try to convert to a temporal array -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4803) [Rust] Read temporal values from JSON
[ https://issues.apache.org/jira/browse/ARROW-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-4803: -- Labels: beginner (was: ) > [Rust] Read temporal values from JSON > - > > Key: ARROW-4803 > URL: https://issues.apache.org/jira/browse/ARROW-4803 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Affects Versions: 0.12.0 >Reporter: Neville Dipale >Priority: Major > Labels: beginner > > Ability to parse strings that look like timestamps to timestamp type. Need to > consider whether only timestamp type should be supported as most JSON > libraries stick to ISO8601. It might also be inefficient to use regex for > timestamps, so the user should provide a hint of which columns to convert to > timestamps -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output
[ https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-9911: -- Labels: beginner (was: ) > [Rust][DataFusion] SELECT with no FROM clause should produce a > single row of output > > > Key: ARROW-9911 > URL: https://issues.apache.org/jira/browse/ARROW-9911 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andrew Lamb >Priority: Minor > Labels: beginner > > This is somewhat of a special case, but it is useful for demonstration / > testing expressions. > A SELECT expression with no FROM clause, such as "select 1", should produce a > single row. Today DataFusion accepts the query but produces no rows. > Actual output: > {code} > arrow/rust$ cargo run --release --bin datafusion-cli > Finished release [optimized] target(s) in 0.25s > Running `target/release/datafusion-cli` > > select 1 ; > 0 rows in set. Query took 0 seconds. > {code} > Expected output is a single row, with the value 1. Here is an example using > SQLite: > {code} > $ sqlite3 > SQLite version 3.28.0 2019-04-15 14:49:49 > Enter ".help" for usage hints. > Connected to a transient in-memory database. > Use ".open FILENAME" to reopen on a persistent database. > sqlite> select 1; > 1 > sqlite> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output
[ https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove updated ARROW-9911: -- Component/s: Rust - DataFusion Rust > [Rust][DataFusion] SELECT with no FROM clause should produce a > single row of output > > > Key: ARROW-9911 > URL: https://issues.apache.org/jira/browse/ARROW-9911 > Project: Apache Arrow > Issue Type: Bug > Components: Rust, Rust - DataFusion >Reporter: Andrew Lamb >Priority: Minor > > This is somewhat of a special case, but it is useful for demonstration / > testing expressions. > A SELECT expression with no FROM clause, such as "select 1", should produce a > single row. Today DataFusion accepts the query but produces no rows. > Actual output: > {code} > arrow/rust$ cargo run --release --bin datafusion-cli > Finished release [optimized] target(s) in 0.25s > Running `target/release/datafusion-cli` > > select 1 ; > 0 rows in set. Query took 0 seconds. > {code} > Expected output is a single row, with the value 1. Here is an example using > SQLite: > {code} > $ sqlite3 > SQLite version 3.28.0 2019-04-15 14:49:49 > Enter ".help" for usage hints. > Connected to a transient in-memory database. > Use ".open FILENAME" to reopen on a persistent database. > sqlite> select 1; > 1 > sqlite> > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
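The expected single-row semantics can also be checked programmatically against SQLite via Python's standard library, the same engine as the CLI session quoted above:

```python
# A FROM-less SELECT produces exactly one row in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("SELECT 1").fetchall()
conn.close()
print(rows)  # -> [(1,)]
```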
[jira] [Commented] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214762#comment-17214762 ] Neal Richardson commented on ARROW-10305: - In terms of compute kernels, I think at least some of the pattern matching and extraction is happening in ARROW-10195. Another missing piece, which [~bkietz] was writing up a JIRA for, is being able to create dataset expressions that call any arbitrary compute function. > [C++][R] Filter datasets with string expressions > > > Key: ARROW-10305 > URL: https://issues.apache.org/jira/browse/ARROW-10305 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Pal >Priority: Major > > Hi, > Some expressions, such as substr(), grepl(), str_detect() or others, are not > supported while filtering a dataset (after open_dataset()). Specifically, > the code below: > {code:java} > library(dplyr) > library(arrow) > data = data.frame(a = c("a", "a2", "a3")) > write_parquet(data, "Test_filter/data.parquet") > ds <- open_dataset("Test_filter/") > data_flt <- ds %>% > filter(substr(a, 1, 1) == "a") > {code} > gives this error: > {code:java} > Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == > "a" > Call collect() first to pull data into R.{code} > These expressions may be very helpful, not to say necessary, to filter and > collect a very large dataset. Is there anything that can be done to implement > this new feature? > Thank you. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10315) [C++] CSV skip wrong rows
Maciej created ARROW-10315: -- Summary: [C++] CSV skip wrong rows Key: ARROW-10315 URL: https://issues.apache.org/jira/browse/ARROW-10315 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 1.0.1 Reporter: Maciej It would be helpful to add another option to {color:#267f99}ReadOptions{color} which would enable skipping rows with wrong data (e.g. data type mismatch with column type) and continue reading the next rows. Wrong row numbers may be reported at the end of processing. This way I can deal with the wrongly formatted data or ignore it if I have a large load success rate and I don’t care about the exceptions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10314) [C++] CSV wrong row number in error message
Maciej created ARROW-10314: -- Summary: [C++] CSV wrong row number in error message Key: ARROW-10314 URL: https://issues.apache.org/jira/browse/ARROW-10314 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 1.0.1 Reporter: Maciej When I try to read a CSV file with wrong data, I get a message like: {code:java} CSV file reader error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '1' {code} It would be very helpful to add information about the row with the wrong data, e.g. {code:java} CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV conversion error to timestamp[s]: invalid value '1' {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion
[ https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10313: --- Labels: pull-request-available (was: ) > [C++] Improve UTF8 validation speed and CSV string conversion > - > > Key: ARROW-10313 > URL: https://issues.apache.org/jira/browse/ARROW-10313 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 3.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV > string conversion. > This is because we must validate many small UTF8 strings individually. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion
Antoine Pitrou created ARROW-10313: -- Summary: [C++] Improve UTF8 validation speed and CSV string conversion Key: ARROW-10313 URL: https://issues.apache.org/jira/browse/ARROW-10313 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 3.0.0 Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV string conversion. This is because we must validate many small UTF8 strings individually. -- This message was sent by Atlassian Jira (v8.3.4#803005)
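The per-string cost can be illustrated outside Arrow. The sketch below uses Python's UTF-8 decoder purely as a stand-in for a validator (not Arrow's implementation): validating many small strings one call at a time pays fixed overhead per call that a single pass over one large buffer avoids, which is the shape of the bottleneck described above.

```python
# Compare per-string validation calls against one bulk validation call.
import time

chunks = [("field%d" % i).encode() for i in range(200_000)]
joined = b"".join(chunks)

t0 = time.perf_counter()
for c in chunks:
    c.decode("utf-8")          # one validation call per small string
per_string = time.perf_counter() - t0

t0 = time.perf_counter()
joined.decode("utf-8")         # one validation call over the whole buffer
bulk = time.perf_counter() - t0

print(f"per-string: {per_string:.4f}s, bulk: {bulk:.4f}s")
```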
[jira] [Updated] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.
[ https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10122: --- Labels: pull-request-available (was: ) > [Python] Selecting one column of multi-index results in a duplicated value > column. > -- > > Key: ARROW-10122 > URL: https://issues.apache.org/jira/browse/ARROW-10122 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 1.0.1 > Environment: arrow 1.0.1 > parquet 1.5.1 > pandas 1.1.0 > pyarrow 1.0.1 >Reporter: Troy Zimmerman >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When I read one column of a multi-index, that column is duplicated as a value > column in the resulting Pandas data frame. > {code:python} > >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), > ... "value": np.arange(5)}) > >>> df = table.to_pandas().set_index(["first", "second"]) > >>> print(df) > value > first second > 0 0 0 > 1 1 1 > 2 2 2 > 3 3 3 > 4 4 4 > >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet") > >>> data = ds.dataset("/tmp/test.parquet") > {code} > This works as expected, as does selecting all or no columns. > {code:python} > >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas()) > value > first second > 0 0 0 > 1 1 1 > 2 2 2 > 3 3 3 > 4 4 4 > {code} > This does not work as expected, as the {{first}} column is both an index and > a value. > {code:python} > >>> print(data.to_table(columns=["first", "value"]).to_pandas()) >first value > first > 0 0 0 > 1 1 1 > 2 2 2 > 3 3 3 > 4 4 4{code} > This is easy to work around by specifying the full multi-index in > {{to_table}}, but does this behavior make sense? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.
[ https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214652#comment-17214652 ] Joris Van den Bossche commented on ARROW-10122:
---
A reproducer without involving parquet or dataset:

{code}
In [28]: table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})

In [29]: df = table.to_pandas().set_index(["first", "second"])

In [30]: table = pa.Table.from_pandas(df)

In [31]: table.to_pandas()
Out[31]:
              value
first second
0     0          0
1     1          1
2     2          2
3     3          3
4     4          4

In [32]: table.select(["first", "value"]).to_pandas()
Out[32]:
       first  value
first
0          0      0
1          1      1
2          2      2
3          3      3
4          4      4
{code}

> [Python] Selecting one column of multi-index results in a duplicated value column.
> --
>
> Key: ARROW-10122
> URL: https://issues.apache.org/jira/browse/ARROW-10122
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Environment: arrow 1.0.1, parquet 1.5.1, pandas 1.1.0, pyarrow 1.0.1
> Reporter: Troy Zimmerman
> Priority: Minor
>
> When I read one column of a multi-index, that column is duplicated as a value column in the resulting Pandas data frame.
> {code:python}
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), "value": np.arange(5)})
> >>> df = table.to_pandas().set_index(["first", "second"])
> >>> print(df)
>               value
> first second
> 0     0          0
> 1     1          1
> 2     2          2
> 3     3          3
> 4     4          4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>               value
> first second
> 0     0          0
> 1     1          1
> 2     2          2
> 3     3          3
> 4     4          4
> {code}
> This does not work as expected, as the {{first}} column is both an index and a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>        first  value
> first
> 0          0      0
> 1          1      1
> 2          2      2
> 3          3      3
> 4          4      4
> {code}
> This is easy to work around by specifying the full multi-index in {{to_table}}, but does this behavior make sense?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
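The duplication above suggests that, when rebuilding the Pandas frame, a selected column that the table's pandas metadata records as an index level is also being kept as a value column. A minimal sketch of the intended split, using a hypothetical helper name (`split_selected_columns` is not pyarrow API):

```python
# Hypothetical helper (not part of pyarrow): given the columns a caller
# selected and the index columns recorded in the table's pandas metadata,
# each selected column should land in exactly one place -- as an index
# level if the metadata says so, otherwise as a value column.
def split_selected_columns(selected, index_columns):
    index = [c for c in selected if c in index_columns]
    values = [c for c in selected if c not in index_columns]
    return index, values

# For the report's example: "first" becomes an index level only,
# never duplicated into the value columns.
idx, vals = split_selected_columns(["first", "value"], ["first", "second"])
```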
[jira] [Created] (ARROW-10312) Implement unix_timestamp function in gandiva
Naman Udasi created ARROW-10312:
---

Summary: Implement unix_timestamp function in gandiva
Key: ARROW-10312
URL: https://issues.apache.org/jira/browse/ARROW-10312
Project: Apache Arrow
Issue Type: New Feature
Components: C++ - Gandiva
Reporter: Naman Udasi

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
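The ticket does not spell out the expected semantics; in Hive/Spark SQL, `unix_timestamp` converts a timestamp to whole seconds since the Unix epoch, and a Gandiva implementation would presumably mirror that. A reference sketch in Python (illustrative only, not Gandiva code):

```python
from datetime import datetime, timezone

# Assumed Hive/Spark-style semantics: whole seconds (not milliseconds)
# elapsed since 1970-01-01 00:00:00 UTC.
def unix_timestamp(ts: datetime) -> int:
    if ts.tzinfo is None:  # treat naive timestamps as UTC
        ts = ts.replace(tzinfo=timezone.utc)
    return int(ts.timestamp())

unix_timestamp(datetime(1970, 1, 1))   # 0
unix_timestamp(datetime(2000, 1, 1))   # 946684800
```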
[jira] [Updated] (ARROW-10311) [Release] Update crossbow verification process
[ https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10311:
---
Labels: pull-request-available (was: )

> [Release] Update crossbow verification process
> --
>
> Key: ARROW-10311
> URL: https://issues.apache.org/jira/browse/ARROW-10311
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Developer Tools
> Reporter: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The automated crossbow RC verification tasks need to be updated since multiple builds are failing.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-10311) [Release] Update crossbow verification process
Krisztian Szucs created ARROW-10311:
---

Summary: [Release] Update crossbow verification process
Key: ARROW-10311
URL: https://issues.apache.org/jira/browse/ARROW-10311
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Krisztian Szucs
Fix For: 3.0.0

The automated crossbow RC verification tasks need to be updated since multiple builds are failing.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-10306) [C++] Add string replacement kernel
[ https://issues.apache.org/jira/browse/ARROW-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10306:
---
Labels: pull-request-available (was: )

> [C++] Add string replacement kernel
> --
>
> Key: ARROW-10306
> URL: https://issues.apache.org/jira/browse/ARROW-10306
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Maarten Breddels
> Assignee: Maarten Breddels
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Similar to [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html] with a plain variant, and optionally a RE2 version.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
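For reference, the two behaviours the ticket asks for can be sketched with Python's standard library standing in for the proposed C++ kernels (Arrow would use RE2 rather than Python's `re` for the regex variant, and the function names here are illustrative):

```python
import re

def replace_substring(s: str, pattern: str, replacement: str) -> str:
    # Plain variant: every literal occurrence of the pattern is replaced.
    return s.replace(pattern, replacement)

def replace_substring_regex(s: str, pattern: str, replacement: str) -> str:
    # Regex variant: the pattern is interpreted as a regular expression.
    return re.sub(pattern, replacement, s)

replace_substring("foo bar foo", "foo", "baz")   # 'baz bar baz'
replace_substring_regex("a1b22c", r"\d+", "#")   # 'a#b#c'
```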
[jira] [Updated] (ARROW-10310) [C++][Gandiva] Add single argument round() in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-10310:
---
Labels: pull-request-available (was: )

> [C++][Gandiva] Add single argument round() in Gandiva
> --
>
> Key: ARROW-10310
> URL: https://issues.apache.org/jira/browse/ARROW-10310
> Project: Apache Arrow
> Issue Type: Task
> Components: C++ - Gandiva
> Reporter: Sagnik Chakraborty
> Assignee: Sagnik Chakraborty
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
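The ticket carries no description, so the semantics sketched below are an assumption: SQL engines commonly round halves away from zero, unlike Python's bankers' rounding. A Python sketch of that convention, for reference only:

```python
import math

# Assumed SQL-style semantics (not confirmed by the ticket): halves round
# away from zero, so 2.5 -> 3 and -2.5 -> -3, unlike Python's built-in
# round(), which rounds halves to the nearest even integer.
def sql_round(x):
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

sql_round(2.5)    # 3
sql_round(-2.5)   # -3
sql_round(2.4)    # 2
```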
[jira] [Commented] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214575#comment-17214575 ] Joris Van den Bossche commented on ARROW-10305:
---
[~palgal] Thanks for opening the issue. Such a substring matching filter is indeed not yet implemented.

A first step to enable this would be to have a "compute kernel" for substrings (from looking at the overview at https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst, I don't think we currently have functionality to create such substrings). A related compute kernel is {{match_substring}}, with which you could check that (using your example) "a" is present in the string. But that doesn't easily guarantee anything about the position of the substring in the string (although with a regular expression pattern, you could achieve this in some ways).

A second step would be to be able to "express" such a compute kernel in an Expression that can be used to filter the dataset (although this might not be needed for the dplyr syntax? It could maybe also be done with an actual compute filter kernel? cc [~npr]?).

> [C++][R] Filter datasets with string expressions
> --
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, R
> Reporter: Pal
> Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering a dataset (after open_dataset()). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>%
>   filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a"
> Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature?
> Thank you.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
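The point in the comment above, that a presence test like {{match_substring}} says nothing about position, can be illustrated with plain regular expressions (Python's `re` standing in for RE2): anchoring the pattern at the start of the string recovers what `substr(a, 1, 1) == "a"` asks for.

```python
import re

values = ["a", "a2", "3a"]

# Unanchored search only tests that "a" occurs somewhere in the string.
present = [bool(re.search("a", v)) for v in values]   # [True, True, True]

# Anchoring at the start of the string (re.match implies a leading ^)
# pins the position, so "3a" no longer qualifies.
prefixed = [bool(re.match("a", v)) for v in values]   # [True, True, False]
```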
[jira] [Assigned] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
[ https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maarten Breddels reassigned ARROW-9128:
---
Assignee: Maarten Breddels

> [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
> --
>
> Key: ARROW-9128
> URL: https://issues.apache.org/jira/browse/ARROW-9128
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Maarten Breddels
> Priority: Major
> Fix For: 3.0.0
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
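As a reference for the expected whitespace-trimming semantics, the three kernels named in the title map directly onto Python's `str` methods (the actual kernels would of course operate on Arrow string arrays in C++; this is only a semantics sketch):

```python
# Semantics sketch, not Arrow code: strip whitespace from both ends,
# the left end only, or the right end only.
def trim(s):
    return s.strip()

def ltrim(s):
    return s.lstrip()

def rtrim(s):
    return s.rstrip()

trim("  hi  ")   # 'hi'
ltrim("  hi  ")  # 'hi  '
rtrim("  hi  ")  # '  hi'
```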
[jira] [Created] (ARROW-10310) [C++][Gandiva] Add single argument round() in Gandiva
Sagnik Chakraborty created ARROW-10310:
---

Summary: [C++][Gandiva] Add single argument round() in Gandiva
Key: ARROW-10310
URL: https://issues.apache.org/jira/browse/ARROW-10310
Project: Apache Arrow
Issue Type: Task
Components: C++ - Gandiva
Reporter: Sagnik Chakraborty
Assignee: Sagnik Chakraborty

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions
[ https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pal updated ARROW-10305:
---
Description:
Hi,
Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering a dataset (after open_dataset()). Specifically, the code below:
{code:java}
library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>%
  filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a"
Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature?
Thank you.

was:
Hi,
Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering after open_dataset(). Specifically, the code below:
{code:java}
library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>%
  filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a"
Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature?
Thank you.

> [C++][R] Filter datasets with string expressions
> --
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, R
> Reporter: Pal
> Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not supported while filtering a dataset (after open_dataset()). Specifically, the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>%
>   filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == "a"
> Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, to filter and collect a very large dataset. Is there anything that can be done to implement this new feature?
> Thank you.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)