[jira] [Updated] (ARROW-10294) [Java] Resolve problems of DecimalVector APIs on ArrowBufs

2020-10-15 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-10294:

Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] Resolve problems of DecimalVector APIs on ArrowBufs
> --
>
> Key: ARROW-10294
> URL: https://issues.apache.org/jira/browse/ARROW-10294
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Unlike other fixed width vectors, DecimalVectors have some APIs that directly 
> manipulate an ArrowBuf (e.g. {{void set(int index, int isSet, int start, 
> ArrowBuf buffer)}}).
> After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they 
> work properly. 
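To illustrate why APIs taking an {{int}} start offset break once an ArrowBuf can exceed 2 GiB, here is a minimal plain-Java sketch. The class and method names are illustrative only, not Arrow's actual API:

```java
// Illustrative sketch, not Arrow's real API: 32-bit offset arithmetic
// silently overflows once a buffer position passes Integer.MAX_VALUE bytes.
public class OffsetWidth {
    static final int DECIMAL_WIDTH = 16; // bytes per 128-bit decimal value

    // Pre-64-bit style: the byte offset is computed in int and wraps around.
    static int offset32(int index) {
        return index * DECIMAL_WIDTH;
    }

    // Post-64-bit style: widening to long before multiplying keeps it exact.
    static long offset64(int index) {
        return (long) index * DECIMAL_WIDTH;
    }

    public static void main(String[] args) {
        int index = 200_000_000; // element ~3.2 GB into the buffer
        System.out.println(offset32(index)); // negative: the int wrapped
        System.out.println(offset64(index)); // 3200000000
    }
}
```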



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9475) [Java] Clean up usages of BaseAllocator, use BufferAllocator instead

2020-10-15 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-9475:
---
Fix Version/s: (was: 2.0.0)
   3.0.0

> [Java] Clean up usages of BaseAllocator, use BufferAllocator instead
> 
>
> Key: ARROW-9475
> URL: https://issues.apache.org/jira/browse/ARROW-9475
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.17.0
>Reporter: Hongze Zhang
>Assignee: Hongze Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Some classes' methods use BaseAllocator, or internally cast a BufferAllocator 
> to BaseAllocator, instead of requiring a BufferAllocator directly; e.g. the 
> code in AllocationManager and BufferLedger.
> This can be improved by exposing the necessary methods on BufferAllocator. 
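The kind of cleanup described can be sketched in plain Java. The names below are hypothetical stand-ins, not Arrow's actual allocator classes: the idea is to promote the method callers need onto the interface so no downcast is required.

```java
// Hypothetical sketch, not Arrow's real classes: expose a needed method on
// the BufferAllocator interface instead of casting down to BaseAllocator.
public class AllocatorCleanup {
    public interface BufferAllocator {
        // Promoted onto the interface so callers never need the concrete type.
        long getAllocatedMemory();
    }

    public static class BaseAllocator implements BufferAllocator {
        public long getAllocatedMemory() { return 64; }
    }

    // Before the cleanup a caller would write:
    //   long used = ((BaseAllocator) allocator).getAllocatedMemory();
    // After, it stays on the interface:
    public static long usedBytes(BufferAllocator allocator) {
        return allocator.getAllocatedMemory();
    }

    public static void main(String[] args) {
        System.out.println(usedBytes(new BaseAllocator())); // 64
    }
}
```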





[jira] [Resolved] (ARROW-9475) [Java] Clean up usages of BaseAllocator, use BufferAllocator instead

2020-10-15 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-9475.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 7768
[https://github.com/apache/arrow/pull/7768]

> [Java] Clean up usages of BaseAllocator, use BufferAllocator instead
> 
>
> Key: ARROW-9475
> URL: https://issues.apache.org/jira/browse/ARROW-9475
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 0.17.0
>Reporter: Hongze Zhang
>Assignee: Hongze Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Some classes' methods use BaseAllocator, or internally cast a BufferAllocator 
> to BaseAllocator, instead of requiring a BufferAllocator directly; e.g. the 
> code in AllocationManager and BufferLedger.
> This can be improved by exposing the necessary methods on BufferAllocator. 





[jira] [Resolved] (ARROW-10294) [Java] Resolve problems of DecimalVector APIs on ArrowBufs

2020-10-15 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-10294.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8455
[https://github.com/apache/arrow/pull/8455]

> [Java] Resolve problems of DecimalVector APIs on ArrowBufs
> --
>
> Key: ARROW-10294
> URL: https://issues.apache.org/jira/browse/ARROW-10294
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Unlike other fixed width vectors, DecimalVectors have some APIs that directly 
> manipulate an ArrowBuf (e.g. {{void set(int index, int isSet, int start, 
> ArrowBuf buffer)}}).
> After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they 
> work properly. 





[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10236:
---
Affects Version/s: 2.0.0

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the casts that DataFusion considers valid are a significant subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly so that when I add support for DictionaryArray casts 
> in Arrow, they also become part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao]





[jira] [Resolved] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale resolved ARROW-10236.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8460
[https://github.com/apache/arrow/pull/8460]

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the casts that DataFusion considers valid are a significant subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly so that when I add support for DictionaryArray casts 
> in Arrow, they also become part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao]





[jira] [Updated] (ARROW-10236) [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel

2020-10-15 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-10236:
---
Component/s: Rust

> [Rust] [DataFusion] Make DataFusion casting rules consistent with cast kernel
> -
>
> Key: ARROW-10236
> URL: https://issues.apache.org/jira/browse/ARROW-10236
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Andrew Lamb
>Assignee: Andrew Lamb
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> There are plan-time checks for valid type casts in DataFusion that are 
> designed to catch errors early, before plan execution.
> Sadly, the casts that DataFusion considers valid are a significant subset 
> of what the arrow cast kernel supports. The goal of this ticket is to bring 
> DataFusion to parity with the type casting supported by arrow and allow 
> DataFusion to plan all casts that are supported by the arrow cast kernel.
> (I want this implicitly so that when I add support for DictionaryArray casts 
> in Arrow, they also become part of DataFusion.)
> Previously the notions of coercion and casting were somewhat conflated. I 
> have tried to clarify them in https://github.com/apache/arrow/pull/8399 as 
> well.
> For more detail, see 
> https://github.com/apache/arrow/pull/8340#discussion_r501257096 from 
> [~jorgecarleitao]





[jira] [Commented] (ARROW-10187) [Rust] Test failures on 32 bit ARM (Raspberry Pi)

2020-10-15 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215162#comment-17215162
 ] 

Neville Dipale commented on ARROW-10187:


[~andygrove] 64-bit types and offsets would also be a blocker for supporting 
wasm32.

If someone completes ARROW-9453, perhaps we can gauge from that what effort it 
takes to support 32-bit.

> [Rust] Test failures on 32 bit ARM (Raspberry Pi)
> -
>
> Key: ARROW-10187
> URL: https://issues.apache.org/jira/browse/ARROW-10187
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> Perhaps these failures are to be expected and perhaps we can't really support 
> 32 bit?
>  
> {code:java}
> ---- array::array::tests::test_primitive_array_from_vec stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec' panicked at 
> 'assertion failed: `(left == right)`
>   left: `144`,
>  right: `104`', arrow/src/array/array.rs:2383:9
> ---- array::array::tests::test_primitive_array_from_vec_option stdout ----
> thread 'array::array::tests::test_primitive_array_from_vec_option' panicked 
> at 'assertion failed: `(left == right)`
>   left: `224`,
>  right: `176`', arrow/src/array/array.rs:2409:9
> ---- array::null::tests::test_null_array stdout ----
> thread 'array::null::tests::test_null_array' panicked at 'assertion failed: 
> `(left == right)`
>   left: `64`,
>  right: `32`', arrow/src/array/null.rs:134:9
> ---- array::union::tests::test_dense_union_i32 stdout ----
> thread 'array::union::tests::test_dense_union_i32' panicked at 'assertion 
> failed: `(left == right)`
>   left: `1024`,
>  right: `768`', arrow/src/array/union.rs:704:9
> ---- memory::tests::test_allocate stdout ----
> thread 'memory::tests::test_allocate' panicked at 'assertion failed: `(left 
> == right)`
>   left: `0`,
>  right: `32`', arrow/src/memory.rs:243:13
> {code}





[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Frank Du (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215140#comment-17215140
 ] 

Frank Du commented on ARROW-10321:
--

For the second issue, yes, it would be better fixed on the cmake side; I will 
propose a patch soon.

For the first issue, it seems the avx512 flag is accepted but the full set of 
avx512 intrinsic APIs does not exist for this compiler. Currently we simply use 
check_cxx_compiler_flag for the SIMD detection; it would be more accurate to 
use check_cxx_source_compiles to perform a try-build of some example SIMD code.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215139#comment-17215139
 ] 

Neal Richardson commented on ARROW-10321:
-

It may be similar to ARROW-9877 but that patch is included in 2.0.0, so that 
wasn't sufficient to fix this. In this case we're building on an older macOS 
SDK version than what Homebrew generally supports.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215136#comment-17215136
 ] 

Neal Richardson commented on ARROW-10321:
-

The X's next to the commits go to Travis builds that failed. 

Here's the first, where it builds with {{-DARROW_HAVE_RUNTIME_AVX512}} when it 
should not, and it fails: 
https://travis-ci.org/github/autobrew/homebrew-core/jobs/736104290#L723

Here's the second, where I set the runtime SIMD level to AVX2 so it didn't have 
that flag, yet it still tried to compile util/bpacking_avx512.cc: 
https://travis-ci.org/github/autobrew/homebrew-core/jobs/736173702#L725

Then you can see in the discussion on the PR that I applied a patch wrapping 
that whole file in {{#if defined(ARROW_HAVE_RUNTIME_AVX512)}}, and the next 
build compiled successfully.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Resolved] (ARROW-10174) [Java] Reading of Dictionary encoded struct vector fails

2020-10-15 Thread Liya Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan resolved ARROW-10174.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8363
[https://github.com/apache/arrow/pull/8363]

> [Java] Reading of Dictionary encoded struct vector fails 
> -
>
> Key: ARROW-10174
> URL: https://issues.apache.org/jira/browse/ARROW-10174
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 1.0.1
>Reporter: Benjamin Wilhelm
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Write an index vector and a dictionary with a dictionary vector of the type 
> {{Struct}} using an {{ArrowStreamWriter}}. Reading this again fails with an 
> exception.
> Code to reproduce:
> {code:java}
> final RootAllocator allocator = new RootAllocator();
> // Create the dictionary
> final StructVector dict = StructVector.empty("Dict", allocator);
> final NullableStructWriter dictWriter = dict.getWriter();
> final IntWriter dictA = dictWriter.integer("a");
> final IntWriter dictB = dictWriter.integer("b");
> for (int i = 0; i < 3; i++) {
>   dictWriter.start();
>   dictA.writeInt(i);
>   dictB.writeInt(i);
>   dictWriter.end();
> }
> dict.setValueCount(3);
> final Dictionary dictionary = new Dictionary(dict, new DictionaryEncoding(1, 
> false, null));
> // Create the vector
> final Random random = new Random();
> final StructVector vector = StructVector.empty("Dict", allocator);
> final NullableStructWriter vectorWriter = vector.getWriter();
> final IntWriter vectorA = vectorWriter.integer("a");
> final IntWriter vectorB = vectorWriter.integer("b");
> for (int i = 0; i < 10; i++) {
>   int v = random.nextInt(3);
>   vectorWriter.start();
>   vectorA.writeInt(v);
>   vectorB.writeInt(v);
>   vectorWriter.end();
> }
> vector.setValueCount(10);
> // Encode the vector using the dictionary
> final IntVector indexVector = (IntVector) DictionaryEncoder.encode(vector, 
> dictionary);
> // Write the vector to out
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final VectorSchemaRoot root = new 
> VectorSchemaRoot(Collections.singletonList(indexVector.getField()),
>   Collections.singletonList(indexVector));
> final ArrowStreamWriter writer = new ArrowStreamWriter(root, new 
> MapDictionaryProvider(dictionary),
>   Channels.newChannel(out));
> writer.start();
> writer.writeBatch();
> writer.end();
> // Read the vector from out
> try (final ArrowStreamReader reader = new ArrowStreamReader(new 
> ByteArrayInputStream(out.toByteArray()),
>   allocator)) {
>   reader.loadNextBatch();
>   final VectorSchemaRoot readRoot = reader.getVectorSchemaRoot();
>   final FieldVector readIndexVector = readRoot.getVector(0);
>   // Get the dictionary and decode
> final Map<Long, Dictionary> readDictionaryMap = 
> reader.getDictionaryVectors();
>   final Dictionary readDictionary = 
> readDictionaryMap.get(readIndexVector.getField().getDictionary().getId());
>   final ValueVector readVector = 
> DictionaryEncoder.decode(readIndexVector, readDictionary);
> }
> {code}
> Exception:
> {code}
> java.lang.IllegalArgumentException: not all nodes and buffers were consumed. 
> nodes: [ArrowFieldNode [length=3, nullCount=0], ArrowFieldNode [length=3, 
> nullCount=0]] buffers: [ArrowBuf[21], address:140118352739688, length:1, 
> ArrowBuf[22], address:140118352739696, length:12, ArrowBuf[23], 
> address:140118352739712, length:1, ArrowBuf[24], address:140118352739720, 
> length:12]
>   at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:63)
>   at org.apache.arrow.vector.ipc.ArrowReader.load(ArrowReader.java:241)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.loadDictionary(ArrowReader.java:232)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:129)
>   at com.knime.AppTest.testDictionaryStruct(AppTest.java:83)
> {code}
> If I see it correctly, the error happens in 
> {{DictionaryUtilities#toMessageFormat}}: when a dictionary-encoded vector is 
> encountered, the children of the memory-format field are still used (none, 
> because the index type is Int). However, the children of the dictionary 
> vector's field should be mapped to the message format and set as children.
> I can create a fix and open a pull request.
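The child-mapping mistake described above can be sketched in plain Java. The toy {{Field}} class below is illustrative only, not Arrow's real {{Field}}:

```java
import java.util.Collections;
import java.util.List;

// Toy sketch, not Arrow's real classes: when a field is dictionary-encoded,
// the message-format field should carry the *dictionary* vector's children,
// not the children of the in-memory index field (an Int vector has none).
public class DictChildren {
    public static class Field {
        public final String name;
        public final List<Field> children;
        public Field(String name, List<Field> children) {
            this.name = name;
            this.children = children;
        }
    }

    // Buggy shape: copies children from the index field (always empty).
    public static Field toMessageFormatBuggy(Field indexField, Field dictField) {
        return new Field(indexField.name, indexField.children);
    }

    // Fixed shape: maps the dictionary field's children into the result.
    public static Field toMessageFormatFixed(Field indexField, Field dictField) {
        return new Field(indexField.name, dictField.children);
    }

    public static void main(String[] args) {
        List<Field> none = Collections.emptyList();
        Field dict = new Field("Dict",
                List.of(new Field("a", none), new Field("b", none)));
        Field index = new Field("Dict", none); // Int index vector: no children
        System.out.println(toMessageFormatBuggy(index, dict).children.size()); // 0
        System.out.println(toMessageFormatFixed(index, dict).children.size()); // 2
    }
}
```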





[jira] [Updated] (ARROW-10323) [Release][wheel] Add missing verification setup step

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10323:
---
Labels: pull-request-available  (was: )

> [Release][wheel] Add missing verification setup step
> 
>
> Key: ARROW-10323
> URL: https://issues.apache.org/jira/browse/ARROW-10323
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Yibo Cai (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215117#comment-17215117
 ] 

Yibo Cai commented on ARROW-10321:
--

Maybe similar problem as ARROW-9877?
I didn't find build logs at the autobrew PR link.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Created] (ARROW-10323) [Release][wheel] Add missing verification setup step

2020-10-15 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-10323:


 Summary: [Release][wheel] Add missing verification setup step
 Key: ARROW-10323
 URL: https://issues.apache.org/jira/browse/ARROW-10323
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-10322) [C++][Dataset] Minimize Expression to a wrapper around compute::Function

2020-10-15 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-10322:


 Summary: [C++][Dataset] Minimize Expression to a wrapper around 
compute::Function
 Key: ARROW-10322
 URL: https://issues.apache.org/jira/browse/ARROW-10322
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 3.0.0


The Expression class hierarchy was originally intended to provide generic, 
structured representations of compute functionality. On the former point it 
has been superseded by compute::{Function, Kernel, ...}, which encapsulate 
validation and execution. In light of this, Expression can be drastically 
simplified and improved by composition with these classes. Each responsibility 
that can be deferred means less boilerplate when exposing a new compute 
function for use in datasets. Ideally, any compute function will be immediately 
available for use in a filter or projection.

{code}
struct Expression {
  using Literal = std::shared_ptr<Scalar>;

  struct Projection {
    std::vector<std::string> names;
    std::vector<Expression> values;
  };

  struct Call {
    std::shared_ptr<compute::Function> function;
    std::shared_ptr<compute::FunctionOptions> options;
    std::vector<Expression> arguments;
  };

  util::variant<Literal, Projection, Call> value;
};
{code}

A simple discriminated union as above should be sufficient to represent 
arbitrary filters and projections: any expression which results in type 
{{bool}} is a valid filter, and any expression which is a {{Projection}} may be 
used to map one record batch to another.

Expression simplification (currently implemented in {{Expression::Assume}}) is 
an optimization used for example in predicate pushdown, and therefore need not 
exhaustively cover the full space of available compute functions.





[jira] [Updated] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10106:
---
Labels: flight pull-request-available  (was: flight)

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> ---
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: flight, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> OutboundStreamListener has a method isReady() that FlightProducer 
> implementations need to poll during getStream() to avoid buffering too much 
> data.
> An enhancement would be to allow setting a callback to run (for example, 
> notifying a CountDownLatch) so that FlightProducer implementations don't need 
> to busy-wait.
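The proposed callback can be sketched in plain Java with a CountDownLatch. The listener interface and {{FakeListener}} below are stand-ins, not Flight's real classes:

```java
import java.util.concurrent.CountDownLatch;

// Sketch (hypothetical listener, not Flight's real OutboundStreamListener):
// how an on-ready callback removes the need to poll isReady() in a loop.
public class ReadySignal {
    public interface StreamListener {
        boolean isReady();
        void setOnReadyHandler(Runnable handler); // the proposed callback hook
    }

    public static class FakeListener implements StreamListener {
        private volatile boolean ready;
        private volatile Runnable handler = () -> {};
        public boolean isReady() { return ready; }
        public void setOnReadyHandler(Runnable h) { handler = h; }
        public void becomeReady() { ready = true; handler.run(); }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeListener listener = new FakeListener();
        CountDownLatch readable = new CountDownLatch(1);
        // Instead of `while (!listener.isReady()) { /* spin */ }` ...
        listener.setOnReadyHandler(readable::countDown);
        listener.becomeReady(); // transport signals readiness
        readable.await();       // producer wakes up without busy-waiting
        System.out.println(listener.isReady()); // true
    }
}
```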





[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215082#comment-17215082
 ] 

Neal Richardson commented on ARROW-10321:
-

Sure, though all of the build logs and discussion are available on the PR I 
linked.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Updated] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener

2020-10-15 Thread James Duong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Duong updated ARROW-10106:

Labels: flight  (was: )

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> ---
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: James Duong
>Priority: Major
>  Labels: flight
>
> OutboundStreamListener has a method isReady() that FlightProducer 
> implementations need to poll during getStream() to avoid buffering too much 
> data.
> An enhancement would be to allow setting a callback to run (for example, 
> notifying a CountDownLatch) so that FlightProducer implementations don't need 
> to busy-wait.





[jira] [Assigned] (ARROW-10106) [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener

2020-10-15 Thread James Duong (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Duong reassigned ARROW-10106:
---

Assignee: James Duong

> [FlightRPC][Java] Expose onIsReady() callback on OutboundStreamListener
> ---
>
> Key: ARROW-10106
> URL: https://issues.apache.org/jira/browse/ARROW-10106
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Java
>Reporter: James Duong
>Assignee: James Duong
>Priority: Major
>  Labels: flight
>
> OutboundStreamListener has a method isReady() that FlightProducer 
> implementations need to poll during getStream() to avoid buffering too much 
> data.
> An enhancement would be to allow setting a callback to run (for example, 
> notifying a CountDownLatch) so that FlightProducer implementations don't need 
> to busy-wait.





[jira] [Commented] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215069#comment-17215069
 ] 

Antoine Pitrou commented on ARROW-10321:


> The check for AVX512 support was returning true when in fact the compiler did 
> not support it

You should point to the logs so that we can look at the symptoms.

> [C++] Building AVX512 code when we should not
> -
>
> Key: ARROW-10321
> URL: https://issues.apache.org/jira/browse/ARROW-10321
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>
> On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
> Arrow for an old macOS SDK version, we found what I believe are 2 different 
> problems:
> 1. The check for AVX512 support was returning true when in fact the compiler 
> did not support it
> 2. Even when we manually set the runtime SIMD level to less-than-AVX512, it 
> was still trying to compile one of the AVX512 files, which failed. I added a 
> patch that made that file conditional, but there's probably a proper cmake 
> way to tell it not to compile that file at all
> cc [~yibo] [~apitrou]





[jira] [Created] (ARROW-10321) [C++] Building AVX512 code when we should not

2020-10-15 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-10321:
---

 Summary: [C++] Building AVX512 code when we should not
 Key: ARROW-10321
 URL: https://issues.apache.org/jira/browse/ARROW-10321
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Neal Richardson


On https://github.com/autobrew/homebrew-core/pull/31, where we are packaging 
Arrow for an old macOS SDK version, we found what I believe are 2 different 
problems:

1. The check for AVX512 support was returning true when in fact the compiler 
did not support it
2. Even when we manually set the runtime SIMD level to less-than-AVX512, it was 
still trying to compile one of the AVX512 files, which failed. I added a patch 
that made that file conditional, but there's probably a proper cmake way to 
tell it not to compile that file at all

cc [~yibo] [~apitrou]





[jira] [Updated] (ARROW-9747) [C++][Java][Format] Support Decimal256 Type

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9747:
--
Labels: pull-request-available  (was: )

> [C++][Java][Format] Support Decimal256 Type
> ---
>
> Key: ARROW-9747
> URL: https://issues.apache.org/jira/browse/ARROW-9747
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format, Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-10308) [Python] read_csv from python is slow on some work loads

2020-10-15 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10308:
-
Summary: [Python] read_csv from python is slow on some work loads  (was: 
read_csv from python is slow on some work loads)

> [Python] read_csv from python is slow on some work loads
> 
>
> Key: ARROW-10308
> URL: https://issues.apache.org/jira/browse/ARROW-10308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 1.0.1
> Environment: Machine: Azure, 48 vcpus, 384GiB ram
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: 
> https://github.com/drorspei/arrow-csv-benchmark
>Reporter: Dror Speiser
>Priority: Minor
>  Labels: csv, performance
> Attachments: Dockerfile, arrow-csv-benchmark-plot.png, 
> arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, 
> profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, 
> processing data at around 0.5 GiB/s. "Real workloads" means many string, float, 
> and all-null columns, and a large file size (5-10 GiB), though the file size 
> didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of 
> the time is spent on shared-pointer lock mechanisms (though I'm not sure if 
> this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which 
> reproduces the speeds I see. Building the docker image and running it on a 
> large Azure machine, I get speeds around 0.3-1.0 GiB/s, and it's mostly 
> around 0.5GiB/s.
> This is all also available here: 
> https://github.com/drorspei/arrow-csv-benchmark





[jira] [Updated] (ARROW-10276) [Python] Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-15 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10276:
-
Summary: [Python] Armv7 orc and flight not supported for build. Compat 
error on using with spark  (was: Armv7 orc and flight not supported for build. 
Compat error on using with spark)

> [Python] Armv7 orc and flight not supported for build. Compat error on using 
> with spark
> ---
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex-A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below, 
> but cannot use the ORC and Flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a Spark DataFrame to pandas using pyarrow, but now it 
> complains about a missing compat attribute. I have attached images below.
> Any help would be appreciated. Thanks.
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Updated] (ARROW-10301) [C++] Add "all" boolean reducing kernel

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10301:
---
Labels: analytics pull-request-available  (was: analytics)

> [C++] Add "all" boolean reducing kernel
> ---
>
> Key: ARROW-10301
> URL: https://issues.apache.org/jira/browse/ARROW-10301
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Andrew Wieteska
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: analytics, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discussed on GitHub: 
> [https://github.com/apache/arrow/pull/8294#discussion_r504034461]





[jira] [Updated] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string

2020-10-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10145:
--
Fix Version/s: (was: 2.0.0)
   3.0.0

> [C++][Dataset] Assert integer overflow in partitioning falls back to string
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}





[jira] [Resolved] (ARROW-10145) [C++][Dataset] Assert integer overflow in partitioning falls back to string

2020-10-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-10145.
---
Fix Version/s: (was: 3.0.0)
   2.0.0
   Resolution: Fixed

Issue resolved by pull request 8462
[https://github.com/apache/arrow/pull/8462]

> [C++][Dataset] Assert integer overflow in partitioning falls back to string
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}





[jira] [Updated] (ARROW-10320) Convert RecordBatchIterator to a Stream

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10320:
---
Labels: pull-request-available  (was: )

> Convert RecordBatchIterator to a Stream
> ---
>
> Key: ARROW-10320
> URL: https://issues.apache.org/jira/browse/ARROW-10320
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> So that the unit of work is a single record batch instead of a part of a 
> partition.





[jira] [Created] (ARROW-10320) Convert RecordBatchIterator to a Stream

2020-10-15 Thread Jira
Jorge Leitão created ARROW-10320:


 Summary: Convert RecordBatchIterator to a Stream
 Key: ARROW-10320
 URL: https://issues.apache.org/jira/browse/ARROW-10320
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge Leitão
Assignee: Jorge Leitão
 Fix For: 3.0.0


So that the unit of work is a single record batch instead of a part of a 
partition.





[jira] [Resolved] (ARROW-9479) [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto Table.empty

2020-10-15 Thread Brian Hulette (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette resolved ARROW-9479.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 7771
[https://github.com/apache/arrow/pull/7771]

> [JS] Table.from fails for zero-item Lists, FixedSizeLists, Maps. ditto 
> Table.empty
> --
>
> Key: ARROW-9479
> URL: https://issues.apache.org/jira/browse/ARROW-9479
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.17.1
>Reporter: Nicholas Roberts
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Deserializing zero-item tables (as generated by Table.empty or, in this case, 
> pyarrow.Schema.serialize) with a schema containing a List, FixedSizeList or Map 
> fails due to an unconditional 
> {code:java}
> new Data(/* preceding parameters */ buffers, [childData]){code}
> statement; the childData parameter resolves to [undefined] rather than the 
> desired [].
>  See [https://github.com/apache/arrow/pull/7771] for further details.
>  





[jira] [Updated] (ARROW-10319) [Flight][Go] Add Context to Client Auth Handler functions for Flight

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10319:
---
Labels: pull-request-available  (was: )

> [Flight][Go] Add Context to Client Auth Handler functions for Flight
> 
>
> Key: ARROW-10319
> URL: https://issues.apache.org/jira/browse/ARROW-10319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC, Go
>Reporter: Matt Topol
>Assignee: Matt Topol
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> During my usage I found that if I wanted to reuse an existing flight client 
> that required authentication, it was difficult to reuse the auth handler, 
> since there wasn't a way to tell which goroutine / which auth made a 
> particular request. Passing the context to the client auth handler allows 
> passing information to the auth handler via the context, which can then be 
> used by consumers to reuse an auth handler, so that an entire flight client 
> can be shared across multiple goroutines if desired.





[jira] [Created] (ARROW-10319) [Flight][Go] Add Context to Client Auth Handler functions for Flight

2020-10-15 Thread Matt Topol (Jira)
Matt Topol created ARROW-10319:
--

 Summary: [Flight][Go] Add Context to Client Auth Handler functions 
for Flight
 Key: ARROW-10319
 URL: https://issues.apache.org/jira/browse/ARROW-10319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Go
Reporter: Matt Topol
Assignee: Matt Topol


During my usage I found that if I wanted to reuse an existing flight client 
that required authentication, it was difficult to reuse the auth handler, since 
there wasn't a way to tell which goroutine / which auth made a particular 
request. Passing the context to the client auth handler allows passing 
information to the auth handler via the context, which can then be used by 
consumers to reuse an auth handler, so that an entire flight client can be 
shared across multiple goroutines if desired.





[jira] [Updated] (ARROW-8113) [C++] Implement a lighter-weight variant

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8113:
--
Labels: pull-request-available  (was: )

> [C++] Implement a lighter-weight variant
> 
>
> Key: ARROW-8113
> URL: https://issues.apache.org/jira/browse/ARROW-8113
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{util::variant}} is an extremely useful structure but its header slows 
> compilation significantly, so using it in public headers is questionable 
> https://github.com/apache/arrow/pull/6545#discussion_r388406246
> I'll try writing a lighter weight version.





[jira] [Created] (ARROW-10318) [C++] Use pimpl idiom in CSV parser

2020-10-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10318:
--

 Summary: [C++] Use pimpl idiom in CSV parser
 Key: ARROW-10318
 URL: https://issues.apache.org/jira/browse/ARROW-10318
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0








[jira] [Updated] (ARROW-8845) [C++] Selective compression on the wire

2020-10-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8845:
--
Component/s: C++

> [C++] Selective compression on the wire
> ---
>
> Key: ARROW-8845
> URL: https://issues.apache.org/jira/browse/ARROW-8845
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC
>Reporter: Amol Umbarkar
>Priority: Major
>
> Dask seems to apply compression selectively, when it is found to be useful: it 
> picks a ~10 kB sample upfront, compresses it, and if the results are good, the 
> whole batch is compressed. This saves decompression effort on the receiver side.
>  
> Please take a look at 
> [https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression]
>  
> I thought this could be relevant to Arrow batch transfers as well.





[jira] [Updated] (ARROW-8845) [C++] Selective compression on the wire

2020-10-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-8845:
--
Summary: [C++] Selective compression on the wire  (was: [c++] Selective 
compression on the wire)

> [C++] Selective compression on the wire
> ---
>
> Key: ARROW-8845
> URL: https://issues.apache.org/jira/browse/ARROW-8845
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: Amol Umbarkar
>Priority: Major
>
> Dask seems to apply compression selectively, when it is found to be useful: it 
> picks a ~10 kB sample upfront, compresses it, and if the results are good, the 
> whole batch is compressed. This saves decompression effort on the receiver side.
>  
> Please take a look at 
> [https://blog.dask.org/2016/04/14/dask-distributed-optimizing-protocol#problem-3-unwanted-compression]
>  
> I thought this could be relevant to Arrow batch transfers as well.





[jira] [Commented] (ARROW-10314) [C++] CSV wrong row number in error message

2020-10-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214835#comment-17214835
 ] 

Antoine Pitrou commented on ARROW-10314:


Unfortunately, when reading in multi-threaded mode, several blocks are being 
parsed at once, and you don't know how many rows the previous blocks contain. We 
would have to keep the error in memory for later, until the preceding block 
lengths are resolved.

> [C++] CSV wrong row number in error message
> ---
>
> Key: ARROW-10314
> URL: https://issues.apache.org/jira/browse/ARROW-10314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> When I try to read a CSV file with wrong data, I get a message like:
> {code:java}
> CSV file reader error: Invalid: In CSV column #0: CSV conversion error to 
> timestamp[s]: invalid value '1'
> {code}
> It would be very helpful to add information about the row with the wrong data, 
> e.g.
> {code:java}
> CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV 
> conversion error to timestamp[s]: invalid value '1'
> {code}





[jira] [Commented] (ARROW-10315) [C++] CSV skip wrong rows

2020-10-15 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214832#comment-17214832
 ] 

Antoine Pitrou commented on ARROW-10315:


Skipping rows entirely will be difficult. We could add an option to emit nulls 
in that case, though. What do you think?

> [C++] CSV skip wrong rows
> -
>
> Key: ARROW-10315
> URL: https://issues.apache.org/jira/browse/ARROW-10315
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.1
>Reporter: Maciej
>Priority: Major
>
> It would be helpful to add another option to ReadOptions which would enable 
> skipping rows with wrong data (e.g. a data type mismatch with the column type) 
> and continue reading the next rows. The wrong row numbers could be reported at 
> the end of processing.
> This way I can deal with the wrongly formatted data, or ignore it if I have a 
> large load success rate and I don't care about the exceptions.





[jira] [Created] (ARROW-10317) [C++] Consider adding documentation for FunctionOption classes

2020-10-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10317:
--

 Summary: [C++] Consider adding documentation for FunctionOption 
classes
 Key: ARROW-10317
 URL: https://issues.apache.org/jira/browse/ARROW-10317
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0


This would allow generating improved documentation for bindings (e.g. Python).






[jira] [Created] (ARROW-10316) [Python] Consider using __wrapped__ for compute function introspection

2020-10-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10316:
--

 Summary: [Python] Consider using __wrapped__ for compute function 
introspection
 Key: ARROW-10316
 URL: https://issues.apache.org/jira/browse/ARROW-10316
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


As suggested by [~bkietz] here:
https://github.com/apache/arrow/pull/8457#discussion_r504966207






[jira] [Updated] (ARROW-10316) [Python] Consider using __wrapped__ for compute function introspection

2020-10-15 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-10316:
---
Fix Version/s: 3.0.0

> [Python] Consider using __wrapped__ for compute function introspection
> --
>
> Key: ARROW-10316
> URL: https://issues.apache.org/jira/browse/ARROW-10316
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 3.0.0
>
>
> As suggested by [~bkietz] here:
> https://github.com/apache/arrow/pull/8457#discussion_r504966207





[jira] [Updated] (ARROW-4804) [Rust] Read temporal values from CSV

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4804:
--
Labels: beginner  (was: )

> [Rust] Read temporal values from CSV
> 
>
> Key: ARROW-4804
> URL: https://issues.apache.org/jira/browse/ARROW-4804
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: beginner
>
> CSV reader should support reading temporal values.
> Should support timestamp, date and time, with sane defaults provided for 
> schema inference.
> To keep inference performant, the user should provide a Vec of which columns to 
> try to convert to a temporal array.





[jira] [Updated] (ARROW-4803) [Rust] Read temporal values from JSON

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-4803:
--
Labels: beginner  (was: )

> [Rust] Read temporal values from JSON
> -
>
> Key: ARROW-4803
> URL: https://issues.apache.org/jira/browse/ARROW-4803
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 0.12.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: beginner
>
> Ability to parse strings that look like timestamps to timestamp type. Need to 
> consider whether only timestamp type should be supported as most JSON 
> libraries stick to ISO8601. It might also be inefficient to use regex for 
> timestamps, so the user should provide a hint of which columns to convert to 
> timestamps.





[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9911:
--
Labels: beginner  (was: )

> [Rust][DataFusion] SELECT  with no FROM clause should produce a 
> single row of output
> 
>
> Key: ARROW-9911
> URL: https://issues.apache.org/jira/browse/ARROW-9911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Minor
>  Labels: beginner
>
> This is somewhat of a special case, but it is useful for demonstration / 
> testing expressions. 
> A select expression with no FROM clause, such as "select 1", should produce a 
> single row. Today DataFusion accepts the query but produces no rows.
> Actual output:
> {code}
> arrow/rust$ cargo run --release  --bin datafusion-cli 
> Finished release [optimized] target(s) in 0.25s
>  Running `target/release/datafusion-cli`
> > select 1 ;
> 0 rows in set. Query took 0 seconds.
> {code}
> Expected output is a single row, with the value 1. Here is an example using 
> SQLite:
> {code}
> $ sqlite3 
> SQLite version 3.28.0 2019-04-15 14:49:49
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> select 1;
> 1
> sqlite> 
> {code}





[jira] [Updated] (ARROW-9911) [Rust][DataFusion] SELECT with no FROM clause should produce a single row of output

2020-10-15 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-9911:
--
Component/s: Rust - DataFusion
 Rust

> [Rust][DataFusion] SELECT  with no FROM clause should produce a 
> single row of output
> 
>
> Key: ARROW-9911
> URL: https://issues.apache.org/jira/browse/ARROW-9911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Andrew Lamb
>Priority: Minor
>
> This is somewhat of a special case, but it is useful for demonstration / 
> testing expressions. 
> A select expression with no FROM clause, such as "select 1", should produce a 
> single row. Today DataFusion accepts the query but produces no rows.
> Actual output:
> {code}
> arrow/rust$ cargo run --release  --bin datafusion-cli 
> Finished release [optimized] target(s) in 0.25s
>  Running `target/release/datafusion-cli`
> > select 1 ;
> 0 rows in set. Query took 0 seconds.
> {code}
> Expected output is a single row, with the value 1. Here is an example using 
> SQLite:
> {code}
> $ sqlite3 
> SQLite version 3.28.0 2019-04-15 14:49:49
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> select 1;
> 1
> sqlite> 
> {code}





[jira] [Commented] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214762#comment-17214762
 ] 

Neal Richardson commented on ARROW-10305:
-

In terms of compute kernels, I think at least some of the pattern matching and 
extraction is happening in ARROW-10195.

Another missing piece, which [~bkietz] was writing up a JIRA for, is being able 
to create dataset expressions that call any arbitrary compute function. 

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported while filtering a dataset (after open_dataset()). Specifically, 
> the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, for filtering and 
> collecting a very large dataset. Is there anything that can be done to implement 
> this new feature?
> Thank you.





[jira] [Created] (ARROW-10315) [C++] CSV skip wrong rows

2020-10-15 Thread Maciej (Jira)
Maciej created ARROW-10315:
--

 Summary: [C++] CSV skip wrong rows
 Key: ARROW-10315
 URL: https://issues.apache.org/jira/browse/ARROW-10315
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Maciej


It would be helpful to add another option to ReadOptions which would enable 
skipping rows with wrong data (e.g. a data type mismatch with the column type) 
and continue reading the next rows. The wrong row numbers could be reported at 
the end of processing.

This way I can deal with the wrongly formatted data, or ignore it if I have a 
large load success rate and I don't care about the exceptions.





[jira] [Created] (ARROW-10314) [C++] CSV wrong row number in error message

2020-10-15 Thread Maciej (Jira)
Maciej created ARROW-10314:
--

 Summary: [C++] CSV wrong row number in error message
 Key: ARROW-10314
 URL: https://issues.apache.org/jira/browse/ARROW-10314
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Maciej


When I try to read a CSV file with wrong data, I get a message like:
{code:java}
CSV file reader error: Invalid: In CSV column #0: CSV conversion error to 
timestamp[s]: invalid value '1'
{code}
It would be very helpful to add information about the row with the wrong data, e.g.
{code:java}
CSV file reader error: Invalid: In CSV column #0 line number #123456: CSV 
conversion error to timestamp[s]: invalid value '1'
{code}





[jira] [Updated] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10313:
---
Labels: pull-request-available  (was: )

> [C++] Improve UTF8 validation speed and CSV string conversion
> -
>
> Key: ARROW-10313
> URL: https://issues.apache.org/jira/browse/ARROW-10313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV 
> string conversion.
> This is because we must validate many small UTF8 strings individually.





[jira] [Created] (ARROW-10313) [C++] Improve UTF8 validation speed and CSV string conversion

2020-10-15 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-10313:
--

 Summary: [C++] Improve UTF8 validation speed and CSV string 
conversion
 Key: ARROW-10313
 URL: https://issues.apache.org/jira/browse/ARROW-10313
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 3.0.0


Based on profiling from ARROW-10308, UTF8 validation is a bottleneck of CSV 
string conversion.

This is because we must validate many small UTF8 strings individually.
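To illustrate the difference, here is a plain-Python sketch comparing 
per-string validation with validating one concatenated buffer in a single pass 
(illustrative only; Arrow's C++ validator works directly on raw buffers, and 
the function names here are invented):

```python
def all_valid_utf8_individually(values):
    """Validate each small byte string separately (per-call overhead)."""
    for v in values:
        try:
            v.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


def all_valid_utf8_batched(values):
    """Validate one concatenated buffer in a single pass.

    Joining with a known-valid one-byte separator (0x00) preserves
    validity: 0x00 is never a UTF-8 continuation byte, so an invalid
    sequence in any element stays invalid after concatenation.
    """
    try:
        b"\x00".join(values).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Both functions agree on validity; the batched variant avoids the per-string 
call overhead that this issue identifies as the bottleneck.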





[jira] [Updated] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10122:
---
Labels: pull-request-available  (was: )

> [Python] Selecting one column of multi-index results in a duplicated value 
> column.
> --
>
> Key: ARROW-10122
> URL: https://issues.apache.org/jira/browse/ARROW-10122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
>Reporter: Troy Zimmerman
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I read one column of a multi-index, that column is duplicated as a value 
> column in the resulting Pandas data frame.
> {code:python}
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), 
> ... "value": np.arange(5)}) 
> >>> df = table.to_pandas().set_index(["first", "second"])
> >>> print(df)
>   value
> first second
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> 4 4   4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>   value
> first second
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> 4 4   4
> {code}
> This does not work as expected, as the {{first}} column is both an index and 
> a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>first  value
> first
> 0  0  0
> 1  1  1
> 2  2  2
> 3  3  3
> 4  4  4{code}
> This is easy to work around by specifying the full multi-index in 
> {{to_table}}, but does this behavior make sense?





[jira] [Commented] (ARROW-10122) [Python] Selecting one column of multi-index results in a duplicated value column.

2020-10-15 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214652#comment-17214652
 ] 

Joris Van den Bossche commented on ARROW-10122:
---

A reproducer that does not involve parquet or datasets:

{code}
In [28]: table = pa.table({"first": list(range(5)), "second": list(range(5)), 
"value": np.arange(5)})

In [29]: df = table.to_pandas().set_index(["first", "second"])

In [30]: table = pa.Table.from_pandas(df)

In [31]: table.to_pandas()
Out[31]: 
  value
first second   
0 0   0
1 1   1
2 2   2
3 3   3
4 4   4

In [32]: table.select(["first", "value"]).to_pandas()
Out[32]: 
   first  value
first  
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
{code}

> [Python] Selecting one column of multi-index results in a duplicated value 
> column.
> --
>
> Key: ARROW-10122
> URL: https://issues.apache.org/jira/browse/ARROW-10122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: arrow 1.0.1
> parquet 1.5.1
> pandas 1.1.0
> pyarrow 1.0.1
>Reporter: Troy Zimmerman
>Priority: Minor
>
> When I read one column of a multi-index, that column is duplicated as a value 
> column in the resulting Pandas data frame.
> {code:python}
> >>> table = pa.table({"first": list(range(5)), "second": list(range(5)), 
> ... "value": np.arange(5)}) 
> >>> df = table.to_pandas().set_index(["first", "second"])
> >>> print(df)
>   value
> first second
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> 4 4   4
> >>> pq.write_table(pa.Table.from_pandas(df), "/tmp/test.parquet")
> >>> data = ds.dataset("/tmp/test.parquet")
> {code}
> This works as expected, as does selecting all or no columns.
> {code:python}
> >>> print(data.to_table(columns=["first", "second", "value"]).to_pandas())
>   value
> first second
> 0 0   0
> 1 1   1
> 2 2   2
> 3 3   3
> 4 4   4
> {code}
> This does not work as expected, as the {{first}} column is both an index and 
> a value.
> {code:python}
> >>> print(data.to_table(columns=["first", "value"]).to_pandas())
>first  value
> first
> 0  0  0
> 1  1  1
> 2  2  2
> 3  3  3
> 4  4  4{code}
> This is easy to work around by specifying the full multi-index in 
> {{to_table}}, but does this behavior make sense?





[jira] [Created] (ARROW-10312) Implement unix_timestamp function in gandiva

2020-10-15 Thread Naman Udasi (Jira)
Naman Udasi created ARROW-10312:
---

 Summary: Implement unix_timestamp function in gandiva
 Key: ARROW-10312
 URL: https://issues.apache.org/jira/browse/ARROW-10312
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Naman Udasi








[jira] [Updated] (ARROW-10311) [Release] Update crossbow verification process

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10311:
---
Labels: pull-request-available  (was: )

> [Release] Update crossbow verification process
> --
>
> Key: ARROW-10311
> URL: https://issues.apache.org/jira/browse/ARROW-10311
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The automated crossbow RC verification tasks need to be updated, since 
> multiple builds are failing.





[jira] [Created] (ARROW-10311) [Release] Update crossbow verification process

2020-10-15 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-10311:
---

 Summary: [Release] Update crossbow verification process
 Key: ARROW-10311
 URL: https://issues.apache.org/jira/browse/ARROW-10311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 3.0.0


The automated crossbow RC verification tasks need to be updated, since 
multiple builds are failing.





[jira] [Updated] (ARROW-10306) [C++] Add string replacement kernel

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10306:
---
Labels: pull-request-available  (was: )

> [C++] Add string replacement kernel 
> 
>
> Key: ARROW-10306
> URL: https://issues.apache.org/jira/browse/ARROW-10306
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Maarten Breddels
>Assignee: Maarten Breddels
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Similar to 
> [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html]
>  with a plain variant, and optionally a RE2 version.





[jira] [Updated] (ARROW-10310) [C++][Gandiva] Add single argument round() in Gandiva

2020-10-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10310:
---
Labels: pull-request-available  (was: )

> [C++][Gandiva] Add single argument round() in Gandiva
> -
>
> Key: ARROW-10310
> URL: https://issues.apache.org/jira/browse/ARROW-10310
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: Sagnik Chakraborty
>Assignee: Sagnik Chakraborty
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-15 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214575#comment-17214575
 ] 

Joris Van den Bossche commented on ARROW-10305:
---

[~palgal] Thanks for opening the issue. Such a substring matching filter is 
indeed not yet implemented. 

A first step to enable this would be to have a "compute kernel" for substrings 
(from the overview at 
https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst, I 
don't think we currently have functionality to create such substrings). 
A related compute kernel is {{match_substring}}, with which you could check 
that (using your example) "a" is present in the string. But that doesn't easily 
guarantee anything about the position of the substring within the string 
(although you could achieve this with a regular expression pattern).
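For illustration, an anchored regular expression can pin a substring to a 
position in a way plain substring matching cannot (a plain-Python sketch; 
{{starts_with}} is a hypothetical helper, not an Arrow kernel):

```python
import re


def starts_with(values, prefix):
    """Return a boolean mask: does each string start with `prefix`?

    The `^` anchor restricts the match to the start of the string,
    which a plain substring match cannot express.
    """
    pattern = re.compile("^" + re.escape(prefix))
    return [pattern.search(v) is not None for v in values]


starts_with(["a", "a2", "ba"], "a")   # [True, True, False]
```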

Then, a second step would be to be able to "express" such a compute kernel in 
an Expression that can be used to filter the dataset (although this might not 
be needed for the dplyr syntax? It could maybe also be done with an actual 
compute filter kernel? cc [~npr]?). 

> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported when filtering a dataset (after open_dataset()). Specifically, 
> the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, for filtering 
> and collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.





[jira] [Assigned] (ARROW-9128) [C++] Implement string space trimming kernels: trim, ltrim, and rtrim

2020-10-15 Thread Maarten Breddels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maarten Breddels reassigned ARROW-9128:
---

Assignee: Maarten Breddels

> [C++] Implement string space trimming kernels: trim, ltrim, and rtrim
> -
>
> Key: ARROW-9128
> URL: https://issues.apache.org/jira/browse/ARROW-9128
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Maarten Breddels
>Priority: Major
> Fix For: 3.0.0
>
>






[jira] [Created] (ARROW-10310) [C++][Gandiva] Add single argument round() in Gandiva

2020-10-15 Thread Sagnik Chakraborty (Jira)
Sagnik Chakraborty created ARROW-10310:
--

 Summary: [C++][Gandiva] Add single argument round() in Gandiva
 Key: ARROW-10310
 URL: https://issues.apache.org/jira/browse/ARROW-10310
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Sagnik Chakraborty
Assignee: Sagnik Chakraborty








[jira] [Updated] (ARROW-10305) [C++][R] Filter datasets with string expressions

2020-10-15 Thread Pal (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pal updated ARROW-10305:

Description: 
Hi,

Some expressions, such as substr(), grepl(), str_detect() or others, are not 
supported when filtering a dataset (after open_dataset()). Specifically, the 
code below:
{code:java}
library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
gives this error:
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, for filtering and 
collecting a very large dataset. Is there anything that can be done to 
implement this new feature?

Thank you.

  was:
Hi,

Some expressions, such as substr(), grepl(), str_detect() or others, are not 
supported while filtering after open_datatset(). Specifically, the code below :
{code:java}
library(dplyr)
library(arrow)
data = data.frame(a = c("a", "a2", "a3"))
write_parquet(data, "Test_filter/data.parquet")
ds <- open_dataset("Test_filter/")
data_flt <- ds %>% 
 filter(substr(a, 1, 1) == "a")
{code}
gives this error :
{code:java}
Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
"a"
 Call collect() first to pull data into R.{code}
These expressions may be very helpful, not to say necessary, to filter and 
collect a very large dataset. Is there anything it can be done to implement 
this new feature ?

Thank you.


> [C++][R] Filter datasets with string expressions
> 
>
> Key: ARROW-10305
> URL: https://issues.apache.org/jira/browse/ARROW-10305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Pal
>Priority: Major
>
> Hi,
> Some expressions, such as substr(), grepl(), str_detect() or others, are not 
> supported when filtering a dataset (after open_dataset()). Specifically, 
> the code below:
> {code:java}
> library(dplyr)
> library(arrow)
> data = data.frame(a = c("a", "a2", "a3"))
> write_parquet(data, "Test_filter/data.parquet")
> ds <- open_dataset("Test_filter/")
> data_flt <- ds %>% 
>  filter(substr(a, 1, 1) == "a")
> {code}
> gives this error:
> {code:java}
> Error: Filter expression not supported for Arrow Datasets: substr(a, 1, 1) == 
> "a"
>  Call collect() first to pull data into R.{code}
> These expressions may be very helpful, not to say necessary, for filtering 
> and collecting a very large dataset. Is there anything that can be done to 
> implement this new feature?
> Thank you.


