[jira] [Updated] (ARROW-6388) [C++] Consider implementing BufferOutputStream using BufferBuilder internally

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6388:

Fix Version/s: (was: 1.0.0)

> [C++] Consider implementing BufferOutputStream using BufferBuilder internally
> 
>
> Key: ARROW-6388
> URL: https://issues.apache.org/jira/browse/ARROW-6388
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion in ARROW-6381 https://github.com/apache/arrow/pull/5222
> We should be careful that this doesn't introduce any performance regression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8970) [C++] Reduce shared library / binary code size (umbrella issue)

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8970:

Summary: [C++] Reduce shared library / binary code size (umbrella issue)  
(was: [C++] Reduce shared library code size (umbrella issue))

> [C++] Reduce shared library / binary code size (umbrella issue)
> ---
>
> Key: ARROW-8970
> URL: https://issues.apache.org/jira/browse/ARROW-8970
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We're reaching a point where we may need to be careful about decisions that 
> increase code size:
> * Instantiating too many templates for code that isn't performance sensitive, 
> or where some templates may do the same thing (e.g. Int32Type kernels may do 
> the same thing as a Date32Type kernel)
> * Inlining functions that don't need to be inline
> Code size tends to correlate also with compilation times, but not always.
> I'll use this umbrella issue to organize issues related to reducing compiled 
> code size
> At this moment (2020-05-27), here are the 25 largest object files in a -O2 
> build
> {code}
> 524896    src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
> 531920    src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o
> 552000    src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o
> 575920    src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o
> 595112    src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o
> 645728    src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o
> 683040    src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o
> 702232    src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o
> 729912    src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o
> 752776    src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o
> 752776    src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o
> 877680    src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
> 885624    src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
> 919072    src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
> 941776    src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o
> 1055248   src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o
> 1233304   src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o
> 1265160   src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o
> 1343480   src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o
> 1346928   src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o
> 1502568   src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o
> 1609760   src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o
> 1794416   src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o
> 2759552   src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o
> 7609432   src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o
> {code}
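
A listing like the one above can be regenerated with a short script. The following is a minimal sketch, not the command that produced the original numbers: the function name largest_object_files and its arguments are assumptions made for illustration, and it reports on-disk file sizes, which may differ from what tools that inspect object sections report.

```python
import os


def largest_object_files(build_dir, limit=25):
    """Return (size_in_bytes, path) pairs for the largest .o files, ascending.

    Hypothetical helper: walks build_dir recursively and uses the on-disk
    file size as the measure of object size.
    """
    if limit <= 0:
        return []
    sizes = []
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            if name.endswith(".o"):
                path = os.path.join(root, name)
                sizes.append((os.path.getsize(path), path))
    # Sort ascending and keep the tail, so the biggest file comes last,
    # matching the ordering of the listing in the issue.
    return sorted(sizes)[-limit:]
```

Calling largest_object_files on a build directory and printing each pair gives the same "size, path" layout, which makes it easy to compare before and after a refactoring.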





[jira] [Assigned] (ARROW-8901) [C++] Reduce number of take kernels

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8901:
---

Assignee: Wes McKinney

> [C++] Reduce number of take kernels
> ---
>
> Key: ARROW-8901
> URL: https://issues.apache.org/jira/browse/ARROW-8901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> After ARROW-8792 we can observe that we are generating 312 take kernels
> {code}
> In [1]: import pyarrow.compute as pc  
> 
> In [2]: reg = pc.function_registry()  
> 
> In [3]: reg.get_function('take')  
> 
> Out[3]: 
> arrow.compute.Function
> kind: vector
> num_kernels: 312
> {code}
> You can see them all here: 
> https://gist.github.com/wesm/c3085bf40fa2ee5e555204f8c65b4ad5
> It's probably going to be sufficient to only support int16, int32, and int64 
> index types for almost all types and insert implicit casts (once we implement 
> implicit-cast-insertion into the execution code) for other index types. If we 
> determine that there is some performance hot path where we need to specialize 
> for other index types, then we can always do that.
> Additionally, we should be able to collapse the date/time kernels since we're 
> just moving memory.
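
The index-type reduction described above can be sketched as a promotion table. This is only an illustration of the idea: SUPPORTED_INDEX_TYPES, promote_index_type, and the promotion rule (widen to the smallest supported signed type that holds every value of the source type) are assumptions, not Arrow's dispatch code.

```python
# Inclusive value ranges of the integer types, keyed by type name.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "int16": (-2**15, 2**15 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
    "uint8": (0, 2**8 - 1),
    "uint16": (0, 2**16 - 1),
    "uint32": (0, 2**32 - 1),
    "uint64": (0, 2**64 - 1),
}

# Index types that would keep dedicated take kernels (ordered narrow to wide).
SUPPORTED_INDEX_TYPES = ("int16", "int32", "int64")


def promote_index_type(index_type):
    """Pick the narrowest supported index type that can hold index_type.

    Hypothetical rule: an implicit cast is inserted only when it is lossless.
    """
    lo, hi = INT_RANGES[index_type]
    for candidate in SUPPORTED_INDEX_TYPES:
        c_lo, c_hi = INT_RANGES[candidate]
        if c_lo <= lo and hi <= c_hi:
            return candidate
    raise TypeError(f"no lossless implicit cast for index type {index_type}")
```

Under this rule uint64 has no lossless target, so it would need either a checked cast or a specialized kernel, which is the kind of case the issue leaves open.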





[jira] [Comment Edited] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123933#comment-17123933
 ] 

Wes McKinney edited comment on ARROW-4333 at 6/2/20, 3:30 PM:
--

I'd suggest we resolve this and pursue answers to some of the unanswered 
questions as more specific followups. In particular, I plan to be building 
multi-kernel expression evaluation in the near future so some of the 
pipelining/memory reuse questions must be addressed as a part of this


was (Author: wesmckinn):
I'd suggest we close this and pursue answers to some of the unanswered 
questions as more specific followups. In particular, I plan to be building 
multi-kernel expression evaluation in the near future so some of the 
pipelining/memory reuse questions must be addressed as a part of this

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?





[jira] [Commented] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17123933#comment-17123933
 ] 

Wes McKinney commented on ARROW-4333:
-

I'd suggest we close this and pursue answers to some of the unanswered 
questions as more specific followups. In particular, I plan to be building 
multi-kernel expression evaluation in the near future so some of the 
pipelining/memory reuse questions must be addressed as a part of this

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?





[jira] [Resolved] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8966.
-
Resolution: Fixed

This was resolved in 
https://github.com/apache/arrow/commit/94a5026edb652d060110cac170380edf3d856f05

> [C++] Move arrow::ArrayData to a separate header file
> -
>
> Key: ARROW-8966
> URL: https://issues.apache.org/jira/browse/ARROW-8966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There are code modules (such as compute kernels) that only require ArrayData 
> for doing computations, so pulling in all the code in array.h is not 
> necessary. There are probably other code paths that might benefit from this 
> also. 





[jira] [Resolved] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6052.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7310
[https://github.com/apache/arrow/pull/7310]

> [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to 
> builder files
> 
>
> Key: ARROW-6052
> URL: https://issues.apache.org/jira/browse/ARROW-6052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Since these files are getting larger, this would improve codebase 
> navigability. Probably should use the same naming scheme as builder_* e.g. 
> {{arrow/array/array_dict.h}}
> I recommend also putting the unit test files related to these in there for 
> better semantic organization. 





[jira] [Updated] (ARROW-9008) [C++] jemalloc_set_decay_ms precedence

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-9008:

Summary: [C++] jemalloc_set_decay_ms precedence  (was: 
jemalloc_set_decay_ms precedence)

> [C++] jemalloc_set_decay_ms precedence
> --
>
> Key: ARROW-9008
> URL: https://issues.apache.org/jira/browse/ARROW-9008
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Remi Dettai
>Priority: Major
>  Labels: jemalloc
>
> I've noticed that the jemalloc const configuration [je_arrow_malloc_conf 
> |https://github.com/apache/arrow/blob/d542482bdc6bea8a449f000bdd74de8990c20015/cpp/src/arrow/memory_pool.h#L169]
>  overrides the arrow public function 
> [jemalloc_set_decay_ms()|https://github.com/apache/arrow/blob/e4bf4297585e1d0723957833d012aaf5c119f6b0/cpp/src/arrow/memory_pool.cc#L69].
>  
> Is there a way to call jemalloc_set_decay_ms so that it has the right 
> precedence? 
> -> if yes, I believe this should be specified in the comments
> -> if no, the function should be deprecated





[jira] [Closed] (ARROW-4530) [C++] Review Aggregate kernel state allocation/ownership semantics

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4530.
---
Resolution: Later

> [C++] Review Aggregate kernel state allocation/ownership semantics
> --
>
> Key: ARROW-4530
> URL: https://issues.apache.org/jira/browse/ARROW-4530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>






[jira] [Closed] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8955.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Duplicate

duplicate of ARROW-9006

> [C++] Use kernels for casting Scalar values instead of bespoke implementation
> -
>
> Key: ARROW-8955
> URL: https://issues.apache.org/jira/browse/ARROW-8955
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See details of casting in arrow/scalar.cc





[jira] [Resolved] (ARROW-8844) [C++] Optimize TransferBitmap unaligned case

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8844.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7300
[https://github.com/apache/arrow/pull/7300]

> [C++] Optimize TransferBitmap unaligned case
> 
>
> Key: ARROW-8844
> URL: https://issues.apache.org/jira/browse/ARROW-8844
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> The unaligned case of TransferBitmap (CopyBitmap, InvertBitmap) is processed 
> bit-by-bit [1]. A trick similar to the one in PR [2] may also help here, 
> improving performance by processing whole words at a time.
> [1] 
> https://github.com/apache/arrow/blob/e5a33f1220705aec6a224b55d2a6f47fbd957603/cpp/src/arrow/util/bit_util.cc#L121-L134
> [2] https://github.com/apache/arrow/pull/7135
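
To make the difference between the two strategies concrete, here is a small sketch in Python. It is not the Arrow implementation and not the code from the referenced PR: the function names are made up, bits are numbered least-significant-bit first within each byte (as in the Arrow format), and "word" means a 64-bit chunk assembled from little-endian bytes.

```python
def copy_bitmap_bitwise(src, offset, length):
    """Reference version: copy `length` bits starting at bit `offset`, one bit at a time."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        j = offset + i
        if (src[j // 8] >> (j % 8)) & 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def copy_bitmap_wordwise(src, offset, length):
    """Same result, but moving up to 64 bits per loop iteration."""
    out = bytearray((length + 7) // 8)
    bit_shift = offset % 8   # misalignment of the source within its first byte
    byte_pos = offset // 8   # source byte that holds the next bit to copy
    done = 0                 # number of bits already written
    while done < length:
        n = min(64, length - done)
        # Load just enough source bytes to cover n bits after the shift.
        nbytes = (bit_shift + n + 7) // 8
        chunk = int.from_bytes(src[byte_pos:byte_pos + nbytes], "little")
        # Shift away the misalignment and mask off bits past the chunk.
        word = (chunk >> bit_shift) & ((1 << n) - 1)
        start = done // 8
        out[start:start + (n + 7) // 8] = word.to_bytes((n + 7) // 8, "little")
        done += n
        byte_pos += 8
    return bytes(out)
```

The word-wise loop runs roughly length / 64 times instead of length times; an inverting variant would only need to complement word before masking.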





[jira] [Created] (ARROW-9006) [C++] Use Cast kernels to implement Scalar::Parse and Scalar::CastTo

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9006:
---

 Summary: [C++] Use Cast kernels to implement Scalar::Parse and 
Scalar::CastTo
 Key: ARROW-9006
 URL: https://issues.apache.org/jira/browse/ARROW-9006
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We should not maintain distinct (and possibly differently behaving) 
implementations of elementwise array casting and scalar casting. The new 
kernels framework provides for relatively easily generating kernels that can 
process arrays or scalars. 
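
One way to picture a single implementation serving both cases is to route the scalar through the array code path. The sketch below is a toy model in plain Python, with hypothetical names (cast_values, cast_scalar) and with None standing in for null; it says nothing about how the kernels framework actually represents scalars.

```python
def cast_values(values, to_type):
    """Elementwise cast "kernel": nulls (None) stay null."""
    return [None if v is None else to_type(v) for v in values]


def cast_scalar(value, to_type):
    """Scalar cast expressed through the array kernel via a length-1 input."""
    return cast_values([value], to_type)[0]
```

Because cast_scalar has no logic of its own, array and scalar casting cannot drift apart in behavior, which is the concern raised above.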





[jira] [Resolved] (ARROW-8929) [C++] Change compute::Arity::VarArgs min_args default to 0

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8929.
-
Resolution: Fixed

Issue resolved by pull request 7322
[https://github.com/apache/arrow/pull/7322]

> [C++] Change compute::Arity::VarArgs min_args default to 0
> -
>
> Key: ARROW-8929
> URL: https://issues.apache.org/jira/browse/ARROW-8929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The issue of minimum number of arguments is separate from providing an 
> {{InputType}} for input type checking. 





[jira] [Assigned] (ARROW-3520) [C++] Implement List Flatten kernel

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3520:
---

Assignee: Wes McKinney

> [C++] Implement List Flatten kernel
> ---
>
> Key: ARROW-3520
> URL: https://issues.apache.org/jira/browse/ARROW-3520
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>
> see also ARROW-45





[jira] [Created] (ARROW-9003) [C++] Add VectorFunction wrapping arrow::Concatenate

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9003:
---

 Summary: [C++] Add VectorFunction wrapping arrow::Concatenate
 Key: ARROW-9003
 URL: https://issues.apache.org/jira/browse/ARROW-9003
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This would be a varargs function 





[jira] [Assigned] (ARROW-8929) [C++] Change compute::Arity::VarArgs min_args default to 0

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8929:
---

Assignee: Wes McKinney

> [C++] Change compute::Arity::VarArgs min_args default to 0
> -
>
> Key: ARROW-8929
> URL: https://issues.apache.org/jira/browse/ARROW-8929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The issue of minimum number of arguments is separate from providing an 
> {{InputType}} for input type checking. 





[jira] [Assigned] (ARROW-8896) [C++] Reimplement dictionary unpacking in Cast kernels using Take

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8896:
---

Assignee: Wes McKinney

> [C++] Reimplement dictionary unpacking in Cast kernels using Take
> -
>
> Key: ARROW-8896
> URL: https://issues.apache.org/jira/browse/ARROW-8896
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As suggested by [~apitrou] this should yield less code to maintain
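
The reason Take can replace dedicated unpacking code is that decoding a dictionary-encoded array is a gather of dictionary values by index. A toy sketch in plain Python, with hypothetical function names and None standing in for null:

```python
def take(values, indices):
    """Gather values[i] for each index; a null (None) index yields a null."""
    return [None if i is None else values[i] for i in indices]


def unpack_dictionary(dictionary, indices):
    """Decode dictionary-encoded data: this is exactly a take on the dictionary."""
    return take(dictionary, indices)
```

For example, a dictionary ["a", "b", "c"] with indices [2, 0, None, 1] decodes to ["c", "a", None, "b"], with no type-specific unpacking logic needed.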





[jira] [Created] (ARROW-9001) [R] Box outputs as correct type in call_function

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-9001:
---

 Summary: [R] Box outputs as correct type in call_function
 Key: ARROW-9001
 URL: https://issues.apache.org/jira/browse/ARROW-9001
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This would prevent segfaults caused by putting the SEXP in the wrong kind of 
R6 container





[jira] [Assigned] (ARROW-7009) [C++] Refactor filter/take kernels to use Datum instead of overloads

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7009:
---

Assignee: Wes McKinney  (was: Ben Kietzman)

> [C++] Refactor filter/take kernels to use Datum instead of overloads
> 
>
> Key: ARROW-7009
> URL: https://issues.apache.org/jira/browse/ARROW-7009
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> Followup to ARROW-6784. See discussion on 
> https://github.com/apache/arrow/pull/5686, as well as ARROW-6959.





[jira] [Assigned] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8917:
---

Assignee: Wes McKinney

> [C++][Compute] Formalize "metafunction" concept
> ---
>
> Key: ARROW-8917
> URL: https://issues.apache.org/jira/browse/ARROW-8917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A metafunction is a function that provides the {{Execute}} API but does not 
> contain any kernels. Such functions can also handle non-Array/Scalar inputs 
> like RecordBatch or Table. 
> This will enable bindings to invoke such functions (like take, filter) like
> {code}
> call_function('take', [table, indices])
> {code}
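
The concept can be modeled in a few lines of plain Python. Everything in this sketch is hypothetical (the MetaFunction class, the REGISTRY dict, and a table modeled as a dict of column lists); it only illustrates a function that offers an execute entry point, owns no kernels, and dispatches on the kind of input.

```python
class MetaFunction:
    """Hypothetical model: exposes execute() but contains no kernels."""

    num_kernels = 0

    def __init__(self, name, impl):
        self.name = name
        self._impl = impl

    def execute(self, args):
        return self._impl(*args)


def take_array(values, indices):
    """Stand-in for the kernel-backed array implementation."""
    return [values[i] for i in indices]


def take_meta(data, indices):
    # A "table" is modeled as a dict of column name -> list of values;
    # the metafunction applies the array implementation column by column.
    if isinstance(data, dict):
        return {name: take_array(col, indices) for name, col in data.items()}
    return take_array(data, indices)


REGISTRY = {"take": MetaFunction("take", take_meta)}


def call_function(name, args):
    """Look the function up by name and run it, as a binding would."""
    return REGISTRY[name].execute(args)
```

With this model, call_function('take', [table, indices]) works for both a table and a plain array, and the caller never needs to know which implementation ran.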





[jira] [Updated] (ARROW-8992) [CI][C++] march not passing correctly for docker-compose run

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8992:

Summary: [CI][C++] march not passing correctly for docker-compose run  
(was: march not passing correctly for docker-compose run)

> [CI][C++] march not passing correctly for docker-compose run
> 
>
> Key: ARROW-8992
> URL: https://issues.apache.org/jira/browse/ARROW-8992
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Affects Versions: 0.17.0, 0.17.1
> Environment: Mendel Linux 4.0
>Reporter: Elliott Kipp
>Priority: Critical
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> [https://github.com/apache/arrow/issues/7307]
> Building on the new ASUS Tinker Edge T, running Mendel Linux 4.0 (Day). 
> docker-compose build commands work fine with no errors:
>  DEBIAN=10 ARCH=arm64v8 docker-compose build debian-cpp && DEBIAN=10 
> ARCH=arm64v8 docker-compose build debian-python
> DEBIAN=10 ARCH=arm64v8 docker-compose run debian-python - fails with the 
> following:
> -- Running cmake for pyarrow
>  cmake -DPYTHON_EXECUTABLE=/usr/local/bin/python -G Ninja 
> -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_FLIGHT=on -DPYARROW_BUILD_GANDIVA=on 
> -DPYARROW_BUILD_DATASET=on -DPYARROW_BUILD_ORC=on -DPYARROW_BUILD_PARQUET=on 
> -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off 
> -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off 
> -DPYARROW_BUNDLE_BOOST=off -DPYARROW_GENERATE_COVERAGE=off 
> -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on 
> -DCMAKE_BUILD_TYPE=debug /arrow/python
>  -- The C compiler identification is GNU 8.3.0
>  -- The CXX compiler identification is GNU 8.3.0
>  -- Check for working C compiler: /usr/lib/ccache/gcc
>  -- Check for working C compiler: /usr/lib/ccache/gcc -- works
>  -- Detecting C compiler ABI info
>  -- Detecting C compiler ABI info - done
>  -- Detecting C compile features
>  -- Detecting C compile features - done
>  -- Check for working CXX compiler: /usr/lib/ccache/g++
>  -- Check for working CXX compiler: /usr/lib/ccache/g++ -- works
>  -- Detecting CXX compiler ABI info
>  -- Detecting CXX compiler ABI info - done
>  -- Detecting CXX compile features
>  -- Detecting CXX compile features - done
>  -- System processor: aarch64
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH
>  -- Performing Test CXX_SUPPORTS_ARMV8_ARCH - Failed
>  -- Arrow build warning level: PRODUCTION
>  CMake Error at cmake_modules/SetupCxxFlags.cmake:338 (message):
>  Unsupported arch flag: -march=.
>  Call Stack (most recent call first):
>  CMakeLists.txt:100 (include)
> -- Configuring incomplete, errors occurred!
>  See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeOutput.log".
>  See also "/build/python/temp.linux-aarch64-3.7/CMakeFiles/CMakeError.log".
>  error: command 'cmake' failed with exit status 1
> Tried the tarball release for both 0.17.0 and 0.17.1, same result. Also tried 
> compiling manually (following these instructions: 
> [https://dzone.com/articles/building-pyarrow-with-cuda-support]) with the 
> same result.
> Only modifications I made to source are editing the docker-compose volumes, 
> as described here: [https://github.com/apache/arrow/pull/6907]
> Jira opened, per request at: [https://github.com/apache/arrow/issues/7307]





[jira] [Commented] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable

2020-06-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121145#comment-17121145
 ] 

Wes McKinney commented on ARROW-8995:
-

Since there is already {{Scalar::ToString}}, let's use that. It does not seem 
justified to have more than one way to format scalar values. 

> [C++] Scalar formatting code used in array/diff.cc should be reusable
> -
>
> Key: ARROW-8995
> URL: https://issues.apache.org/jira/browse/ARROW-8995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Formatting Array values as strings is not specific to the diff.cc code, so it 
> may make sense to move this code elsewhere where it can be used generally 
> (perhaps a method like {{Array::FormatValue}}?). 





[jira] [Resolved] (ARROW-5854) [Python] Expose compare kernels on Array class

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5854.
-
Resolution: Fixed

Issue resolved by pull request 7273
[https://github.com/apache/arrow/pull/7273]

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.
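
What the dunder methods would look like can be sketched with a toy class. This is not pyarrow's Array: the class below is a stand-in backed by a Python list, None stands in for null, and the null-propagation rule (a comparison with a null yields a null) is an assumption of the sketch.

```python
import operator


class Array:
    """Toy stand-in for an Arrow array, backed by a Python list."""

    def __init__(self, values):
        self.values = list(values)

    def _compare(self, other, op):
        # Accept either another Array (elementwise) or a scalar (broadcast).
        if isinstance(other, Array):
            if len(other.values) != len(self.values):
                raise ValueError("arrays must have the same length")
            pairs = zip(self.values, other.values)
        else:
            pairs = ((v, other) for v in self.values)
        # Any comparison involving a null produces a null.
        return Array(None if a is None or b is None else op(a, b) for a, b in pairs)

    def __eq__(self, other):
        return self._compare(other, operator.eq)

    def __ne__(self, other):
        return self._compare(other, operator.ne)

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __gt__(self, other):
        return self._compare(other, operator.gt)
```

One consequence worth noting: once __eq__ returns an elementwise result instead of a bool, instances are no longer hashable by default and `a == b` can no longer be used directly in an `if`.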





[jira] [Created] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8999:
---

 Summary: [Python][C++] Non-deterministic segfault in "AMD64 MacOS 
10.15 Python 3.7" build
 Key: ARROW-8999
 URL: https://issues.apache.org/jira/browse/ARROW-8999
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


I've been seeing this segfault periodically over the last week. Does anyone 
have an idea what might be wrong?

https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862





[jira] [Created] (ARROW-8998) [Python] Make NumPy an optional runtime dependency

2020-06-01 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8998:
---

 Summary: [Python] Make NumPy an optional runtime dependency
 Key: ARROW-8998
 URL: https://issues.apache.org/jira/browse/ARROW-8998
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


Since in the relatively near future, one will be able to do non-trivial 
analytical operations and query processing natively on Arrow data structures 
through pyarrow, it does not make sense to require users to always install 
NumPy when they install pyarrow. I propose to split the NumPy-depending parts 
of libarrow_python into a libarrow_numpy (which also must be bundled) and 
move this part of the codebase into a separate Cython module.

This refactoring should be relatively painless though there may be a number of 
packaging details to chase up since this would introduce a new shared library 
to be installed in various packaging targets. 





[jira] [Updated] (ARROW-8998) [Python] Make NumPy an optional runtime dependency

2020-06-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8998:

Description: 
Since in the relatively near future, one will be able to do non-trivial 
analytical operations and query processing natively on Arrow data structures 
through pyarrow, it does not make sense to require users to always install 
NumPy when they install pyarrow. I propose to split the NumPy-depending parts 
of libarrow_python into a libarrow_numpy (which also must be bundled) and 
move this part of the codebase into a separate Cython module.

This refactoring should be relatively painless though there may be a number of 
packaging details to chase up since this would introduce a new shared library 
to be installed in various packaging targets. 

  was:
Since in the relatively near future, one will be able to do non-trivial 
analytical operations and query processing natively on Arrow data structures 
through pyarrow, it does not make sense to require users to always install 
NumPy when that install pyarrow. I propose to split the NumPy-depending parts 
of libarrow_python into a libarrow_numpy (which also must be bundled) and 
moving this part of the codebase into a separate Cython module.

This refactoring should be relatively painless though there may be a number of 
packaging details to chase up since this would introduce a new shared library 
to be installed in various packaging targets. 


> [Python] Make NumPy an optional runtime dependency
> --
>
> Key: ARROW-8998
> URL: https://issues.apache.org/jira/browse/ARROW-8998
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Since in the relatively near future, one will be able to do non-trivial 
> analytical operations and query processing natively on Arrow data structures 
> through pyarrow, it does not make sense to require users to always install 
> NumPy when they install pyarrow. I propose to split the NumPy-depending parts 
> of libarrow_python into a libarrow_numpy (which also must be bundled) and 
> move this part of the codebase into a separate Cython module.
> This refactoring should be relatively painless though there may be a number 
> of packaging details to chase up since this would introduce a new shared 
> library to be installed in various packaging targets. 





[jira] [Assigned] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8937:
---

Assignee: Wes McKinney

> [C++] Add "parse_strptime" function for string to timestamp conversions using 
> the kernels framework
> ---
>
> Key: ARROW-8937
> URL: https://issues.apache.org/jira/browse/ARROW-8937
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> This should be relatively straightforward to implement using the new kernels 
> framework





[jira] [Assigned] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7784:
---

Assignee: Wes McKinney

> [C++] diff.cc is extremely slow to compile
> --
>
> Key: ARROW-7784
> URL: https://issues.apache.org/jira/browse/ARROW-7784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 1.0.0
>
>
> This comes up especially when doing an optimized build. {{diff.cc}} is always 
> enabled even if all components are disabled, and it takes multiple seconds to 
> compile. 





[jira] [Created] (ARROW-8995) [C++] Scalar formatting code used in array/diff.cc should be reusable

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8995:
---

 Summary: [C++] Scalar formatting code used in array/diff.cc should 
be reusable
 Key: ARROW-8995
 URL: https://issues.apache.org/jira/browse/ARROW-8995
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Formatting Array values as strings is not specific to the diff.cc code, so it 
may make sense to move this code elsewhere where it can be used generally 
(perhaps a method like {{Array::FormatValue}}?). 





[jira] [Updated] (ARROW-8993) [Rust] Support gzipped json files

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8993:

Summary: [Rust] Support gzipped json files  (was: Support gzipped json 
files)

> [Rust] Support gzipped json files
> -
>
> Key: ARROW-8993
> URL: https://issues.apache.org/jira/browse/ARROW-8993
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Mohamed Zenadi
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> It would be interesting to be able to read already compressed json files. 
> This is regularly used, with many storing their files as json.gz (we do 
> the same).





[jira] [Assigned] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8994:
---

Assignee: Wes McKinney

> [C++] Disable include-what-you-use cpplint lint checks
> --
>
> Key: ARROW-8994
> URL: https://issues.apache.org/jira/browse/ARROW-8994
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> If we want to be serious about IWYU, it would be better to use IWYU directly. 
> The minimal IWYU checks that cpplint does can be a nuisance rather than 
> addressing the problem holistically.





[jira] [Assigned] (ARROW-6052) [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to builder files

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6052:
---

Assignee: Wes McKinney

> [C++] Divide up arrow/array.h,cc into files in arrow/array/ similar to 
> builder files
> 
>
> Key: ARROW-6052
> URL: https://issues.apache.org/jira/browse/ARROW-6052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> Since these files are getting larger, this would improve codebase 
> navigability. Probably should use the same naming scheme as builder_* e.g. 
> {{arrow/array/array_dict.h}}
> I recommend also putting the unit test files related to these in there for 
> better semantic organization. 





[jira] [Created] (ARROW-8994) [C++] Disable include-what-you-use cpplint lint checks

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8994:
---

 Summary: [C++] Disable include-what-you-use cpplint lint checks
 Key: ARROW-8994
 URL: https://issues.apache.org/jira/browse/ARROW-8994
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


If we want to be serious about IWYU, it would be better to use IWYU directly. 
The minimal IWYU checks that cpplint does can be a nuisance rather than 
addressing the problem holistically.





[jira] [Created] (ARROW-8991) [C++][Compute] Add scalar_hash function

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8991:
---

 Summary: [C++][Compute] Add scalar_hash function
 Key: ARROW-8991
 URL: https://issues.apache.org/jira/browse/ARROW-8991
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The purpose of this function is to compute 32- or 64-bit hash values for each 
cell in an Array. Hashes for nested types can be computed recursively by 
combining the hash values of their children
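
As a rough illustration of the recursive combination described above, here is a minimal Python sketch. The names `scalar_hash`, `hash_cell` and `combine`, the choice of FNV-1a with a boost-style mixing step, and the fixed hash for nulls are all assumptions made for this example; none of them come from the issue or from the Arrow C++ code.

```python
import struct

MASK64 = (1 << 64) - 1
NULL_HASH = 0  # assumption: every null cell hashes to a fixed sentinel


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a over raw bytes."""
    h = 0xCBF29CE484222325  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & MASK64  # FNV prime, truncated to 64 bits
    return h


def combine(seed: int, value: int) -> int:
    """Mix a child hash into a running seed (order-sensitive)."""
    seed ^= (value + 0x9E3779B97F4A7C15 + ((seed << 6) & MASK64) + (seed >> 2)) & MASK64
    return seed & MASK64


def hash_cell(cell) -> int:
    """Hash one cell; nested lists are hashed by combining child hashes."""
    if cell is None:
        return NULL_HASH
    if isinstance(cell, int):
        # Treated as a signed 64-bit value; larger ints raise struct.error.
        return fnv1a_64(struct.pack("<q", cell))
    if isinstance(cell, str):
        return fnv1a_64(cell.encode("utf-8"))
    if isinstance(cell, (list, tuple)):
        seed = len(cell)
        for child in cell:
            seed = combine(seed, hash_cell(child))
        return seed
    raise TypeError(f"unsupported cell type: {type(cell).__name__}")


def scalar_hash(values, bits: int = 64) -> list:
    """Return one 32- or 64-bit hash per cell of `values`."""
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    mask = (1 << bits) - 1
    return [hash_cell(v) & mask for v in values]
```

Calling `scalar_hash([1, None, [1, 2]])` returns a list of three integers, one per cell. Truncating the 64-bit value to get the 32-bit variant is just one simple option.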





[jira] [Updated] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8990:

Description: 
While we have our own hash table implementation, it would be worthwhile to set 
up some benchmarks so that we can compare against std::unordered_map and some 
other thirdparty libraries for hash tables to know whether we should possibly 
use a thirdparty library. See e.g.

https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html

Libraries to consider: 

* https://github.com/sparsehash/sparsehash

  was:
While we have our own hash table implementation, it would be worthwhile to set 
up some benchmarks so that we can compare against std::unordered_map and some 
other thirdparty libraries for hash tables to know whether we should possibly 
use a thirdparty library. See e.g.

https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html


> [C++] Benchmark hash table against thirdparty options, possibly vendor a 
> thirdparty hash table library
> --
>
> Key: ARROW-8990
> URL: https://issues.apache.org/jira/browse/ARROW-8990
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> While we have our own hash table implementation, it would be worthwhile to 
> set up some benchmarks so that we can compare against std::unordered_map and 
> some other thirdparty libraries for hash tables to know whether we should 
> possibly use a thirdparty library. See e.g.
> https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
> Libraries to consider: 
> * https://github.com/sparsehash/sparsehash





[jira] [Created] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8990:
---

 Summary: [C++] Benchmark hash table against thirdparty options, 
possibly vendor a thirdparty hash table library
 Key: ARROW-8990
 URL: https://issues.apache.org/jira/browse/ARROW-8990
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


While we have our own hash table implementation, it would be worthwhile to set 
up some benchmarks so that we can compare against std::unordered_map and some 
other thirdparty libraries for hash tables to know whether we should possibly 
use a thirdparty library. See e.g.

https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html





[jira] [Updated] (ARROW-8989) [C++] Document available functions in compute::FunctionRegistry

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8989:

Summary: [C++] Document available functions in compute::FunctionRegistry  
(was: [C++] Document available functions in FunctionRegistry)

> [C++] Document available functions in compute::FunctionRegistry
> ---
>
> Key: ARROW-8989
> URL: https://issues.apache.org/jira/browse/ARROW-8989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Create a compute page in the C++ section of the Sphinx docs and make a list 
> of the available functions and what they do





[jira] [Created] (ARROW-8989) [C++] Document available functions in FunctionRegistry

2020-05-31 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8989:
---

 Summary: [C++] Document available functions in FunctionRegistry
 Key: ARROW-8989
 URL: https://issues.apache.org/jira/browse/ARROW-8989
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Create a compute page in the C++ section of the Sphinx docs and make a list of 
the available functions and what they do





[jira] [Resolved] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8922.
-
Resolution: Fixed

Issue resolved by pull request 7278
[https://github.com/apache/arrow/pull/7278]

> [C++] Implement example string scalar kernel function to assist with string 
> kernels buildout per ARROW-555
> --
>
> Key: ARROW-8922
> URL: https://issues.apache.org/jira/browse/ARROW-8922
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I will write a patch to provide an example of creating a string-input 
> string-output kernel for executing scalar-valued string functions





[jira] [Updated] (ARROW-8988) [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work with libdfs jni

2020-05-31 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8988:

Summary: [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs 
don`t work with libdfs jni  (was: Help! After upgrade pyarrow from 0.15 to 
0.17.1 connect to hdfs don`t work with libdfs jni)

> [Python] After upgrade pyarrow from 0.15 to 0.17.1 connect to hdfs don`t work 
> with libdfs jni
> -
>
> Key: ARROW-8988
> URL: https://issues.apache.org/jira/browse/ARROW-8988
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Pavel Dourugyan
>Priority: Major
>  Labels: beginners, hdfs, hortonworks, libhdfs, pyarrow, python3
> Attachments: 1.txt, 2.txt
>
>
> h2. Problem
> After upgrading pyarrow from 0.15 to 0.17, I ran into some trouble. I 
> understand that libhdfs3 is no longer supported. However, in my case, libhdfs 
> does not work either. See below.
> My experience with the Hadoop ecosystem is limited, so I may have made some 
> mistakes. I installed Hortonworks HDP from the Ambari service on a virtual 
> machine running on my PC.
> Here is what I tried.
> 1. Just connect:
> %xmode Verbose
> import pyarrow as pa
> hdfs = pa.hdfs.connect(host='hdp.test.com', port=8020, user='hdfs')
> ---
> FileNotFoundError: [Errno 2] No such file or directory: 'hadoop': 'hadoop' 
> ([#1.txt])
> 2. To bypass the "if driver == 'libhdfs'" check:
> %xmode Verbose
> import pyarrow as pa
> hdfs = pa.HadoopFileSystem(host='hdp.test.com', port=8020, user='hdfs', 
> driver=None)
> ---
> OSError: Unable to load libjvm: /usr/java/latest//lib/server/libjvm.so: 
> cannot open shared object file: No such file or directory ([#2.txt])
> 3. With libhdfs3 it works:
> import hdfs3 
> hdfs = hdfs3.HDFileSystem(host='hdp.test.com', port=8020, user='hdfs')
> #ls remote folder
> hdfs.ls('/data/', detail=False)
> ['/data/TimeSheet.2020-04-11', '/data/test', '/data/test.json']
> h2. Environment.
> h4. +Client PC:+
> OS: Debian 10. Dev.: Anaconda3 (python 3.7.6), Jupyter Lab 2, pyarrow 0.17.1 
> (from conda-forge)
> +Hadoop+ (on VM – Oracle VirtualBox):
> OS: Oracle Linux 7.6.  Distr.: Hortonworks HDP 3.1.4
> libhdfs.so:
> [root@hdp /]# find / -name libhdfs.so
>  /usr/lib/ams-hbase/lib/hadoop-native/libhdfs.so
>  /usr/hdp/3.1.4.0-315/usr/lib/libhdfs.so
>  
>  Java path:
> [root@hdp /]# sudo alternatives --config java
>  
> ---
>  *+ 1       java-1.8.0-openjdk.x86_64 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/bin/java)
>  
> libjvm:               
> [root@hdp /]# find / -name libjvm.*
>  
> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
>  /usr/jdk64/jdk1.8.0_112/jre/lib/amd64/server/libjvm.so
>  
> I tried many settings. The latest is below:
> # /etc/profile
>  ...
> export JAVA_HOME=$(dirname $(dirname $(readlink $(readlink $(which javac)))))
> export JRE_HOME=$JAVA_HOME/jre
> export 
> JAVA_CLASSPATH=$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
> export HADOOP_HOME=/usr/hdp/3.1.4.0-315/hadoop
> export HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' 
> ' ':')
> export ARROW_LIBHDFS_DIR=/usr/lib/ams-hbase/lib/hadoop-native
> export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
> export CLASSPATH==.:$CLASSPATH:$JAVA_CLASSPATH:$HADOOP_CLASSPATH
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JRE_HOME/lib/amd64/server
>  
>  
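
One thing that may be worth checking in a report like this is whether the variables exported above are actually visible inside the Python process, and whether `libjvm.so` exists under the resulting `JAVA_HOME`. The following is a minimal diagnostic sketch, not a description of how pyarrow itself locates the JVM. The helper names `libjvm_candidates` and `find_libjvm` are hypothetical, and the two candidate layouts are simply the ones that appear in the report (`jre/lib/amd64/server` from the `find` output, `lib/server` from the error message).

```python
import os
import posixpath


def libjvm_candidates(java_home: str) -> list:
    """Paths to try, based only on the layouts seen in the report."""
    return [
        posixpath.join(java_home, "jre", "lib", "amd64", "server", "libjvm.so"),
        posixpath.join(java_home, "lib", "server", "libjvm.so"),
    ]


def find_libjvm(environ=None, exists=posixpath.exists):
    """Return the first existing libjvm.so under JAVA_HOME, or None.

    `environ` and `exists` are injectable so the logic can be checked
    without touching the real environment or file system.
    """
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return None  # JAVA_HOME is not set in this process
    for path in libjvm_candidates(java_home):
        if exists(path):
            return path
    return None
```

Running `print(find_libjvm())` in the same notebook session that calls `pa.hdfs.connect` shows whether the process sees a usable `JAVA_HOME` at all; a result of `None` would suggest the settings from `/etc/profile` did not reach it.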





[jira] [Closed] (ARROW-8916) [Python] Add relevant glue for implementing each kind of FunctionOptions

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8916.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Fixed

I think it's fine that we deal with the options on a case by case basis. Not 
that many functions will require options anyhow

> [Python] Add relevant glue for implementing each kind of FunctionOptions
> 
>
> Key: ARROW-8916
> URL: https://issues.apache.org/jira/browse/ARROW-8916
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>






[jira] [Updated] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8917:

Description: 
A metafunction is a function that provides the {{Execute}} API but does not 
contain any kernels. Such functions can also handle non-Array/Scalar inputs 
like RecordBatch or Table. 

This will enable bindings to invoke such functions (like take, filter) like

{code}
call_function('take', [table, indices])
{code}

  was:
This will enable bindings to invoke such functions (like take, filter) like

{code}
call_function('take', [table, indices])
{code}


> [C++][Compute] Formalize "metafunction" concept
> ---
>
> Key: ARROW-8917
> URL: https://issues.apache.org/jira/browse/ARROW-8917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A metafunction is a function that provides the {{Execute}} API but does not 
> contain any kernels. Such functions can also handle non-Array/Scalar inputs 
> like RecordBatch or Table. 
> This will enable bindings to invoke such functions (like take, filter) like
> {code}
> call_function('take', [table, indices])
> {code}
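
To make the idea concrete, here is a minimal Python sketch of a registry in which a metafunction provides the same execute entry point as an ordinary function but owns no kernels and instead delegates per column. Everything in it is an assumption for illustration: the class names, the `array_take` name for the kernel-backed function, and the use of plain lists and a tiny `Table` class as stand-ins for Arrow data. It is not the design adopted in the C++ library.

```python
from typing import Any, Callable, Dict, List


class Function:
    """Base class: anything that exposes execute(args)."""

    def execute(self, args: List[Any]) -> Any:
        raise NotImplementedError


class KernelFunction(Function):
    """A function backed by kernels, keyed here by the first argument's type."""

    def __init__(self, kernels: Dict[type, Callable[..., Any]]):
        self.kernels = kernels

    def execute(self, args: List[Any]) -> Any:
        kernel = self.kernels.get(type(args[0]))
        if kernel is None:
            raise TypeError(f"no kernel for {type(args[0]).__name__}")
        return kernel(*args)


class Table:
    """Stand-in for a table: named columns of equal length."""

    def __init__(self, columns: Dict[str, list]):
        self.columns = columns


class TakeMetaFunction(Function):
    """Contains no kernels; unpacks a Table and delegates per column."""

    def __init__(self, registry: Dict[str, Function]):
        self.registry = registry

    def execute(self, args: List[Any]) -> Any:
        values, indices = args
        inner = self.registry["array_take"]  # hypothetical kernel-backed function
        if isinstance(values, Table):
            return Table({name: inner.execute([column, indices])
                          for name, column in values.columns.items()})
        return inner.execute([values, indices])


REGISTRY: Dict[str, Function] = {}
REGISTRY["array_take"] = KernelFunction(
    {list: lambda values, indices: [values[i] for i in indices]}
)
REGISTRY["take"] = TakeMetaFunction(REGISTRY)


def call_function(name: str, args: List[Any]) -> Any:
    """Look up a function by name and execute it."""
    return REGISTRY[name].execute(args)
```

With this in place, `call_function('take', [table, indices])` works for a `Table` as well as for a plain list, which is the calling pattern the description asks for.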





[jira] [Updated] (ARROW-8917) [C++][Compute] Formalize "metafunction" concept

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8917:

Summary: [C++][Compute] Formalize "metafunction" concept  (was: [C++] Add 
compute::Function subclass for invoking certain kernels on 
RecordBatch/Table-valued inputs)

> [C++][Compute] Formalize "metafunction" concept
> ---
>
> Key: ARROW-8917
> URL: https://issues.apache.org/jira/browse/ARROW-8917
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This will enable bindings to invoke such functions (like take, filter) like
> {code}
> call_function('take', [table, indices])
> {code}





[jira] [Closed] (ARROW-8971) [Python] Upgrade pip version in manylinux* builds

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8971.
---
Resolution: Won't Fix

The pip in the manylinux images is not at risk of experiencing this security 
issue

> [Python] Upgrade pip version in manylinux* builds
> -
>
> Key: ARROW-8971
> URL: https://issues.apache.org/jira/browse/ARROW-8971
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: bindu
>Assignee: bindu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Could you please update pip to the latest version, 20.1?
> [https://github.com/apache/arrow/blob/2688a62f8179f20c20c06a10fcd22fe8a714ae48/python/manylinux1/scripts/requirements.txt]
> CVE-2018-20225
> An issue was discovered in pip (all versions) because it installs the version 
> with the highest version number, even if the user had intended to obtain a 
> private package from a private index. This only affects use of the 
> --extra-index-url option, and exploitation requires that the package does not 
> already exist in the public index (and thus the attacker can put the package 
> there with an arbitrary version number).
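
The behaviour described in the CVE text can be modelled with a few lines of Python. This is a deliberately simplified sketch of "highest version wins regardless of index", not pip's actual resolver code. The function names, the dictionary layout used for the indexes, and the package name `internal-tool` are all hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Parse a dotted numeric version such as '1.2.0' (no pre-releases)."""
    return tuple(int(part) for part in text.split("."))


def pick_candidate(package: str, indexes: dict) -> tuple:
    """Choose the highest version of `package` across all indexes.

    The index a candidate came from plays no part in the comparison,
    which is the crux of the issue described above.
    """
    candidates = [
        (parse_version(version), index_name)
        for index_name, packages in indexes.items()
        for version in packages.get(package, [])
    ]
    if not candidates:
        raise LookupError(f"no candidates for {package}")
    version, index_name = max(candidates)
    return ".".join(str(part) for part in version), index_name


# A private package that does not exist on the public index yet...
private_only = {"private": {"internal-tool": ["1.2.0"]}, "public": {}}
# ...until someone uploads the same name with an arbitrary high version.
squatted = {
    "private": {"internal-tool": ["1.2.0"]},
    "public": {"internal-tool": ["99.0.0"]},
}
```

In this model `pick_candidate("internal-tool", private_only)` selects the private 1.2.0 release, while the same call against `squatted` selects 99.0.0 from the public index.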





[jira] [Updated] (ARROW-8394) [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm package

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8394:

Summary: [JS] Typescript compiler errors for arrow d.ts files, when using 
es2015-esm package  (was: Typescript compiler errors for arrow d.ts files, when 
using es2015-esm package)

> [JS] Typescript compiler errors for arrow d.ts files, when using es2015-esm 
> package
> ---
>
> Key: ARROW-8394
> URL: https://issues.apache.org/jira/browse/ARROW-8394
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.16.0
>Reporter: Shyamal Shukla
>Priority: Blocker
>
> Attempting to use apache-arrow within a web application, but typescript 
> compiler throws the following errors in some of arrow's .d.ts files
> import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow";
> export class SomeClass {
> .
> .
> constructor() {
> const t = Table.from('');
> }
> *node_modules/@apache-arrow/es2015-esm/column.d.ts:14:22* - error TS2417: 
> Class static side 'typeof Column' incorrectly extends base class static side 
> 'typeof Chunked'. Types of property 'new' are incompatible.
> *node_modules/@apache-arrow/es2015-esm/ipc/reader.d.ts:238:5* - error TS2717: 
> Subsequent property declarations must have the same type. Property 'schema' 
> must be of type 'Schema', but here has type 'Schema'.
> 238 schema: Schema;
> *node_modules/@apache-arrow/es2015-esm/recordbatch.d.ts:17:18* - error 
> TS2430: Interface 'RecordBatch' incorrectly extends interface 'StructVector'. 
> The types of 'slice(...).clone' are incompatible between these types.
> the tsconfig.json file looks like
> {
>  "compilerOptions": {
>  "target":"ES6",
>  "outDir": "dist",
>  "baseUrl": "src/"
>  },
>  "exclude": ["dist"],
>  "include": ["src/*.ts"]
> }





[jira] [Updated] (ARROW-8964) [Python][Parquet] improve reading of partitioned parquet datasets whose schema changed

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8964:

Summary: [Python][Parquet] improve reading of partitioned parquet datasets 
whose schema changed  (was: Pyarrow: improve reading of partitioned parquet 
datasets whose schema changed)

> [Python][Parquet] improve reading of partitioned parquet datasets whose 
> schema changed
> --
>
> Key: ARROW-8964
> URL: https://issues.apache.org/jira/browse/ARROW-8964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.17.1
> Environment: Ubuntu 18.04, latest miniconda with python 3.7, pyarrow 
> 0.17.1
>Reporter: Ira Saktor
>Priority: Major
>
> Hi there, I'm encountering the following issue when reading from HDFS:
>  
> *My situation:*
> I have a partitioned parquet dataset in HDFS whose recent partitions contain 
> parquet files with more columns than the older ones. When I try to read data 
> using pyarrow.dataset.dataset and filter on recent data, I still get only the 
> columns that are also contained in the old parquet files. I'd like to somehow 
> merge the schemas, or use the schema of the parquet files from which data ends 
> up being loaded.
> *When using:*
> `pyarrow.dataset.dataset(path_to_hdfs_directory, partitioning = 'hive', 
> filters = my_filter_expression).to_table().to_pandas()`
> Is there a way to handle schema changes so that the data read contains all 
> columns?
> Everything works fine when I copy the needed parquet files into a separate 
> folder, but that is a very inconvenient way of working. 
>  
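
What "merging the schema" would mean here can be shown with a small library-free sketch. It uses plain dictionaries in place of Arrow schemas and rows, and the helper names `merge_schemas` and `read_with_schema` are hypothetical; this illustrates the requested behaviour and is not pyarrow's dataset API.

```python
def merge_schemas(schemas: list) -> dict:
    """Union of all fields in first-seen order; conflicting types are an error."""
    merged = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if merged.setdefault(name, type_name) != type_name:
                raise TypeError(
                    f"field {name!r} has conflicting types: "
                    f"{merged[name]} vs {type_name}"
                )
    return merged


def read_with_schema(rows: list, schema: dict) -> list:
    """Project rows onto the merged schema, filling absent columns with None."""
    return [{name: row.get(name) for name in schema} for row in rows]


old_partition_schema = {"id": "int64"}
new_partition_schema = {"id": "int64", "score": "double"}
merged = merge_schemas([old_partition_schema, new_partition_schema])

rows = read_with_schema([{"id": 1}, {"id": 2, "score": 0.5}], merged)
```

Rows from the older partition come back with `score` set to `None` instead of the column being dropped for every partition.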





[jira] [Resolved] (ARROW-8982) [CI] Remove allow_failures for s390x in TravisCI

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8982.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7301
[https://github.com/apache/arrow/pull/7301]

> [CI] Remove allow_failures for s390x in TravisCI
> 
>
> Key: ARROW-8982
> URL: https://issues.apache.org/jira/browse/ARROW-8982
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: CI
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> All existing tests except Parquet now pass on s390x. It is a good time to 
> remove {{allow_failures}} for s390x on TravisCI.





[jira] [Created] (ARROW-8985) [Format] Add "byte width" field with default of 16 to Decimal Flatbuffers type for forward compatibility

2020-05-30 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8985:
---

 Summary: [Format] Add "byte width" field with default of 16 to 
Decimal Flatbuffers type for forward compatibility
 Key: ARROW-8985
 URL: https://issues.apache.org/jira/browse/ARROW-8985
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


This will permit larger or smaller decimals to be added to the format later 
without having to add a new Type union value
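
The forward-compatibility argument can be sketched in Python: a reader that treats a missing width as 16 keeps decoding existing 128-bit decimals unchanged, while newer metadata can carry another width. The field name `byte_width`, the `DecimalType` class, and the use of a plain dictionary in place of decoded Flatbuffers metadata are assumptions for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    precision: int
    scale: int
    byte_width: int = 16  # default matches the existing 128-bit decimal


def decimal_from_metadata(fields: dict) -> DecimalType:
    """Build the type from decoded metadata; older writers omit byte_width."""
    return DecimalType(
        precision=fields["precision"],
        scale=fields["scale"],
        byte_width=fields.get("byte_width", 16),
    )


def max_precision(byte_width: int) -> int:
    """Number of decimal digits that always fit in a signed integer this wide."""
    largest = 2 ** (8 * byte_width - 1) - 1
    # The largest value is never all nines, so one digit fewer always fits.
    return len(str(largest)) - 1
```

Under this model, metadata written before the field existed still yields a 16-byte decimal, and `max_precision` gives the digit capacity a wider or narrower decimal would have.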





[jira] [Assigned] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8966:
---

Assignee: Wes McKinney

> [C++] Move arrow::ArrayData to a separate header file
> -
>
> Key: ARROW-8966
> URL: https://issues.apache.org/jira/browse/ARROW-8966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There are code modules (such as compute kernels) that only require ArrayData 
> for doing computations, so pulling in all the code in array.h is not 
> necessary. There are probably other code paths that might benefit from this 
> also. 





[jira] [Assigned] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6856:
---

Assignee: Wes McKinney

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}
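
The caching idea mentioned above can be sketched as follows. The Python classes only mirror the C++ names for readability; the attribute `_boxed_dictionary` and the lazy-initialization approach are assumptions for illustration, not the implementation that was merged.

```python
class ArrayData:
    """Plain data container; the dictionary is itself an ArrayData (or None)."""

    def __init__(self, values, dictionary=None):
        self.values = values
        self.dictionary = dictionary


class Array:
    """A 'boxed' wrapper around ArrayData."""

    def __init__(self, data: ArrayData):
        self.data = data


class DictionaryArray(Array):
    def __init__(self, data: ArrayData):
        super().__init__(data)
        self._boxed_dictionary = None  # created on first access

    @property
    def dictionary(self) -> Array:
        """Box the dictionary ArrayData once and reuse the wrapper afterwards."""
        if self._boxed_dictionary is None:
            self._boxed_dictionary = Array(self.data.dictionary)
        return self._boxed_dictionary
```

Repeated calls to `dictionary` return the same wrapper object, so callers that expect an `Array` do not pay for a new box each time.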





[jira] [Updated] (ARROW-8983) [Python] Downloading sources of pyarrow and its requirements from pypi takes several minutes starting from 0.16.0

2020-05-30 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8983:

Summary: [Python] Downloading sources of pyarrow and its requirements from 
pypi takes several minutes starting from 0.16.0  (was: Downloading sources of 
pyarrow and its requirements from pypi takes several minutes starting from 
0.16.0)

> [Python] Downloading sources of pyarrow and its requirements from pypi takes 
> several minutes starting from 0.16.0
> -
>
> Key: ARROW-8983
> URL: https://issues.apache.org/jira/browse/ARROW-8983
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.17.0, 0.17.1
>Reporter: Valentyn Tymofieiev
>Priority: Minor
>
> It appears that 
>   python -m pip download --dest /tmp pyarrow==0.17.1 --no-binary :all:
> takes several minutes to execute. 
> There seems to be an increase in runtime starting from 0.16.0: on Python 2 
>  python -m pip download --dest /tmp pyarrow==0.15.1 --no-binary :all:
> appears to be somewhat faster, but the same command is still slow on Py3.
> The command is stuck for a while with "Installing build dependencies ... ", 
> and increased CPU usage.
> The intent of this command is to download source tarball for a package and 
> its dependencies.
> Some investigation was started on the mailing list: 
> https://lists.apache.org/thread.html/r9baa48a9d1517834c285f0f238f29fcf54405cb7cf1e681314239d7f%40%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray

2020-05-29 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119554#comment-17119554
 ] 

Wes McKinney commented on ARROW-8976:
-

I plan to add Take and Filter "metafunctions" that deal with this and also 
Table/RecordBatch inputs

> [C++] compute::CallFunction can't Filter/Take with ChunkedArray
> ---
>
> Key: ARROW-8976
> URL: https://issues.apache.org/jira/browse/ARROW-8976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-8938
> {{Invalid: Kernel does not support chunked array arguments}}





[jira] [Assigned] (ARROW-8976) [C++] compute::CallFunction can't Filter/Take with ChunkedArray

2020-05-29 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8976:
---

Assignee: Wes McKinney

> [C++] compute::CallFunction can't Filter/Take with ChunkedArray
> ---
>
> Key: ARROW-8976
> URL: https://issues.apache.org/jira/browse/ARROW-8976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-8938
> {{Invalid: Kernel does not support chunked array arguments}}





[jira] [Resolved] (ARROW-8926) [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8926.
-
Resolution: Fixed

Issue resolved by pull request 7264
[https://github.com/apache/arrow/pull/7264]

> [C++] Improve docstrings in new public APIs in arrow/compute and fix 
> miscellaneous typos
> 
>
> Key: ARROW-8926
> URL: https://issues.apache.org/jira/browse/ARROW-8926
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I've noticed some imprecise language while reading the headers and some other 
> opportunities for improvement





[jira] [Assigned] (ARROW-8978) [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" Valgrind warning

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8978:
---

Assignee: Wes McKinney

> [C++][Compute] "Conditional jump or move depends on uninitialised value(s)" 
> Valgrind warning
> 
>
> Key: ARROW-8978
> URL: https://issues.apache.org/jira/browse/ARROW-8978
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Wes McKinney
>Priority: Major
>
> https://github.com/ursa-labs/crossbow/runs/715700830#step:6:4277
> {noformat}
> [ RUN  ] TestCallScalarFunction.PreallocationCases
> ==5357== Conditional jump or move depends on uninitialised value(s)
> ==5357==at 0x51D69A6: void arrow::internal::TransferBitmap true>(unsigned char const*, long, long, long, unsigned char*) 
> (bit_util.cc:176)
> ==5357==by 0x51CE866: arrow::internal::CopyBitmap(unsigned char const*, 
> long, long, unsigned char*, long, bool) (bit_util.cc:208)
> ==5357==by 0x52B6325: 
> arrow::compute::detail::NullPropagator::PropagateSingle() (exec.cc:295)
> ==5357==by 0x52B36D1: Execute (exec.cc:378)
> ==5357==by 0x52B36D1: 
> arrow::compute::detail::PropagateNulls(arrow::compute::KernelContext*, 
> arrow::compute::ExecBatch const&, arrow::ArrayData*) (exec.cc:412)
> ==5357==by 0x52BA7F3: ExecuteBatch (exec.cc:586)
> ==5357==by 0x52BA7F3: 
> arrow::compute::detail::ScalarExecutor::Execute(std::vector std::allocator > const&, arrow::compute::detail::ExecListener*) 
> (exec.cc:542)
> ==5357==by 0x52BC21F: 
> arrow::compute::Function::Execute(std::vector std::allocator > const&, arrow::compute::FunctionOptions 
> const*, arrow::compute::ExecContext*) const (function.cc:94)
> ==5357==by 0x52B141C: 
> arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, 
> std::vector > const&, 
> arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) 
> (exec.cc:937)
> ==5357==by 0x52B16F2: 
> arrow::compute::CallFunction(std::__cxx11::basic_string std::char_traits, std::allocator > const&, 
> std::vector > const&, 
> arrow::compute::ExecContext*) (exec.cc:942)
> ==5357==by 0x155515: 
> arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()::{lambda(std::__cxx11::basic_string  std::char_traits, std::allocator 
> >)#1}::operator()(std::__cxx11::basic_string, 
> std::allocator >) const (exec_test.cc:756)
> ==5357==by 0x156AF2: 
> arrow::compute::detail::TestCallScalarFunction_PreallocationCases_Test::TestBody()
>  (exec_test.cc:786)
> ==5357==by 0x5BE4862: void 
> testing::internal::HandleSehExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in 
> /opt/conda/envs/arrow/lib/libgtest.so)
> ==5357==by 0x5BDEDE2: void 
> testing::internal::HandleExceptionsInMethodIfSupported void>(testing::Test*, void (testing::Test::*)(), char const*) (in 
> /opt/conda/envs/arrow/lib/libgtest.so)
> ==5357== 
> {noformat}





[jira] [Resolved] (ARROW-8918) [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8918.
-
Resolution: Fixed

Issue resolved by pull request 7258
[https://github.com/apache/arrow/pull/7258]

> [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching 
> to appropriate type-specific CastFunction
> --
>
> Key: ARROW-8918
> URL: https://issues.apache.org/jira/browse/ARROW-8918
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> By setting the output type in {{CastOptions}}, we can write
> {code}
> call_function("cast", [arg], cast_options)
> {code}
> This simplifies use of casting for binding developers. This mimics the 
> standard SQL
> {code}
> CAST(expr AS target_type)
> {code}
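
The dispatch described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Arrow's implementation: `TypeId`, `Value`, `CastRegistry`, `CallCast` and `MakeExampleRegistry` are hypothetical names invented here, and only `CastOptions` echoes a name from the issue. The idea shown is that the caller names the target type in the options, and a single "cast" entry point forwards to whichever type-specific function was registered for that target.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; these are not Arrow's classes.
enum class TypeId { kInt32, kFloat64, kString };

struct Value {
  TypeId type;
  double number;     // used when type is kInt32 or kFloat64
  std::string text;  // used when type is kString
};

// The caller states the target type here rather than in the function name.
struct CastOptions {
  TypeId to_type;
};

using CastFn = std::function<Value(const Value&)>;

class CastRegistry {
 public:
  // Register the type-specific cast function for one target type.
  void Register(TypeId to_type, CastFn fn) { casts_[to_type] = std::move(fn); }

  // The "cast" metafunction: read the target type from the options and
  // forward to the matching type-specific function.
  Value CallCast(const Value& arg, const CastOptions& options) const {
    auto it = casts_.find(options.to_type);
    if (it == casts_.end()) {
      throw std::invalid_argument("no cast function for the requested type");
    }
    return it->second(arg);
  }

 private:
  std::map<TypeId, CastFn> casts_;
};

// Build a registry with two illustrative numeric casts.
CastRegistry MakeExampleRegistry() {
  CastRegistry registry;
  registry.Register(TypeId::kInt32, [](const Value& v) {
    // Truncates toward zero, like a plain static_cast.
    return Value{TypeId::kInt32,
                 static_cast<double>(static_cast<int32_t>(v.number)), ""};
  });
  registry.Register(TypeId::kFloat64, [](const Value& v) {
    return Value{TypeId::kFloat64, v.number, ""};
  });
  return registry;
}
```

A binding would then call `registry.CallCast(value, CastOptions{TypeId::kInt32})` without needing to know which concrete cast function handles that target type.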





[jira] [Resolved] (ARROW-8968) [C++][Gandiva] Show link warning message on s390x

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8968.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7295
[https://github.com/apache/arrow/pull/7295]

> [C++][Gandiva] Show link warning message on s390x
> -
>
> Key: ARROW-8968
> URL: https://issues.apache.org/jira/browse/ARROW-8968
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When executing the Gandiva tests, the following warning message is shown:
> {code}
> ~/arrow/cpp/src/gandiva$ ../../build/debug/gandiva-binary-test -V
> Running main() from 
> /home/ishizaki/arrow/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from TestBinary
> [ RUN  ] TestBinary.TestSimple
> warning: Linking two modules of different data layouts: 'precompiled' is 
> 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64' whereas 'codegen' is 
> 'E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64'
> [   OK ] TestBinary.TestSimple (41 ms)
> [--] 1 test from TestBinary (41 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (41 ms total)
> [  PASSED  ] 1 test.
>  {code}





[jira] [Resolved] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8793.
-
Resolution: Fixed

Issue resolved by pull request 7296
[https://github.com/apache/arrow/pull/7296]

> [C++] BitUtil::SetBitsTo probably doesn't need to be inline
> ---
>
> Key: ARROW-8793
> URL: https://issues.apache.org/jira/browse/ARROW-8793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Inlining this function probably does not yield meaningful performance benefits
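
For context, here is a rough sketch of what an out-of-line version of such a function could look like. It is an assumption-laden illustration, not the code from the pull request: the signature, the helper `SetOneBit`, and the least-significant-bit-first ordering are choices made here. The point is that the body is dominated by a `memset` over whole bytes, so the cost of one extra function call is likely to be small by comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Declaration as it might appear in a header: no `inline`, no body.
void SetBitsTo(uint8_t* bits, int64_t start_offset, int64_t length,
               bool bits_are_set);

namespace {

// Set or clear bit i, counting from the least significant bit of each byte.
void SetOneBit(uint8_t* bits, int64_t i, bool value) {
  const uint8_t mask = static_cast<uint8_t>(1u << (i % 8));
  if (value) {
    bits[i / 8] = static_cast<uint8_t>(bits[i / 8] | mask);
  } else {
    bits[i / 8] = static_cast<uint8_t>(bits[i / 8] & ~mask);
  }
}

}  // namespace

// Definition living in a single .cc file. Assumes start_offset >= 0.
void SetBitsTo(uint8_t* bits, int64_t start_offset, int64_t length,
               bool bits_are_set) {
  if (length <= 0) return;
  const int64_t end = start_offset + length;  // one past the last bit
  int64_t i = start_offset;

  // Leading bits up to the next byte boundary.
  while (i < end && (i % 8) != 0) {
    SetOneBit(bits, i, bits_are_set);
    ++i;
  }

  // Whole bytes in the middle are filled in one call.
  const int64_t whole_bytes = (end - i) / 8;
  if (whole_bytes > 0) {
    std::memset(bits + i / 8, bits_are_set ? 0xFF : 0x00,
                static_cast<std::size_t>(whole_bytes));
    i += whole_bytes * 8;
  }

  // Trailing bits after the last whole byte.
  while (i < end) {
    SetOneBit(bits, i, bits_are_set);
    ++i;
  }
}
```

Keeping only the declaration in the header means each translation unit that calls the function emits a call rather than its own copy of the body.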





[jira] [Updated] (ARROW-8971) [Python] Upgrade pip version in manylinux* builds

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8971:

Summary: [Python] Upgrade pip version in manylinux* builds  (was: [Python] 
Upgrade pip)

> [Python] Upgrade pip version in manylinux* builds
> -
>
> Key: ARROW-8971
> URL: https://issues.apache.org/jira/browse/ARROW-8971
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: bindu
>Assignee: bindu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Could you please update pip to the latest version, 20.1?
> [https://github.com/apache/arrow/blob/2688a62f8179f20c20c06a10fcd22fe8a714ae48/python/manylinux1/scripts/requirements.txt]
> CVE-2018-20225
> An issue was discovered in pip (all versions) because it installs the version 
> with the highest version number, even if the user had intended to obtain a 
> private package from a private index. This only affects use of the 
> --extra-index-url option, and exploitation requires that the package does not 
> already exist in the public index (and thus the attacker can put the package 
> there with an arbitrary version number).





[jira] [Resolved] (ARROW-8960) [MINOR] [FORMAT] Fix typos in comments

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8960.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7274
[https://github.com/apache/arrow/pull/7274]

> [MINOR] [FORMAT] Fix typos in comments
> --
>
> Key: ARROW-8960
> URL: https://issues.apache.org/jira/browse/ARROW-8960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Chen
>Assignee: Chen
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-8960) [MINOR] [FORMAT] fix typo

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8960:
---

Assignee: Chen

> [MINOR] [FORMAT] fix typo
> -
>
> Key: ARROW-8960
> URL: https://issues.apache.org/jira/browse/ARROW-8960
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Chen
>Assignee: Chen
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8960) [MINOR] [FORMAT] Fix typos in comments

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8960:

Summary: [MINOR] [FORMAT] Fix typos in comments  (was: [MINOR] [FORMAT] fix 
typo)

> [MINOR] [FORMAT] Fix typos in comments
> --
>
> Key: ARROW-8960
> URL: https://issues.apache.org/jira/browse/ARROW-8960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Chen
>Assignee: Chen
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8960) [MINOR] [FORMAT] fix typo

2020-05-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8960:

Component/s: Documentation

> [MINOR] [FORMAT] fix typo
> -
>
> Key: ARROW-8960
> URL: https://issues.apache.org/jira/browse/ARROW-8960
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Chen
>Assignee: Chen
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8970:

Description: 
We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive, 
or where some templates may do the same thing (e.g. Int32Type kernels may do 
the same thing as a Date32Type kernel)
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size

At this moment (2020-05-27), here are the 25 largest object files in a -O2 build

{code}
524896  src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
531920  src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o
552000  src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o
575920  src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o
595112  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o
645728  src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o
683040  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o
702232  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o
729912  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o
877680  src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
885624  src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
919072  src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
941776  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o
1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o
1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o
1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o
1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o
1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o
1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o
1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o
1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o
2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o
7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o
{code}

  was:
We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size

At this moment (2020-05-27), here are the 25 largest object files in a -O2 build

{code}
524896  src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
531920  src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o
552000  src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o
575920  src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o
595112  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o
645728  src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o
683040  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o
702232  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o
729912  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o
877680  src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
885624  src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
919072  src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
941776  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o
1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o
1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o
1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o
1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o
1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o
1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o
1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o
1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o
2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o
7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o
{code}


> [C++] Reduce shared library code size (umbrella issue)
> --
>
> Key: ARROW-8970
>

[jira] [Updated] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8970:

Description: 
We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size

At this moment (2020-05-27), here are the 25 largest object files in a -O2 build

{code}
524896  src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
531920  src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o
552000  src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o
575920  src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o
595112  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o
645728  src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o
683040  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o
702232  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o
729912  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o
752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o
877680  src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
885624  src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
919072  src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
941776  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o
1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o
1233304 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_compare.cc.o
1265160 src/arrow/CMakeFiles/arrow_objlib.dir/sparse_tensor.cc.o
1343480 src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csf_converter.cc.o
1346928 src/arrow/CMakeFiles/arrow_objlib.dir/array.cc.o
1502568 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_hash.cc.o
1609760 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_numeric.cc.o
1794416 src/arrow/CMakeFiles/arrow_objlib.dir/array/diff.cc.o
2759552 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_filter.cc.o
7609432 src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/vector_take.cc.o
{code}

  was:
We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size


> [C++] Reduce shared library code size (umbrella issue)
> --
>
> Key: ARROW-8970
> URL: https://issues.apache.org/jira/browse/ARROW-8970
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We're reaching a point where we may need to be careful about decisions that 
> increase code size:
> * Instantiating too many templates for code that isn't performance sensitive
> * Inlining functions that don't need to be inline
> Code size tends to correlate also with compilation times, but not always.
> I'll use this umbrella issue to organize issues related to reducing compiled 
> code size
> At this moment (2020-05-27), here are the 25 largest object files in a -O2 
> build
> {code}
> 524896  src/arrow/CMakeFiles/arrow_objlib.dir/array/builder_dict.cc.o
> 531920  src/arrow/CMakeFiles/arrow_objlib.dir/filesystem/s3fs.cc.o
> 552000  src/arrow/CMakeFiles/arrow_objlib.dir/json/converter.cc.o
> 575920  src/arrow/CMakeFiles/arrow_objlib.dir/csv/converter.cc.o
> 595112  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_cast_string.cc.o
> 645728  src/arrow/CMakeFiles/arrow_objlib.dir/type.cc.o
> 683040  src/arrow/CMakeFiles/arrow_objlib.dir/compute/kernels/scalar_set_lookup.cc.o
> 702232  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/reader.cc.o
> 729912  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/coo_converter.cc.o
> 752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csc_converter.cc.o
> 752776  src/arrow/CMakeFiles/arrow_objlib.dir/tensor/csr_converter.cc.o
> 877680  src/arrow/CMakeFiles/arrow_objlib.dir/array/dict_internal.cc.o
> 885624  src/arrow/CMakeFiles/arrow_objlib.dir/builder.cc.o
> 919072  src/arrow/CMakeFiles/arrow_objlib.dir/scalar.cc.o
> 941776  src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_internal.cc.o
> 1055248 src/arrow/CMakeFiles/arrow_objlib.dir/ipc/json_simple.cc.o
> 1233304   
> 

[jira] [Created] (ARROW-8970) [C++] Reduce shared library code size (umbrella issue)

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8970:
---

 Summary: [C++] Reduce shared library code size (umbrella issue)
 Key: ARROW-8970
 URL: https://issues.apache.org/jira/browse/ARROW-8970
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We're reaching a point where we may need to be careful about decisions that 
increase code size:

* Instantiating too many templates for code that isn't performance sensitive
* Inlining functions that don't need to be inline

Code size tends to correlate also with compilation times, but not always.

I'll use this umbrella issue to organize issues related to reducing compiled 
code size





[jira] [Updated] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7784:

Issue Type: Improvement  (was: Bug)

> [C++] diff.cc is extremely slow to compile
> --
>
> Key: ARROW-7784
> URL: https://issues.apache.org/jira/browse/ARROW-7784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> This comes up especially when doing an optimized build. {{diff.cc}} is always 
> enabled even if all components are disabled, and it takes multiple seconds to 
> compile. 





[jira] [Assigned] (ARROW-8793) [C++] BitUtil::SetBitsTo probably doesn't need to be inline

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8793:
---

Assignee: Wes McKinney

> [C++] BitUtil::SetBitsTo probably doesn't need to be inline
> ---
>
> Key: ARROW-8793
> URL: https://issues.apache.org/jira/browse/ARROW-8793
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Inlining this function probably does not yield meaningful performance benefits





[jira] [Resolved] (ARROW-8957) [FlightRPC][C++] Fail to build due to IpcOptions

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8957.
-
Resolution: Fixed

Issue resolved by pull request 7277
[https://github.com/apache/arrow/pull/7277]

> [FlightRPC][C++] Fail to build due to IpcOptions
> 
>
> Key: ARROW-8957
> URL: https://issues.apache.org/jira/browse/ARROW-8957
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8969:

Description: 
We are instantiating multiple versions of templates in this module for cases 
that, byte-wise, do the exact same comparison. For example:

* For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels 
for signed int / unsigned int / floating point types of the same byte width
* TimestampType can reuse int64 kernels, similarly for other date/time types
* BinaryType/StringType can share kernels

etc.

  was:
We are instantiating templates in this module for cases that, byte-wise, do the 
exact same comparison. For example:

* For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels 
for signed int / unsigned int / floating point types of the same byte width
* TimestampType can reuse int64 kernels, similarly for other date/time types
* BinaryType/StringType can share kernels

etc.


> [C++] Reduce generated code in compute/kernels/scalar_compare.cc
> 
>
> Key: ARROW-8969
> URL: https://issues.apache.org/jira/browse/ARROW-8969
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We are instantiating multiple versions of templates in this module for cases 
> that, byte-wise, do the exact same comparison. For example:
> * For equals, not_equals, we can use the same 32-bit/64-bit comparison 
> kernels for signed int / unsigned int / floating point types of the same byte 
> width
> * TimestampType can reuse int64 kernels, similarly for other date/time types
> * BinaryType/StringType can share kernels
> etc.
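
As a rough illustration of the idea, the sketch below instantiates one equality loop per byte width and lets same-width integer types share it. The names `EqualsFixedWidth` and `EqualsSameWidth` are hypothetical and are not taken from scalar_compare.cc. Floating point is deliberately left out of this sketch, since NaN and signed zero would need separate care before a bit-pattern comparison could stand in for `==`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// The "heavy" kernel, instantiated once per byte width. For integer types,
// equality depends only on the bit pattern, so comparing the values as
// unsigned integers of the same width gives the same answer.
template <typename UInt>
void EqualsFixedWidth(const void* left, const void* right, std::size_t n,
                      std::vector<bool>* out) {
  const auto* l = static_cast<const unsigned char*>(left);
  const auto* r = static_cast<const unsigned char*>(right);
  out->assign(n, false);
  for (std::size_t i = 0; i < n; ++i) {
    UInt a;
    UInt b;
    // memcpy avoids aliasing problems when reinterpreting the input bytes.
    std::memcpy(&a, l + i * sizeof(UInt), sizeof(UInt));
    std::memcpy(&b, r + i * sizeof(UInt), sizeof(UInt));
    (*out)[i] = (a == b);
  }
}

// Thin typed wrapper: int32_t and uint32_t both forward to the uint32_t
// kernel, int64_t (for example a timestamp's storage) and uint64_t both
// forward to the uint64_t kernel, and so on.
template <typename T>
std::vector<bool> EqualsSameWidth(const std::vector<T>& left,
                                  const std::vector<T>& right) {
  static_assert(std::is_integral<T>::value,
                "this sketch only covers integer-like storage types");
  using Physical = typename std::make_unsigned<T>::type;
  std::vector<bool> out;
  EqualsFixedWidth<Physical>(left.data(), right.data(),
                             std::min(left.size(), right.size()), &out);
  return out;
}
```

The wrapper is still a template, but it is only a few instructions long, so the bulk of the generated code is emitted once per width rather than once per logical type.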





[jira] [Created] (ARROW-8969) [C++] Reduce generated code in compute/kernels/scalar_compare.cc

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8969:
---

 Summary: [C++] Reduce generated code in 
compute/kernels/scalar_compare.cc
 Key: ARROW-8969
 URL: https://issues.apache.org/jira/browse/ARROW-8969
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We are instantiating templates in this module for cases that, byte-wise, do the 
exact same comparison. For example:

* For equals, not_equals, we can use the same 32-bit/64-bit comparison kernels 
for signed int / unsigned int / floating point types of the same byte width
* TimestampType can reuse int64 kernels, similarly for other date/time types
* BinaryType/StringType can share kernels

etc.





[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118061#comment-17118061
 ] 

Wes McKinney commented on ARROW-8961:
-

Ah great. I see that utf8proc includes a 1.5 MB data file, so we shouldn't be 
too cavalier about vendoring it. If utf8proc is only required when 
{{-DARROW_COMPUTE=ON}} then perhaps we can just add it as a normal thirdparty 
toolchain library

> [C++] Vendor utf8proc library
> -
>
> Key: ARROW-8961
> URL: https://issues.apache.org/jira/browse/ARROW-8961
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally 
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc





[jira] [Commented] (ARROW-8967) [Python] [Parquet] pyarrow.Table.to_pandas() fails to convert valid TIMESTAMP_MILLIS to pandas timestamp

2020-05-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118060#comment-17118060
 ] 

Wes McKinney commented on ARROW-8967:
-

I'm not sure if this is fixable, since pandas datetime64 data uses the 
nanosecond unit. [~jorisvandenbossche] do you know?

[~markwaddle] you can read this file fine into Arrow format, so it isn't true 
that "there is no way to read this file". You just can't convert out of bounds 
timestamps to pandas format at the moment. 
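
To make the bounds issue concrete, here is a small sketch of a checked millisecond-to-nanosecond conversion. It assumes both units are stored as signed 64-bit counts since the epoch; the function name `MillisToNanosChecked` is made up for this example and is not an Arrow or pandas API. A nanosecond count in an `int64_t` only spans roughly 292 years either side of the epoch, so a millisecond value outside that window has no nanosecond representation.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Convert milliseconds since the epoch to nanoseconds, returning
// std::nullopt when the result would not fit in an int64_t.
std::optional<int64_t> MillisToNanosChecked(int64_t ms) {
  constexpr int64_t kNanosPerMilli = 1000000;
  // Largest and smallest millisecond values whose product still fits.
  constexpr int64_t kMaxMs =
      std::numeric_limits<int64_t>::max() / kNanosPerMilli;
  constexpr int64_t kMinMs =
      std::numeric_limits<int64_t>::min() / kNanosPerMilli;
  if (ms > kMaxMs || ms < kMinMs) {
    return std::nullopt;  // out of bounds for a nanosecond timestamp
  }
  return ms * kNanosPerMilli;
}
```

Checking against the precomputed limits before multiplying avoids signed overflow, which would be undefined behaviour in C++.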

> [Python] [Parquet] pyarrow.Table.to_pandas() fails to convert valid 
> TIMESTAMP_MILLIS to pandas timestamp
> 
>
> Key: ARROW-8967
> URL: https://issues.apache.org/jira/browse/ARROW-8967
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.0
>Reporter: Mark Waddle
>Priority: Major
>
> Reading a Parquet file with a valid TIMESTAMP_MILLIS value of -6155291520 
> (0019-06-20) results in the following error:
> {noformat}
> File "pyarrow/array.pxi", line 587, in 
> pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow/table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File 
> "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 766, in table_to_blockmanager
> blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File 
> "/Users/mark/.local/share/virtualenvs/parquetpy-BNIqCtDj/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 1102, in _table_to_blocks
> list(extension_columns.keys()))
>   File "pyarrow/table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[ms] to timestamp[ns] would 
> result in out of bounds timestamp: -6155291520
> {noformat}
> As it stands, there is no way to read this file.
> I would like to be able to choose the timestamp unit when reading, much like 
> you can when writing.





[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117932#comment-17117932
 ] 

Wes McKinney commented on ARROW-8961:
-

[~uwe] I would say it would be worth going ahead and adding utf8proc to 
conda-forge if it is not there already. 

> [C++] Vendor utf8proc library
> -
>
> Key: ARROW-8961
> URL: https://issues.apache.org/jira/browse/ARROW-8961
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally 
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc





[jira] [Updated] (ARROW-8963) [C++][Parquet] Parquet cpp optimize allocate memory

2020-05-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8963:

Summary: [C++][Parquet] Parquet cpp optimize allocate memory  (was: Parquet 
cpp optimize allocate memory)

> [C++][Parquet] Parquet cpp optimize allocate memory
> ---
>
> Key: ARROW-8963
> URL: https://issues.apache.org/jira/browse/ARROW-8963
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Affects Versions: 0.17.1
>Reporter: yiming.xu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LeafReader::NextBatch should Reset memory first; otherwise, Reserve will 
> allocate memory twice.





[jira] [Created] (ARROW-8966) [C++] Move arrow::ArrayData to a separate header file

2020-05-27 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8966:
---

 Summary: [C++] Move arrow::ArrayData to a separate header file
 Key: ARROW-8966
 URL: https://issues.apache.org/jira/browse/ARROW-8966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


There are code modules (such as compute kernels) that only require ArrayData 
for doing computations, so pulling in all the code in array.h is not necessary. 
There are probably other code paths that might benefit from this also. 





[jira] [Commented] (ARROW-7384) [Website] Fix search indexing warning reported by Google

2020-05-26 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117256#comment-17117256
 ] 

Wes McKinney commented on ARROW-7384:
-

I would guess it's still a problem. I think it's something to do with our 
Jekyll website configuration

> [Website] Fix search indexing warning reported by Google
> 
>
> Key: ARROW-7384
> URL: https://issues.apache.org/jira/browse/ARROW-7384
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
>
> I received the following e-mail from Google regarding arrow.apache.org (since 
> I'm an admin on the Analytics account)
> {code}
> Top Warnings
> Warnings are suggestions for improvement. Some warnings can affect your 
> appearance on Search; some might be reclassified as errors in the future. The 
> following warnings were found on your site:
> Indexed, though blocked by robots.txt
> We recommend that you fix these issues when possible to enable the best 
> experience and coverage in Google Search.
> {code}





[jira] [Comment Edited] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null

2020-05-26 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117254#comment-17117254
 ] 

Wes McKinney edited comment on ARROW-8956 at 5/27/20, 2:43 AM:
---

The only options for this function at the moment are true or false

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114

This is data structure equality, not elementwise analytic equality
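
The difference between the two notions of equality can be sketched in plain Python. This is only an illustration and not Arrow's C++ implementation: `None` stands in for a null scalar, and the names `structural_equals` and `elementwise_equals` are hypothetical. The structural variant follows the array behavior described in the issue, where null slots compare equal to null slots.

```python
from typing import Optional


def structural_equals(a: Optional[int], b: Optional[int]) -> bool:
    """Data structure equality: the answer is always True or False.

    Two nulls hold the same (absent) value, so they compare equal.
    """
    return a == b  # None == None evaluates to True


def elementwise_equals(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """Analytic (SQL-style) equality: a null input yields a null result."""
    if a is None or b is None:
        return None
    return a == b
```

With only true and false available as results, the second function cannot be expressed, which is why the comparison here is structural.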


was (Author: wesmckinn):
The only options for this function at the moment are true or false

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114

> [C++] arrow::ScalarEquals returns false when values are both null
> -
>
> Key: ARROW-8956
> URL: https://issues.apache.org/jira/browse/ARROW-8956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I wasn't sure if this was deliberate but it appeared while writing unit tests 
> and so wanted to check what was the intention before changing it. Arrays 
> compare equal when null slots are respectively null so this seems 
> inconsistent.





[jira] [Commented] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null

2020-05-26 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117254#comment-17117254
 ] 

Wes McKinney commented on ARROW-8956:
-

The only options for this function at the moment are true or false

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.h#L114

> [C++] arrow::ScalarEquals returns false when values are both null
> -
>
> Key: ARROW-8956
> URL: https://issues.apache.org/jira/browse/ARROW-8956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I wasn't sure if this was deliberate but it appeared while writing unit tests 
> and so wanted to check what was the intention before changing it. Arrays 
> compare equal when null slots are respectively null so this seems 
> inconsistent.





[jira] [Created] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8961:
---

 Summary: [C++] Vendor utf8proc library
 Key: ARROW-8961
 URL: https://issues.apache.org/jira/browse/ARROW-8961
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a minimal MIT-licensed library for UTF-8 data processing originally 
developed for use in Julia

https://github.com/JuliaStrings/utf8proc





[jira] [Assigned] (ARROW-8922) [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-8922:
---

Assignee: Wes McKinney

> [C++] Implement example string scalar kernel function to assist with string 
> kernels buildout per ARROW-555
> --
>
> Key: ARROW-8922
> URL: https://issues.apache.org/jira/browse/ARROW-8922
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I will write a patch to provide an example of creating a string-input 
> string-output kernel for executing scalar-valued string functions





[jira] [Updated] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8956:

Description: I wasn't sure if this was deliberate but it appeared while 
writing unit tests and so wanted to check what was the intention before 
changing it. Arrays compare equal when null slots are respectively null so this 
seems inconsistent.  (was: I wasn't sure if this was deliberate but it appeared 
while writing unit tests and so wanted to check what was the intention before 
changing it. )

> [C++] arrow::ScalarEquals returns false when values are both null
> -
>
> Key: ARROW-8956
> URL: https://issues.apache.org/jira/browse/ARROW-8956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> I wasn't sure if this was deliberate but it appeared while writing unit tests 
> and so wanted to check what was the intention before changing it. Arrays 
> compare equal when null slots are respectively null so this seems 
> inconsistent.





[jira] [Created] (ARROW-8956) [C++] arrow::ScalarEquals returns false when values are both null

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8956:
---

 Summary: [C++] arrow::ScalarEquals returns false when values are 
both null
 Key: ARROW-8956
 URL: https://issues.apache.org/jira/browse/ARROW-8956
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I wasn't sure if this was deliberate but it appeared while writing unit tests 
and so wanted to check what was the intention before changing it. 





[jira] [Created] (ARROW-8955) [C++] Use kernels for casting Scalar values instead of bespoke implementation

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8955:
---

 Summary: [C++] Use kernels for casting Scalar values instead of 
bespoke implementation
 Key: ARROW-8955
 URL: https://issues.apache.org/jira/browse/ARROW-8955
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


See details of casting in arrow/scalar.cc





[jira] [Closed] (ARROW-1329) [C++] Define "virtual table" interface

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1329.
---
Resolution: Duplicate

Closing in favor of ARROW-8939

> [C++] Define "virtual table" interface
> --
>
> Key: ARROW-1329
> URL: https://issues.apache.org/jira/browse/ARROW-1329
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> The idea is that a virtual table may reference Arrow data that is not yet 
> available in memory. The implementation will define the semantics of how 
> columns are loaded into memory. 
> A virtual column interface will need to accompany this. For example:
> {code:language=c++}
> std::shared_ptr vtable = ...;
> std::shared_ptr vcolumn = vtable->column(i);
> std::shared_ptr = vcolumn->Materialize();
> std::shared_ptr = vtable->Materialize();
> {code}
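
The lazy-loading idea in the description can be sketched in Python. This is a minimal illustration under stated assumptions, not the proposed C++ interface: the class names, the `loader` callables, and the use of plain lists as column data are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


class VirtualColumn:
    """A column whose values are loaded only when first requested."""

    def __init__(self, name: str, loader: Callable[[], List[int]]) -> None:
        self.name = name
        self._loader = loader  # hypothetical callback that fetches the data
        self._values: Optional[List[int]] = None

    def materialize(self) -> List[int]:
        if self._values is None:  # load at most once, then reuse
            self._values = self._loader()
        return self._values


class VirtualTable:
    """A table that references columns which may not be in memory yet."""

    def __init__(self, loaders: Dict[str, Callable[[], List[int]]]) -> None:
        self._columns = [VirtualColumn(n, f) for n, f in loaders.items()]

    def column(self, i: int) -> VirtualColumn:
        return self._columns[i]

    def materialize(self) -> Dict[str, List[int]]:
        # Loads every column that has not been loaded yet.
        return {c.name: c.materialize() for c in self._columns}
```

Materializing a single column leaves the others untouched, and materializing the table afterwards reuses what was already loaded.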





[jira] [Updated] (ARROW-8949) [Java] Flight - getInfo() returning 0.0.0.0:47470

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8949:

Summary: [Java] Flight - getInfo() returning 0.0.0.0:47470  (was: Java 
Flight - getInfo() returning 0.0.0.0:47470)

> [Java] Flight - getInfo() returning 0.0.0.0:47470
> -
>
> Key: ARROW-8949
> URL: https://issues.apache.org/jira/browse/ARROW-8949
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 0.17.1
>Reporter: Bryce Brooks
>Priority: Major
>
> The code below is incomplete but I thought it would be good to show how I am 
> connecting. The server is Dremio. The python client works fine when I attempt 
> a simple test. I am not sure what is going on with the Java client but the 
> getInfo returns an endpoint location of 0.0.0.0:47470. 
> The info object I get back is:
> info: FlightInfo\{schema=Schema, descriptor=53 45 4C 45 43 
> 54 20 54 41 42 4C 45 5F 4E 41 4D 45 20 46 52 4F 4D 20 49 4E 46 4F 52 4D 41 54 
> 49 4F 4E 5F 53 43 48 45 4D 41 2E 56 49 45 57 53 , 
> endpoints=[FlightEndpoint{locations=[Location{uri=grpc+tcp://0.0.0.0:47470}], 
> ticket=org.apache.arrow.flight.Ticket@1ad0dd97}], bytes=-1, records=-1}
>  
> {code:java}
> // 
> private void testConnect() {
>   final String host = "somehost"; // removed for security
>   final int port = 1234;  // removed for security
>   final Location location = Location.forGrpcInsecure(host, port);
>   try (FlightClient c = flightClient(allocator, location)) {
>  c.authenticate(new BasicClientAuthHandler("username", "password"));  
>   
>  String sql = "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.VIEWS";  
>  
>  FlightInfo info = c.getInfo(FlightDescriptor.command(sql.getBytes()));   
>  
>  log.info("info: " + info.toString());
>  log.info("  " + info.getDescriptor().toString());
>  log.info("  " + info.getSchema().toString());
>  log.info("  " + info.getEndpoints().size()); 
>  long total = info.getEndpoints().stream()
> .map(this::submit)
> .map(DataApiApplication::get)
> .mapToLong(Long::longValue)
> .sum();
>   log.info("" + total);
>} catch (Exception e) {
>   log.error("ERROR DURING GET");
>   log.error(e.getMessage());
>   log.error(e.getLocalizedMessage());
>}
> }
> private Future submit(FlightEndpoint e) {  
>int thisEndpoint = endpointsSubmitted.incrementAndGet();  
>log.debug("submitting flight endpoint {} with ticket {} to {}",   
>thisEndpoint, 
>new String(e.getTicket().getBytes()), 
>e.getLocations().get(0).getUri());
>RunnableReader reader = new RunnableReader(allocator, e);
>Future f = tp.submit(reader);
>log.debug("submitted flight endpoint {} with ticket {} to {}", 
>   thisEndpoint, 
>   new String(e.getTicket().getBytes()), 
>   e.getLocations().get(0).getUri());  
>return f;
> }{code}
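
Two details of the FlightInfo output above can be checked with a short Python sketch. The first function decodes the hex descriptor, which turns out to be the SQL command the client sent. The second shows one possible client-side workaround for a wildcard address such as 0.0.0.0, namely substituting the host the client originally connected to. That workaround is an assumption made for illustration, not something proposed in this thread, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Descriptor bytes exactly as printed in the FlightInfo above.
DESCRIPTOR_HEX = (
    "53 45 4C 45 43 54 20 54 41 42 4C 45 5F 4E 41 4D 45 20 46 52 4F 4D 20 "
    "49 4E 46 4F 52 4D 41 54 49 4F 4E 5F 53 43 48 45 4D 41 2E 56 49 45 57 53"
)

WILDCARD_HOSTS = ("0.0.0.0", "::")


def decode_descriptor(hex_dump: str) -> str:
    """Turn a space-separated hex dump back into the command text."""
    return bytes.fromhex(hex_dump).decode("ascii")


def is_wildcard(uri: str) -> bool:
    """True if the location names a bind-all address instead of a real host."""
    return urlsplit(uri).hostname in WILDCARD_HOSTS


def with_host(uri: str, connected_host: str) -> str:
    """Replace a wildcard host with the host the client connected to."""
    if not is_wildcard(uri):
        return uri
    parts = urlsplit(uri)
    netloc = connected_host if parts.port is None else f"{connected_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Whether the reported port is reachable on the substituted host depends on the server deployment, so this only helps when the server listens on all interfaces of the host the client already knows.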





[jira] [Created] (ARROW-8951) [C++] Fix compiler warning in compute/kernels/scalar_cast_temporal.cc

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8951:
---

 Summary: [C++] Fix compiler warning in 
compute/kernels/scalar_cast_temporal.cc
 Key: ARROW-8951
 URL: https://issues.apache.org/jira/browse/ARROW-8951
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The kernel functor can return an uninitialized value on errors

{code}
../src/arrow/compute/kernels/scalar_cast_temporal.cc: In member function ‘OUT 
arrow::compute::internal::ParseTimestamp::Call(arrow::compute::KernelContext*, 
ARG0) const [with OUT = long int; ARG0 = 
nonstd::sv_lite::basic_string_view]’:
../src/arrow/compute/kernels/scalar_cast_temporal.cc:267:12: warning: ‘result’ 
may be used uninitialized in this function [-Wmaybe-uninitialized]
 return result;
{code}





[jira] [Resolved] (ARROW-8782) [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8782.
-
Resolution: Fixed

Issue resolved by pull request 7205
[https://github.com/apache/arrow/pull/7205]

> [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set
> -
>
> Key: ARROW-8782
> URL: https://issues.apache.org/jira/browse/ARROW-8782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I plan on adding a new benchmarks folder beneath the datafusion crate, 
> containing benchmarks based on the NYC Taxi data set. The benchmark will be a 
> CLI and will support running a number of different queries against CSV and 
> Parquet.
> The README will contain instructions for downloading the data set.
> The benchmark will produce CSV files containing results.
> These benchmarks will allow us to manually verify performance before major 
> releases and on an ongoing basis as we make changes to 
> Arrow/Parquet/DataFusion.
> I will be basing this on existing benchmarks I recently built in Ballista [1] 
> (I am the only contributor to these benchmarks so far).
> A dockerfile will be provided, making it easy to restrict CPU and RAM when 
> running these benchmarks.
> [1] https://github.com/ballista-compute/ballista/tree/master/rust/benchmarks
>  





[jira] [Resolved] (ARROW-8297) [FlightRPC][C++] Implement Flight DoExchange for C++

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8297.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6656
[https://github.com/apache/arrow/pull/6656]

> [FlightRPC][C++] Implement Flight DoExchange for C++
> 
>
> Key: ARROW-8297
> URL: https://issues.apache.org/jira/browse/ARROW-8297
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> As described in the mailing list vote.





[jira] [Created] (ARROW-8945) [Python] An independent Cython package for projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8945:
---

 Summary: [Python] An independent Cython package for projects that 
want to program against the C data interface
 Key: ARROW-8945
 URL: https://issues.apache.org/jira/browse/ARROW-8945
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code 
(using a C compiler only) that enables projects written in Cython to interact 
with Arrow datasets in minimal ways (for example, iterating over their values, 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementation of the "hard stuff" where they would want to utilize pyarrow 
or the C++ library instead. Otherwise, every Python project that has compiled 
code in Cython and wants to use the C interface would have to create their own 
minimal implementation. 
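
To give a sense of how little code "iterating over values" through the C data interface needs, here is a rough sketch using Python's `ctypes` in place of the Cython pxd declarations such a package would ship. The `ArrowArray` field layout follows the C data interface specification; `iter_int32` and `demo` are hypothetical names, and the sketch only handles int32 arrays with an optional validity bitmap.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """ctypes mirror of `struct ArrowArray` from the C data interface."""


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]


def iter_int32(arr):
    """Yield the values of an int32 array, with None for null slots."""
    # buffers[0] is the validity bitmap (may be NULL), buffers[1] the data.
    validity_addr, data_addr = arr.buffers[0], arr.buffers[1]
    data = ctypes.cast(data_addr, ctypes.POINTER(ctypes.c_int32))
    validity = (
        ctypes.cast(validity_addr, ctypes.POINTER(ctypes.c_uint8))
        if validity_addr
        else None
    )
    for i in range(arr.offset, arr.offset + arr.length):
        # Validity bits use least-significant-bit numbering within each byte.
        if validity is not None and not (validity[i // 8] >> (i % 8)) & 1:
            yield None
        else:
            yield data[i]


def demo():
    """Build a three-element array [10, null, 30] by hand and read it back."""
    values = (ctypes.c_int32 * 3)(10, 0, 30)
    bitmap = (ctypes.c_uint8 * 1)(0b00000101)  # slots 0 and 2 are valid
    buffers = (ctypes.c_void_p * 2)(
        ctypes.addressof(bitmap), ctypes.addressof(values)
    )
    arr = ArrowArray(
        length=3, null_count=1, offset=0, n_buffers=2, n_children=0,
        buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
    )
    # values, bitmap and buffers stay alive until the list is fully built.
    return list(iter_int32(arr))
```

A real consumer would also call the `release` callback when done and would read the array's type from the accompanying `ArrowSchema` instead of assuming int32.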





[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8945:

Description: 
I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code 
(using a C compiler only) that enables projects written in Cython to interact 
with Arrow datasets in minimal ways (for example, iterating over their values, 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementation of the "hard stuff" where they would want to utilize pyarrow 
or the C++ library instead. Otherwise, every Python project that has compiled 
code in Cython and wants to use the C interface 
(https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst)
 would have to create their own minimal implementation. 

Target user for this project would be Python projects like scikit-learn that 
are mostly written in Cython

  was:
I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code 
(using a C compiler only) that enables projects written in Cython to interact 
with Arrow datasets in minimal ways (for example, iterating over their values, 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementation of the "hard stuff" where they would want to utilize pyarrow 
or the C++ library instead. Otherwise, every Python project that has compiled 
code in Cython and wants to use the C interface would have to create their own 
minimal implementation. 

Target user for this project would be Python projects like scikit-learn that 
are mostly written in Cython


> [Python] An independent Cython package for Cython-based projects that want to 
> program against the C data interface
> --
>
> Key: ARROW-8945
> URL: https://issues.apache.org/jira/browse/ARROW-8945
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> I've been thinking it would be useful to have a minimal Cython package, call 
> it "cyarrow", containing some pxd files and a small amount of compiled pyx 
> code (using a C compiler only) that enables projects written in Cython to 
> interact with Arrow datasets in minimal ways (for example, iterating over 
> their values, interacting with dictionary-encoded/categorical arrays) that 
> don't amount to reimplementation of the "hard stuff" where they would want to 
> utilize pyarrow or the C++ library instead. Otherwise, every Python project 
> that has compiled code in Cython and wants to use the C interface 
> (https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst)
>  would have to create their own minimal implementation. 
> Target user for this project would be Python projects like scikit-learn that 
> are mostly written in Cython





[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8945:

Description: 
I've been thinking it would be useful to have a minimal Cython package, call it 
"cyarrow", containing some pxd files and a small amount of compiled pyx code 
(using a C compiler only) that enables projects written in Cython to interact 
with Arrow datasets in minimal ways (for example, iterating over their values, 
interacting with dictionary-encoded/categorical arrays) that don't amount to 
reimplementation of the "hard stuff" where they would want to utilize pyarrow 
or the C++ library instead. Otherwise, every Python project that has compiled 
code in Cython and wants to use the C interface would have to create their own 
minimal implementation. 

Target user for this project would be Python projects like scikit-learn that 
are mostly written in Cython

  was:I've been thinking it would be useful to have a minimal Cython package, 
call it "cyarrow", containing some pxd files and a small amount of compiled pyx 
code (using a C compiler only) that enables projects written in Cython to 
interact with Arrow datasets in minimal ways (for example, iterating over their 
values, interacting with dictionary-encoded/categorical arrays) that don't 
amount to reimplementation of the "hard stuff" where they would want to utilize 
pyarrow or the C++ library instead. Otherwise, every Python project that has 
compiled code in Cython and wants to use the C interface would have to create 
their own minimal implementation. 


> [Python] An independent Cython package for Cython-based projects that want to 
> program against the C data interface
> --
>
> Key: ARROW-8945
> URL: https://issues.apache.org/jira/browse/ARROW-8945
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> I've been thinking it would be useful to have a minimal Cython package, call 
> it "cyarrow", containing some pxd files and a small amount of compiled pyx 
> code (using a C compiler only) that enables projects written in Cython to 
> interact with Arrow datasets in minimal ways (for example, iterating over 
> their values, interacting with dictionary-encoded/categorical arrays) that 
> don't amount to reimplementation of the "hard stuff" where they would want to 
> utilize pyarrow or the C++ library instead. Otherwise, every Python project 
> that has compiled code in Cython and wants to use the C interface would have 
> to create their own minimal implementation. 
> Target user for this project would be Python projects like scikit-learn that 
> are mostly written in Cython





[jira] [Updated] (ARROW-8945) [Python] An independent Cython package for Cython-based projects that want to program against the C data interface

2020-05-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8945:

Summary: [Python] An independent Cython package for Cython-based projects 
that want to program against the C data interface  (was: [Python] An 
independent Cython package for projects that want to program against the C data 
interface)

> [Python] An independent Cython package for Cython-based projects that want to 
> program against the C data interface
> --
>
> Key: ARROW-8945
> URL: https://issues.apache.org/jira/browse/ARROW-8945
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> I've been thinking it would be useful to have a minimal Cython package, call 
> it "cyarrow", containing some pxd files and a small amount of compiled pyx 
> code (using a C compiler only) that enables projects written in Cython to 
> interact with Arrow datasets in minimal ways (for example, iterating over 
> their values, interacting with dictionary-encoded/categorical arrays) that 
> don't amount to reimplementation of the "hard stuff" where they would want to 
> utilize pyarrow or the C++ library instead. Otherwise, every Python project 
> that has compiled code in Cython and wants to use the C interface would have 
> to create their own minimal implementation. 





[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116339#comment-17116339
 ] 

Wes McKinney commented on ARROW-8214:
-

Yes, I would definitely like to see that happen. Using Flatbuffers is desirable 
to avoid the need to link libprotobuf.a

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]





[jira] [Resolved] (ARROW-8772) [C++] Expand SumKernel benchmark to more types

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-8772.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7267
[https://github.com/apache/arrow/pull/7267]

> [C++] Expand SumKernel benchmark to more types
> --
>
> Key: ARROW-8772
> URL: https://issues.apache.org/jira/browse/ARROW-8772
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Frank Du
>Assignee: Frank Du
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Expand SumKernel benchmark to cover more types, Float, Double, Int8, Int16, 
> Int32, Int64.
> Currently it only has an Int64 item; the extra types are useful for further optimization work.





[jira] [Updated] (ARROW-8860) [C++] IPC/Feather decompression broken for nested arrays

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8860:

Priority: Critical  (was: Major)

> [C++] IPC/Feather decompression broken for nested arrays
> 
>
> Key: ARROW-8860
> URL: https://issues.apache.org/jira/browse/ARROW-8860
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When writing a table with a Struct typed column, this is read back with 
> garbage values when using compression (which is the default):
> {code:python}
> >>> table = pa.table({'col': pa.StructArray.from_arrays([[0, 1, 2], [1, 2, 3]], names=["f1", "f2"])})
> # roundtrip through feather
> >>> feather.write_feather(table, "test_struct.feather")
> >>> table2 = feather.read_table("test_struct.feather")
> >>> table2.column("col")
> 
> [
>   -- is_valid: all not null
>   -- child 0 type: int64
> [
>   24,
>   1261641627085906436,
>   1369095386551025664
> ]
>   -- child 1 type: int64
> [
>   24,
>   1405756815161762308,
>   281479842103296
> ]
> ]
> {code}
> When not using compression, it is read back correctly:
> {code:python}
> >>> feather.write_feather(table, "test_struct.feather", compression="uncompressed")
> >>> table2 = feather.read_table("test_struct.feather")
> >>> table2.column("col")
> 
> [
>   -- is_valid: all not null
>   -- child 0 type: int64
> [
>   0,
>   1,
>   2
> ]
>   -- child 1 type: int64
> [
>   1,
>   2,
>   3
> ]
> ]
> {code}





[jira] [Updated] (ARROW-8873) [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after delete

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8873:

Summary: [Plasma][C++] Usage model for Object IDs. Object IDs don't 
disappear after delete  (was: Usage model for Object IDs. Object IDs don't 
disappear after delete)

> [Plasma][C++] Usage model for Object IDs. Object IDs don't disappear after 
> delete
> -
>
> Key: ARROW-8873
> URL: https://issues.apache.org/jira/browse/ARROW-8873
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Python
>Affects Versions: 0.17.0
>Reporter: Abe Mammen
>Priority: Major
>
> I have an environment that uses Arrow + Plasma to send requests between 
> Python clients and a C++ server that responds with search results etc.
> I use a sequence-number-based approach for Object ID creation so it's 
> understood on both sides. All that works well. So each request from the 
> client creates a unique Object ID, creates and seals it etc. On the other 
> end, a get against that Object ID retrieves the request payload, releases and 
> deletes the Object ID. A similar response scheme for Object IDs is used from 
> the server side to the client to get search results etc where it creates its 
> own unique Object ID understood by the client. The server side creates and 
> seals and the Python client side does a get and deletes the Object ID (there 
> is no release method in Python it appears). I have experimented with deleting 
> the plasma buffer.
> The end result is that as transactions build up, the server side memory use 
> goes way up and I can see that a good # of the objects aren't deleted from 
> the Plasma store until the server exits. I have nulled out the search result 
> part too so that is not what is accumulating. I have not done a memory 
> profile but wanted to get some feedback on what might be wrong here.
> Is there a better way to use Object IDs for example? And what might be 
> causing the huge memory usage. In this example, I had ~4M transactions 
> between clients and the server which hit a memory usage of about 10+ GB which 
> is in the ballpark of the size of all the payloads. Besides doing 
> release-deletes on Object IDs, is there a better way to purge and remove 
> these objects?
> Any help is appreciated.
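
The sequence-number scheme described above can be sketched as follows, assuming separate ID spaces for requests and responses. Plasma object IDs are 20 bytes long; the prefixes `b"req:"` and `b"rsp:"` and the helper name `make_object_id` are hypothetical and not taken from the reporter's code.

```python
OBJECT_ID_SIZE = 20  # Plasma object IDs are 20 bytes long


def make_object_id(prefix: bytes, seq: int) -> bytes:
    """Build a deterministic 20-byte ID from a channel prefix and a sequence number."""
    width = OBJECT_ID_SIZE - len(prefix)
    if width <= 0:
        raise ValueError("prefix leaves no room for the sequence number")
    # to_bytes raises OverflowError if seq is negative or does not fit in width bytes.
    return prefix + seq.to_bytes(width, "big")
```

Because the ID depends only on the prefix and the counter, client and server can each compute it without exchanging anything. The resulting bytes would then be wrapped in a Plasma `ObjectID` on whichever side creates or fetches the object.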



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8801) [Python] Memory leak on read from parquet file with UTC timestamps using pandas

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8801:

Priority: Blocker  (was: Major)

> [Python] Memory leak on read from parquet file with UTC timestamps using 
> pandas
> ---
>
> Key: ARROW-8801
> URL: https://issues.apache.org/jira/browse/ARROW-8801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5, 
> mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, 
> ubuntu 20.04 (linux).
>Reporter: Rauli Ruohonen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Given dump.py script 
>  
> {code:python}
> import pandas as pd
> import numpy as np
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
> {code}
> and load.py script
>  
> {code:python}
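> import sys
> import pandas as pd
> def foo(engine):
>     # Read the same file 512 times, discarding each result
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()  # keep the process alive so its memory usage can be inspected
> foo(sys.argv[1])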
> {code}
> running "python dump.py" and then "python load.py pyarrow" leaves Python's 
> memory usage on my machine at 4+ GB while it waits for input. With 
> "python load.py fastparquet" instead, it is about 100 MB, so this should be 
> a pyarrow issue rather than a pandas issue. The leak disappears if 
> "utc=True" is removed from dump.py, in which case the timestamps are 
> timezone-naive.
>  
>  
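
To put numbers on the behaviour described above rather than reading them off a
process monitor, peak resident memory can be sampled after every read. The
sketch below is only an illustration and is not part of the original report:
the helper names peak_rss_mib and track_peak_rss are hypothetical, it uses only
the standard-library resource module (Unix only), and the reader is passed in
as a callable so that any engine can be tried.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(read_once, iterations=8):
    """Call read_once repeatedly, recording peak RSS after each call."""
    samples = []
    for _ in range(iterations):
        read_once()  # the result is discarded immediately, as in load.py
        samples.append(peak_rss_mib())
    return samples
```

With the file written by dump.py, one would presumably call
track_peak_rss(lambda: pd.read_parquet('data.parquet', engine='pyarrow')).
Because ru_maxrss is a high-water mark, samples that keep climbing suggest
memory is being retained, while samples that level off after the first few
reads suggest it is being reused.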





[jira] [Closed] (ARROW-8580) Pyarrow exceptions are not helpful

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8580.
---
Resolution: Cannot Reproduce

If you can provide a reproducible example of such an unhelpful error message, 
we will certainly fix it.

> Pyarrow exceptions are not helpful
> --
>
> Key: ARROW-8580
> URL: https://issues.apache.org/jira/browse/ARROW-8580
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Soroush Radpour
>Priority: Major
>
> I'm trying to understand an exception raised in code that uses pyarrow, and 
> it is not very helpful:
>   File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>   OSError: IOError: b'Service Unavailable'. Detail: Python exception: RuntimeError
>
> It would be great if each of the three exceptions were unwrapped, with its 
> full stack trace and the error message that came with it.
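
As a rough illustration of what "unwrapped" could look like on the Python side,
the sketch below walks a chain of exceptions linked with "raise ... from ..."
and formats each one with its own traceback. It is not how pyarrow reports
errors: open_remote is a hypothetical stand-in that merely mimics the three
layers in the message above, and exception_chain and format_chain are
illustrative helper names.

```python
import traceback


def exception_chain(exc):
    """Return [exc, its cause, the cause's cause, ...], outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        # Prefer the explicit cause ("raise ... from ..."), then the
        # implicit context set when raising inside an except block.
        exc = exc.__cause__ or exc.__context__
    return chain


def format_chain(exc):
    """Format every exception in the chain with its own traceback."""
    return "\n".join(
        "".join(traceback.format_exception(type(e), e, e.__traceback__,
                                           chain=False))
        for e in exception_chain(exc)
    )


def open_remote():
    """Hypothetical stand-in that raises three nested exceptions."""
    try:
        try:
            raise RuntimeError("backend failure")
        except RuntimeError as err:
            # IOError is an alias of OSError in Python 3.
            raise IOError("Service Unavailable") from err
    except IOError as err:
        raise OSError("could not open file") from err
```

Catching the OSError raised by open_remote() and passing it to format_chain
prints three separate tracebacks, one per layer, which is roughly the level of
detail the report asks for.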





[jira] [Updated] (ARROW-8671) [C++] Use IPC body compression metadata approved in ARROW-300

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8671:

Priority: Critical  (was: Major)

> [C++] Use IPC body compression metadata approved in ARROW-300 
> --
>
> Key: ARROW-8671
> URL: https://issues.apache.org/jira/browse/ARROW-8671
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 1.0.0
>
>
> This will adapt the existing code to use the new metadata, while keeping 
> backward-compatibility code that recognizes the "experimental" metadata 
> written in 0.17.0.




