[jira] [Updated] (ARROW-8462) [Python] Crash in lib.concat_tables on Windows

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8462:

Summary: [Python] Crash in lib.concat_tables on Windows  (was: Crash in 
lib.concat_tables on Windows)

> [Python] Crash in lib.concat_tables on Windows
> --
>
> Key: ARROW-8462
> URL: https://issues.apache.org/jira/browse/ARROW-8462
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Tom Augspurger
>Priority: Major
>
> This crashes for me with pyarrow 0.16 on my Windows VM
> {code:python}
> import pyarrow as pa
> import pandas as pd
> t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
> print("concat")
> pa.lib.concat_tables([t])
> print('done')
> {code}
> Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
> info on Windows, unfortunately. With `python -X faulthandler` I see
> {code}
> Windows fatal exception: access violation
> Current thread 0x04f8 (most recent call first):
>   File "bug.py", line 6 in <module>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8293) [Python] Run flake8 on python/examples also

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8293:

Fix Version/s: 1.0.0

> [Python] Run flake8 on python/examples also
> ---
>
> Key: ARROW-8293
> URL: https://issues.apache.org/jira/browse/ARROW-8293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There are flakes in these files



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116250#comment-17116250
 ] 

Wes McKinney commented on ARROW-8214:
-

We will need to create a serialization scheme for general array expressions 
(for use with arrow/compute)

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-8180) [C++] Should default_memory_pool() be in arrow/type_fwd.h?

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8180.
---
Resolution: Not A Problem

Closing as not a problem

> [C++] Should default_memory_pool() be in arrow/type_fwd.h?
> --
>
> Key: ARROW-8180
> URL: https://issues.apache.org/jira/browse/ARROW-8180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This seemed somewhat odd to me. It might be better from an IWYU-perspective 
> to move this to arrow/memory_pool.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8173) [C++] Validate ChunkedArray()'s arguments

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116246#comment-17116246
 ] 

Wes McKinney commented on ARROW-8173:
-

{{ChunkedArray::MakeSafe}}?

> [C++] Validate ChunkedArray()'s arguments
> -
>
> Key: ARROW-8173
> URL: https://issues.apache.org/jira/browse/ARROW-8173
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> ChunkedArray has constraints on type uniformity of chunks which are currently 
> only expressed in comments. At minimum debug checks should be added to ensure 
> (for example) that an explicit type is shared by all chunks, at best the 
> public constructor should be replaced with 
> {{Result<std::shared_ptr<ChunkedArray>> ChunkedArray::Make(...)}}.
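
As a rough sketch of what such a validating factory could look like (plain standard C++ stand-ins rather than the real arrow classes; all names below are hypothetical, and the real API would return {{arrow::Result}} rather than {{std::optional}}):

{code:c++}
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Stand-ins: std::string plays the role of arrow::DataType and FakeChunk the
// role of arrow::Array. This only illustrates the validation, not the classes.
struct FakeChunk {
  std::string type;
};

// Hypothetical Make(): reject chunk lists whose types disagree with the
// explicit type instead of relying on comments or debug-only checks.
std::optional<std::vector<FakeChunk>> MakeChunked(std::vector<FakeChunk> chunks,
                                                  const std::string& expected_type) {
  for (const auto& chunk : chunks) {
    if (chunk.type != expected_type) {
      std::cerr << "chunk type " << chunk.type << " != " << expected_type << "\n";
      return std::nullopt;  // the real factory would return Status::Invalid(...)
    }
  }
  return chunks;
}

int main() {
  auto ok = MakeChunked({{"int64"}, {"int64"}}, "int64");
  auto bad = MakeChunked({{"int64"}, {"utf8"}}, "int64");
  std::cout << "ok=" << ok.has_value() << " bad=" << bad.has_value() << "\n";
}
{code}

The point is simply that type mismatches are rejected at construction time instead of being left to comments or debug-only assertions.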



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116245#comment-17116245
 ] 

Wes McKinney commented on ARROW-7871:
-

I unassigned the issue from myself. Perhaps some others can write a PR that 
adds simple wrappers to {{pyarrow.compute.call_function}} for the missing 
function types

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Or consider to deprecate/remove the pyarrow.compute module, and bind the 
> compute kernels as methods instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116244#comment-17116244
 ] 

Wes McKinney commented on ARROW-7871:
-

This is extremely easy now since functions/kernels can now be exposed in pure 
Python using only their names. 

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Wes McKinney
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Or consider to deprecate/remove the pyarrow.compute module, and bind the 
> compute kernels as methods instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7871) [Python] Expose more compute kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7871:
---

Assignee: (was: Wes McKinney)

> [Python] Expose more compute kernels
> 
>
> Key: ARROW-7871
> URL: https://issues.apache.org/jira/browse/ARROW-7871
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> Currently only the sum kernel is exposed.
> Or consider to deprecate/remove the pyarrow.compute module, and bind the 
> compute kernels as methods instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7822) [C++] Allocation free error Status constants

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116243#comment-17116243
 ] 

Wes McKinney commented on ARROW-7822:
-

I'm not sure that non-OK Status should ever be found on a performance hot path. 
That would indicate that Status is being used inappropriately for control flow. 
Unless I have misunderstood the issue?

> [C++] Allocation free error Status constants
> 
>
> Key: ARROW-7822
> URL: https://issues.apache.org/jira/browse/ARROW-7822
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>
> {{Status::state_}} could be made a tagged pointer without affecting the fast 
> path (passing around a non error status). The extra bit could be used to mark 
> a Status' state as heap allocated or not, allowing some error statuses to be 
> extremely cheap when their error state is known to be immutable. For example, 
> this would allow a cheap default of {{Result<>::status_}}.
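
A minimal sketch of the tagged-pointer idea (C++17 for guaranteed copy elision; all names are hypothetical and this is not the actual arrow::Status layout): the low bit of the stored pointer marks states that are statically allocated and immutable, so predefined error constants never touch the heap.

{code:c++}
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical state object; the real Status::State holds a code and message.
struct State {
  int code;
  std::string msg;
};

// A statically allocated error state shared by every copy of that error.
// (Alignment of State is > 1, so the low pointer bit is free to use as a tag.)
static const State kOutOfMemory{1, "Out of memory"};

class TaggedStatus {
 public:
  TaggedStatus() : bits_(0) {}  // OK status: null state, fast path unchanged

  // Wrap an immutable, statically allocated state: set the tag bit, never delete.
  static TaggedStatus Static(const State* s) {
    return TaggedStatus(reinterpret_cast<uintptr_t>(s) | kStaticBit);
  }
  // Wrap a heap-allocated state: untagged, owned and freed by the destructor.
  static TaggedStatus Heap(int code, std::string msg) {
    return TaggedStatus(reinterpret_cast<uintptr_t>(new State{code, std::move(msg)}));
  }

  ~TaggedStatus() {
    if (bits_ != 0 && !(bits_ & kStaticBit)) delete state();
  }
  TaggedStatus(const TaggedStatus&) = delete;             // copying elided for brevity
  TaggedStatus& operator=(const TaggedStatus&) = delete;

  bool ok() const { return bits_ == 0; }
  const State* state() const {
    return reinterpret_cast<const State*>(bits_ & ~kStaticBit);
  }

 private:
  static constexpr uintptr_t kStaticBit = 1;
  explicit TaggedStatus(uintptr_t bits) : bits_(bits) {}
  uintptr_t bits_;
};

int main() {
  TaggedStatus oom = TaggedStatus::Static(&kOutOfMemory);    // no allocation
  TaggedStatus io = TaggedStatus::Heap(5, "file not found"); // allocates as today
  std::cout << oom.state()->msg << " / " << io.state()->msg << "\n";
}
{code}

Copying is omitted for brevity; a real implementation would deep-copy heap states and share tagged static states freely.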



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7784) [C++] diff.cc is extremely slow to compile

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116238#comment-17116238
 ] 

Wes McKinney commented on ARROW-7784:
-

"QuadraticSpaceMyersDiff" is being instantiated for every Arrow type. Given 
that this code is not performance sensitive, I would suggest refactoring this 
code to only instantiate a single implementation of the Diff algorithm (rather 
than 25+ instantiations) and where relevant introduce a virtual interface for 
interacting with values in different-type arrays. 
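
A minimal illustration of that shape (plain standard C++, not the actual diff.cc code; names are made up): the algorithm body is compiled once against a small virtual interface, and only a thin comparator is instantiated per type.

{code:c++}
#include <cstdint>
#include <iostream>
#include <vector>

// One virtual interface for "are elements i and j equal?"; only this part
// needs a per-type implementation, so the diff algorithm itself is compiled once.
class ValueComparator {
 public:
  virtual ~ValueComparator() = default;
  virtual bool Equals(int64_t base_index, int64_t target_index) const = 0;
};

template <typename T>
class TypedComparator : public ValueComparator {
 public:
  TypedComparator(std::vector<T> base, std::vector<T> target)
      : base_(std::move(base)), target_(std::move(target)) {}
  bool Equals(int64_t i, int64_t j) const override { return base_[i] == target_[j]; }

 private:
  std::vector<T> base_, target_;
};

// Single, non-templated driver (here just counting the matching prefix to keep
// the sketch short); the Myers diff implementation would sit in this position.
int64_t CommonPrefix(const ValueComparator& cmp, int64_t base_len, int64_t target_len) {
  int64_t n = 0;
  while (n < base_len && n < target_len && cmp.Equals(n, n)) ++n;
  return n;
}

int main() {
  TypedComparator<int> cmp({1, 2, 3, 9}, {1, 2, 4, 9});
  std::cout << CommonPrefix(cmp, 4, 4) << "\n";  // prints 2
}
{code}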

> [C++] diff.cc is extremely slow to compile
> --
>
> Key: ARROW-7784
> URL: https://issues.apache.org/jira/browse/ARROW-7784
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> This comes up especially when doing an optimized build. {{diff.cc}} is always 
> enabled even if all components are disabled, and it takes multiple seconds to 
> compile. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7409) [C++][Python] Windows link error LNK1104: cannot open file 'python37_d.lib'

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116234#comment-17116234
 ] 

Wes McKinney commented on ARROW-7409:
-

[~rbocanegra] any update?

> [C++][Python] Windows link error LNK1104: cannot open file 'python37_d.lib'
> ---
>
> Key: ARROW-7409
> URL: https://issues.apache.org/jira/browse/ARROW-7409
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Raul Bocanegra
>Assignee: Raul Bocanegra
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: fix-msvc-link-python-debug.patch
>
>
> When I build arrow_python on Windows in debug mode it raises a link error 
> "{{LNK1104: cannot open file 'python37_d.lib'".}}
> I have been having a look at the CMake files, and it seems that we are forcing 
> a link against the release Python lib in debug mode.
> I have edited the CMake files in order to fix this bug, see 
> [^fix-msvc-link-python-debug.patch].
> It is just a 3-line change and makes the debug version of arrow_python link 
> on Windows.
> I could do a PR if you find it useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8939) [C++] Arrow-native C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8939:

Summary: [C++] Arrow-native C++ Data Frame-style programming interface for 
analytics (umbrella issue)  (was: [C++] Arrow C++ Data Frame-style programming 
interface for analytics (umbrella issue))

> [C++] Arrow-native C++ Data Frame-style programming interface for analytics 
> (umbrella issue)
> 
>
> Key: ARROW-8939
> URL: https://issues.apache.org/jira/browse/ARROW-8939
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This is an umbrella issue for the "C++ Data Frame" project that has been 
> discussed on the mailing list with the following Google docs overview
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit
> I will attach issues to this JIRA to help organize and track the project as 
> we make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8939) [C++] Arrow C++ Data Frame-style programming interface for analytics (umbrella issue)

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8939:
---

 Summary: [C++] Arrow C++ Data Frame-style programming interface 
for analytics (umbrella issue)
 Key: ARROW-8939
 URL: https://issues.apache.org/jira/browse/ARROW-8939
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is an umbrella issue for the "C++ Data Frame" project that has been 
discussed on the mailing list with the following Google docs overview

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit

I will attach issues to this JIRA to help organize and track the project as we 
make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7394) [C++][DataFrame] Implement zero-copy optimizations when performing Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-7394:
---

Assignee: Wes McKinney

> [C++][DataFrame] Implement zero-copy optimizations when performing Filter
> -
>
> Key: ARROW-7394
> URL: https://issues.apache.org/jira/browse/ARROW-7394
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> For high-selectivity filters (most elements included), it may be wasteful and 
> slow to copy large contiguous ranges of array chunks into the resulting 
> ChunkedArray. Instead, we can scan the filter boolean array and slice off 
> chunks of the source data rather than copying. 
> We will need to empirically determine how large the contiguous range needs to 
> be in order to merit the slice-based approach versus simple/native 
> materialization. For example, in a filter array like
> 1 0 1 0 1 0 1 0 1
> it would not make sense to slice 5 times because slicing carries some 
> overhead. But if we had
> 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 
> then performing 4 slices may be faster than doing a copy materialization. 
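
A sketch of the run-scanning step over a plain {{std::vector<bool>}} (an Arrow implementation would walk the filter bitmap instead, and the cutoff below is an arbitrary placeholder for the empirically determined threshold):

{code:c++}
#include <cstdint>
#include <iostream>
#include <vector>

struct Run {
  int64_t offset;
  int64_t length;
  bool slice;  // true: emit a zero-copy slice of the source; false: copy values
};

// Scan a boolean filter and group selected positions into contiguous runs.
// Runs of at least `min_slice_length` become slice candidates; the real
// cutoff would have to be determined empirically, as noted above.
std::vector<Run> PlanFilter(const std::vector<bool>& filter, int64_t min_slice_length) {
  std::vector<Run> runs;
  int64_t i = 0, n = static_cast<int64_t>(filter.size());
  while (i < n) {
    if (!filter[i]) { ++i; continue; }
    int64_t start = i;
    while (i < n && filter[i]) ++i;
    int64_t length = i - start;
    runs.push_back({start, length, length >= min_slice_length});
  }
  return runs;
}

int main() {
  std::vector<bool> filter = {true, false, true, true, true, true, true, false, true};
  for (const Run& r : PlanFilter(filter, /*min_slice_length=*/4)) {
    std::cout << (r.slice ? "slice" : "copy") << " [" << r.offset << ", "
              << r.offset + r.length << ")\n";
  }
}
{code}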



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7245) [C++] Allow automatic String -> LargeString promotions when concatenating tables

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116231#comment-17116231
 ] 

Wes McKinney commented on ARROW-7245:
-

Perhaps Concatenate can be reimplemented as a vector kernel, so that type 
promotions can be handled by the kernel execution machinery

> [C++] Allow automatic String -> LargeString promotions when concatenating 
> tables
> 
>
> Key: ARROW-7245
> URL: https://issues.apache.org/jira/browse/ARROW-7245
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> inspired by GitHub issue https://github.com/apache/arrow/issues/5874



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7316) [C++] compile error due to incomplete type for unique_ptr

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-7316.
---
Resolution: Cannot Reproduce

> [C++] compile error due to incomplete type for unique_ptr
> -
>
> Key: ARROW-7316
> URL: https://issues.apache.org/jira/browse/ARROW-7316
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
> Environment: WSL, conda, arrow version 0.15
>Reporter: Danny Kim
>Priority: Major
>
> Hi, 
> I am getting following compile error from Arrow c++
> {code:java}
> Warning: Can't read registry to find the necessary compiler setting 
> Make sure that Python modules winreg, win32api or win32con are installed.C 
> compiler: /home/danny/miniconda3/envs/DEV/bin/x86_64-conda_cos6-linux-gnu-cc 
> -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall 
> -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell 
> -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt 
> -O2 -pipe -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG 
> -D_FORTIFY_SOURCE=2 -O2 -fPIC 
> compile options: '-DBUILTIN_PARQUET_READER -I. 
> -I/home/danny/miniconda3/envs/DEV/include 
> -I/home/danny/miniconda3/envs/DEV/include/python3.7m -c'
> extra options: '-std=c++11 -g0 -O3'
> x86_64-conda_cos6-linux-gnu-cc: bodo/io/_parquet.cpp
> x86_64-conda_cos6-linux-gnu-cc: bodo/io/_parquet_reader.cpp
> cc1plus: warning: command line option '-Wstrict-prototypes' is valid for 
> C/ObjC but not for C++
> cc1plus: warning: command line option '-Wstrict-prototypes' is valid for 
> C/ObjC but not for C++
> In file included from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/memory:80:0,
>  from /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:22,
>  from 
> bodo/io/_parquet.cpp:13:/home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:
>  In instantiation of 'void std::default_delete<_Tp>::operator()(_Tp*) const 
> [with _Tp = arrow::RecordBatchReader]':
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:268:17:
>  required from 'std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = 
> arrow::RecordBatchReader; _Dp = 
> std::default_delete]'/home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:161:49:
>  required from here
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/unique_ptr.h:76:22:
>  error: invalid application of 'sizeof' to incomplete type 
> 'arrow::RecordBatchReader'
>  static_assert(sizeof(_Tp)>0, ^In file included from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr.h:52:0,
>  from 
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/memory:81,
>  from /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:22,
>  from bodo/io/_parquet.cpp:13:
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:
>  In instantiation of 'std::__shared_ptr<_Tp, _Lp>::__shared_ptr(_Yp*) 
> [with _Yp = arrow::RecordBatchReader;  = void; _Tp = 
> arrow::RecordBatchReader; __gnu_cxx::_Lock_policy _Lp = 
> (__gnu_cxx::_Lock_policy)2]':
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:1243:4:
>  required from 'std::__shared_ptr<_Tp, _Lp>::_SafeConv<_Yp> 
> std::__shared_ptr<_Tp, _Lp>::reset(_Yp*) [with _Yp = 
> arrow::RecordBatchReader; _Tp = arrow::RecordBatchReader; 
> __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2; 
> std::__shared_ptr<_Tp, _Lp>::_SafeConv<_Yp> = void]'
> /home/danny/miniconda3/envs/DEV/include/parquet/arrow/reader.h:164:29: 
> required from here
> /home/danny/miniconda3/envs/DEV/x86_64-conda_cos6-linux-gnu/include/c++/7.3.0/bits/shared_ptr_base.h:1082:25:
>  error: invalid application of 'sizeof' to incomplete type 
> 'arrow::RecordBatchReader'
>  static_assert( sizeof(_Yp) > 0, "incomplete type" );
>  ^
> error: Command 
> "/home/danny/miniconda3/envs/DEV/bin/x86_64-conda_cos6-linux-gnu-cc 
> -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall 
> -Wstrict-prototypes -march=nocona -mtune=haswell -ftree-vectorize -fPIC 
> -fstack-protector-strong -fno-plt -O2 -pipe -march=nocona -mtune=haswell 
> -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe 
> -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
> -fno-plt -O2 -ffunction-sections -pipe -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -fPIC 
> -DBUILTIN_PARQUET_READER -I. -I/home/danny/miniconda3/envs/DEV/include 
> 

[jira] [Resolved] (ARROW-7230) [C++] Use vendored std::optional instead of boost::optional in Gandiva

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7230.
-
  Assignee: Neal Richardson  (was: Projjal Chanda)
Resolution: Fixed

This was done in 
https://github.com/apache/arrow/commit/96217193fc726b675969e91e86a63407bc8dce99

> [C++] Use vendored std::optional instead of boost::optional in Gandiva
> --
>
> Key: ARROW-7230
> URL: https://issues.apache.org/jira/browse/ARROW-7230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Gandiva
>Reporter: Wes McKinney
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> This may help with overall codebase consistency



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7179) [C++][Compute] Coalesce kernel

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116227#comment-17116227
 ] 

Wes McKinney commented on ARROW-7179:
-

We can implement this either as a Binary or VarArgs scalar kernel

> [C++][Compute] Coalesce kernel
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Add a kernel which replaces null values in an array with a scalar value or 
> with values taken from another array:
> {code}
> coalesce([1, 2, null, 3], 5) -> [1, 2, 5, 3]
> coalesce([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}
> The code in {{take_internal.h}} should be of some use with a bit of 
> refactoring.
> A filter Expression should be added at the same time.
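
For reference, the semantics in the examples above map onto plain C++ roughly as follows ({{std::optional}} stands in for Arrow's nulls; this only illustrates the intended behavior, not the kernel implementation):

{code:c++}
#include <iostream>
#include <optional>
#include <string>
#include <vector>

using Value = std::optional<int>;  // std::nullopt stands in for an Arrow null

// Array/scalar variant: replace nulls with a fixed fill value.
std::vector<Value> Coalesce(const std::vector<Value>& values, int fill) {
  std::vector<Value> out;
  for (const Value& v : values) out.push_back(v.has_value() ? v : Value(fill));
  return out;
}

// Array/array variant: take the replacement from the second array; the result
// stays null only where both inputs are null.
std::vector<Value> Coalesce(const std::vector<Value>& values,
                            const std::vector<Value>& fallback) {
  std::vector<Value> out;
  for (size_t i = 0; i < values.size(); ++i) {
    out.push_back(values[i].has_value() ? values[i] : fallback[i]);
  }
  return out;
}

int main() {
  auto print = [](const std::vector<Value>& a) {
    for (const Value& v : a) std::cout << (v ? std::to_string(*v) : "null") << " ";
    std::cout << "\n";
  };
  print(Coalesce({1, 2, std::nullopt, 3}, 5));                 // 1 2 5 3
  print(Coalesce({1, std::nullopt, std::nullopt, 3},
                 {5, 6, std::nullopt, 8}));                    // 1 6 null 3
}
{code}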



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116225#comment-17116225
 ] 

Wes McKinney commented on ARROW-7083:
-

Note that we should be able to add Gandiva-generated kernels (with some glue) 
to {{arrow::compute::Function}} instances. Perhaps we can create an 
{{arrow::compute::GandivaFunction}} that provides the wrapping magic

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116224#comment-17116224
 ] 

Wes McKinney commented on ARROW-7083:
-

I'm inclined to close this issue. After much study, I believe the best we can 
do is to take the single-value kernel implementations found in

https://github.com/apache/arrow/tree/master/cpp/src/gandiva/precompiled

and move them to inline-able header files. Then two things happen:

* These inline functions are translated to LLVM IR for use in Gandiva
* The inline functions form the basis for pre-compiled array kernels in 
arrow/compute
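
A rough sketch of that layout (file and function names are hypothetical): the scalar "unit of work" lives in a header as an inline function, the precompiled kernel is just a loop over it, and the same header would be compiled to LLVM IR for Gandiva.

{code:c++}
// scalar_arithmetic_inline.h (hypothetical shared header) -------------------
#include <cstdint>
#include <iostream>
#include <vector>

// Single-value "unit of work": written once, inlined into the precompiled
// kernel below and, in the proposed layout, also translated to LLVM IR for
// Gandiva's code generation.
inline int64_t AddScalar(int64_t left, int64_t right) { return left + right; }

// Precompiled vectorized kernel built on top of the inline function ---------
void AddArrays(const int64_t* left, const int64_t* right, int64_t length,
               int64_t* out) {
  for (int64_t i = 0; i < length; ++i) {
    out[i] = AddScalar(left[i], right[i]);  // the compiler inlines the scalar op
  }
}

int main() {
  std::vector<int64_t> a = {1, 2, 3}, b = {10, 20, 30}, out(3);
  AddArrays(a.data(), b.data(), 3, out.data());
  for (int64_t v : out) std::cout << v << " ";  // 11 22 33
  std::cout << "\n";
}
{code}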

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7075) [C++] Boolean kernels should not allocate in Call()

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-7075.
-
Resolution: Fixed

This was done in ARROW-8792

> [C++] Boolean kernels should not allocate in Call()
> ---
>
> Key: ARROW-7075
> URL: https://issues.apache.org/jira/browse/ARROW-7075
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> The boolean kernels currently allocate their value buffers ahead of time but 
> not their null bitmaps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116221#comment-17116221
 ] 

Wes McKinney edited comment on ARROW-7017 at 5/25/20, 7:56 PM:
---

I think the path forward here is to refactor to utilize common implementations 
of inline single-value functions for both the LLVM IR and pre-compiled kernels. 
In other words, what is currently in the gandiva/precompiled directory would be 
moved to some place where we can arrange so that these implementations are 
translated to LLVM IR for use in Gandiva, while available as inline C/C++ 
functions for use in creating pre-compiled vectorized kernels. Having multiple 
implementations of the scalar "unit of work" does not seem desirable

Note that Gandiva-generated kernels should be able (with some glue) to be 
registered in the new general function registry in arrow/compute/registry.h


was (Author: wesmckinn):
I think the path forward here is to refactor to utilize common implementations 
of for both the LLVM IR and pre-compiled kernels. In other words, what is 
currently in the gandiva/precompiled directory would be moved to some place 
where we can arrange so that these implementations are translated to LLVM IR 
for use in Gandiva, while available as inline C/C++ functions for use in 
creating pre-compiled vectorized kernels. Having multiple implementations of 
the scalar "unit of work" does not seem desirable

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116221#comment-17116221
 ] 

Wes McKinney commented on ARROW-7017:
-

I think the path forward here is to refactor to utilize common implementations 
for both the LLVM IR and pre-compiled kernels. In other words, what is 
currently in the gandiva/precompiled directory would be moved to some place 
where we can arrange so that these implementations are translated to LLVM IR 
for use in Gandiva, while available as inline C/C++ functions for use in 
creating pre-compiled vectorized kernels. Having multiple implementations of 
the scalar "unit of work" does not seem desirable

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7012) [C++] Clarify ChunkedArray chunking strategy and policy

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116220#comment-17116220
 ] 

Wes McKinney commented on ARROW-7012:
-

In general, this is not something that users should be too concerned with. The 
new kernels framework provides a configurability knob 
({{ExecContext::exec_chunksize}}) for selecting the upper limit for the size of 
chunks that are processed

> [C++] Clarify ChunkedArray chunking strategy and policy
> ---
>
> Key: ARROW-7012
> URL: https://issues.apache.org/jira/browse/ARROW-7012
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion on ARROW-6784 and [https://github.com/apache/arrow/pull/5686]. 
> Among the questions:
>  * Do Arrow users control the chunking, or is it an internal implementation 
> detail they should not manage?
>  * If users control it, how do they control it? E.g. if I call Take and use a 
> ChunkedArray for the indices to take, does the chunking follow how the 
> indices are chunked? Or should we attempt to preserve the mapping of data to 
> their chunks in the input table/chunked array?
>  * If it's an implementation detail, what is the optimal chunk size? And when 
> is it worth reshaping (concatenating, slicing) input data to attain this 
> optimal size? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-8905) [C++] Collapse Take APIs from 8 to 1 or 2

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-8905.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Duplicate

dup of ARROW-7009

> [C++] Collapse Take APIs from 8 to 1 or 2
> -
>
> Key: ARROW-8905
> URL: https://issues.apache.org/jira/browse/ARROW-8905
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> There are currently 8 {{arrow::compute::Take}} functions with different 
> function signatures. Fewer functions would make life easier for binding 
> developers



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8938) [R] Provide binding and argument packing to use arrow::compute::CallFunction to use any compute kernel from R dynamically

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8938:
---

 Summary: [R] Provide binding and argument packing to use 
arrow::compute::CallFunction to use any compute kernel from R dynamically
 Key: ARROW-8938
 URL: https://issues.apache.org/jira/browse/ARROW-8938
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney
 Fix For: 1.0.0


This will drastically simplify exposing new functions to R users



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6982) [R] Add bindings for compare and boolean kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116218#comment-17116218
 ] 

Wes McKinney commented on ARROW-6982:
-

Like ARROW-6978, wrapping {{CallFunction}} would allow dynamic invocation of 
any kernel from R

> [R] Add bindings for compare and boolean kernels
> 
>
> Key: ARROW-6982
> URL: https://issues.apache.org/jira/browse/ARROW-6982
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain Francois
>Priority: Major
>
> See cpp/src/arrow/compute/kernels/compare.h and boolean.h. ARROW-6980 
> introduces an Expression class that works on Arrow Arrays, but to evaluate 
> the expressions, it has to pull the data into R first. This would enable us 
> to do the work in C++ and only pull in the result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6978) [R] Add bindings for sum and mean compute kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116215#comment-17116215
 ] 

Wes McKinney commented on ARROW-6978:
-

R should expose {{arrow::compute::CallFunction}} so that kernel bindings can be 
provided without having to touch C++ code

> [R] Add bindings for sum and mean compute kernels
> -
>
> Key: ARROW-6978
> URL: https://issues.apache.org/jira/browse/ARROW-6978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6959) [C++] Clarify what signatures are preferred for compute kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6959.
---
  Assignee: Wes McKinney
Resolution: Fixed

This is addressed by the new {{arrow::compute::CallFunction}} API
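
For reference, dynamic invocation through that entry point looks roughly like this (a sketch based on the 1.0-era arrow/compute headers; exact includes, signatures, and the "sum" registration name may differ between versions):

{code:c++}
#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// Sketch: call a registered kernel by name through the generic entry point.
// Error handling is abbreviated; bindings only need to wrap CallFunction once
// instead of wrapping each kernel's convenience overloads individually.
arrow::Status RunExample() {
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3, 4}));
  std::shared_ptr<arrow::Array> array;
  ARROW_RETURN_NOT_OK(builder.Finish(&array));

  // Any function in the registry can be invoked by name with Datum arguments.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum result,
                        arrow::compute::CallFunction("sum", {arrow::Datum(array)}));
  std::cout << result.scalar()->ToString() << std::endl;
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunExample();
  if (!st.ok()) std::cerr << st.ToString() << std::endl;
  return st.ok() ? 0 : 1;
}
{code}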

> [C++] Clarify what signatures are preferred for compute kernels
> ---
>
> Key: ARROW-6959
> URL: https://issues.apache.org/jira/browse/ARROW-6959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: compute
> Fix For: 1.0.0
>
>
> Many of the compute kernels feature functions which accept only array inputs 
> in addition to functions which accept Datums. The former seems implicitly 
> like a convenience wrapper around the latter but I don't think this is 
> explicit anywhere. Is there a preferred overload for bindings to use? Is it 
> preferred that C++ implementers provide convenience wrappers for different 
> permutations of argument type? (for example, Filter now provides an overload 
> for record batch input as well as array input)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6956) [C++] Status should use unique_ptr

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6956.
---
Resolution: Won't Fix

I'm not comfortable with this. I think this falls into the "if it ain't broke" 
category 

> [C++] Status should use unique_ptr
> --
>
> Key: ARROW-6956
> URL: https://issues.apache.org/jira/browse/ARROW-6956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> The logic of Status::State is _very_  similar to unique_ptr except the deep 
> copy on copy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6856:

Fix Version/s: 1.0.0

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6923) [C++] Option for Filter kernel how to handle nulls in the selection vector

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6923:

Fix Version/s: 1.0.0

> [C++] Option for Filter kernel how to handle nulls in the selection vector
> --
>
> Key: ARROW-6923
> URL: https://issues.apache.org/jira/browse/ARROW-6923
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> How nulls are handled in the boolean mask (selection vector) in a filter 
> kernel varies between languages / data analytics systems (e.g. base R 
> propagates nulls, dplyr R skips (sees as False), SQL generally skips them as 
> well I think, Julia raises an error).
> Currently, in Arrow C++ we "propagate" nulls (null in the selection vector 
> gives a null in the output):
> {code}
> In [7]: arr = pa.array([1, 2, 3]) 
> In [8]: mask = pa.array([True, False, None]) 
> In [9]: arr.filter(mask) 
> Out[9]: 
> 
> [
>   1,
>   null
> ]
> {code}
> Given the different ways this could be done (propagate, skip, error), should 
> we provide an option to control this behaviour?
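
To make the trade-off concrete, here is a sketch of the "propagate" and "skip" behaviours over {{std::optional}} stand-ins (the enum and option names are hypothetical, not an existing Arrow API):

{code:c++}
#include <iostream>
#include <optional>
#include <vector>

using Value = std::optional<int>;
using MaskValue = std::optional<bool>;

enum class NullHandling { kPropagate, kSkip };  // hypothetical option names

// kPropagate keeps a null output slot for a null mask entry; kSkip treats a
// null mask entry as "false" and drops the row entirely.
std::vector<Value> Filter(const std::vector<Value>& values,
                          const std::vector<MaskValue>& mask, NullHandling nulls) {
  std::vector<Value> out;
  for (size_t i = 0; i < values.size(); ++i) {
    if (!mask[i].has_value()) {
      if (nulls == NullHandling::kPropagate) out.push_back(std::nullopt);
      continue;  // kSkip: drop the row
    }
    if (*mask[i]) out.push_back(values[i]);
  }
  return out;
}

int main() {
  std::vector<Value> arr = {1, 2, 3};
  std::vector<MaskValue> mask = {true, false, std::nullopt};
  std::cout << Filter(arr, mask, NullHandling::kPropagate).size() << " "  // 2: [1, null]
            << Filter(arr, mask, NullHandling::kSkip).size() << "\n";     // 1: [1]
}
{code}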



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6856) [C++] Use ArrayData instead of Array for ArrayData::dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116211#comment-17116211
 ] 

Wes McKinney commented on ARROW-6856:
-

Yes. I just added to the milestone

> [C++] Use ArrayData instead of Array for ArrayData::dictionary
> --
>
> Key: ARROW-6856
> URL: https://issues.apache.org/jira/browse/ARROW-6856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This would be helpful for consistency. {{DictionaryArray}} may want to cache 
> a "boxed" version of this to return from {{DictionaryArray::dictionary}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6799) [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6799.
---
Resolution: Cannot Reproduce

This is no longer an issue because Flatbuffers is not in our toolchain anymore

> [C++] Plasma JNI component links to flatbuffers::flatbuffers (unnecessarily?)
> -
>
> Key: ARROW-6799
> URL: https://issues.apache.org/jira/browse/ARROW-6799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does not appear to be tested in CI. Originally reported at 
> https://github.com/apache/arrow/issues/5575



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6523) [C++][Dataset] arrow_dataset target does not depend on anything

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6523:

Fix Version/s: 1.0.0

> [C++][Dataset] arrow_dataset target does not depend on anything
> ---
>
> Key: ARROW-6523
> URL: https://issues.apache.org/jira/browse/ARROW-6523
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Other subcomponents have targets to allow their libraries or unit tests to be 
> specifically built



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6514) [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6514.
---
Resolution: Not A Problem

Closing since we've moved on from LLVM 7

> [Developer][C++][CMake] LLVM tools are restricted to the exact version 7.0
> --
>
> Key: ARROW-6514
> URL: https://issues.apache.org/jira/browse/ARROW-6514
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>
> I have LLVM 7.1 installed locally, and FindClangTools couldn't locate it 
> because ARROW_LLVM_VERSION is [hardcoded to 
> 7.0|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/CMakeLists.txt#L91-L99]
>  and clang tools are [restricted to the minor 
> version|https://github.com/apache/arrow/blob/3f2a33f902983c0d395e0480e8a8df40ed5da29c/cpp/cmake_modules/FindClangTools.cmake#L78].
> If it makes sense to restrict clang tools location down to the minor version, 
> then we need to pass the located LLVM's version instead of the hardcoded one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6548) [Python] consistently handle conversion of all-NaN arrays across types

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6548:

Fix Version/s: 1.0.0

> [Python] consistently handle conversion of all-NaN arrays across types
> --
>
> Key: ARROW-6548
> URL: https://issues.apache.org/jira/browse/ARROW-6548
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> In ARROW-5682 (https://github.com/apache/arrow/pull/5333), next to fixing 
> actual conversion bugs, I added the ability to convert all-NaN float arrays 
> when converting to string type (and only with {{from_pandas=True}}). So this 
> now works:
> {code}
> >>> pa.array(np.array([np.nan, np.nan], dtype=float), type=pa.string())
> 
> [
>   null,
>   null
> ]
> {code}
> However, I only added this for string type (and it already works for float 
> and int types). If we are happy with this behaviour, we should also add it 
> for other types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6456:
---

Assignee: Wes McKinney

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6456) [C++] Possible to reduce object code generated in compute/kernels/take.cc?

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116202#comment-17116202
 ] 

Wes McKinney commented on ARROW-6456:
-

I will take care of this. 

> [C++] Possible to reduce object code generated in compute/kernels/take.cc?
> --
>
> Key: ARROW-6456
> URL: https://issues.apache.org/jira/browse/ARROW-6456
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> According to 
> https://gist.github.com/wesm/90f73d050a81cbff6772aea2203cdf93
> take.cc is our largest piece of object code in the codebase. This is a pretty 
> important function but I wonder if it's possible to make the implementation 
> "leaner" than it is currently to reduce generated code, without sacrificing 
> performance. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6261) [C++] Install any bundled components and add installed CMake or pkgconfig configuration to enable downstream linkers to utilize bundled libraries when statically linking

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6261.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

Closing in favor of the approach of splicing the bundled dependencies into 
libarrow.a

> [C++] Install any bundled components and add installed CMake or pkgconfig 
> configuration to enable downstream linkers to utilize bundled libraries when 
> statically linking
> -
>
> Key: ARROW-6261
> URL: https://issues.apache.org/jira/browse/ARROW-6261
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> The objective of this change would be to make it easier for toolchain 
> builders to ship bundled thirdparty libraries together with the Arrow 
> libraries in case there is a particular library version that is only used 
> when linking with {{libarrow.a}}. In theory configuration could be added to 
> arrowTargets.cmake (or pkgconfig) to simplify static linking



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6124) [C++] ArgSort kernel should sort in a single pass (with nulls)

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6124.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

Sorting on large or chunked inputs will probably not be achieved by a 
VectorKernel, but rather by a query execution node similar to various open 
source analytic databases

> [C++] ArgSort kernel should sort in a single pass (with nulls)
> --
>
> Key: ARROW-6124
> URL: https://issues.apache.org/jira/browse/ARROW-6124
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> There's a good chance that merge sort must be implemented (spill to disk, 
> ChunkedArray, ...)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6123) [C++] ArgSort kernel should not materialize the output internal

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116197#comment-17116197
 ] 

Wes McKinney commented on ARROW-6123:
-

[~fsaintjacques] could you clarify what you mean?

> [C++] ArgSort kernel should not materialize the output internal
> ---
>
> Key: ARROW-6123
> URL: https://issues.apache.org/jira/browse/ARROW-6123
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It should use the helpers since the output size is known.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6122) [C++] SortToIndices kernel must support FixedSizeBinary

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6122:

Summary: [C++] SortToIndices kernel must support FixedSizeBinary  (was: 
[C++] ArgSort kernel must support FixedSizeBinary)

> [C++] SortToIndices kernel must support FixedSizeBinary
> ---
>
> Key: ARROW-6122
> URL: https://issues.apache.org/jira/browse/ARROW-6122
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5980.
---
Resolution: Not A Problem

Our current wheels don't have this problem

> [Python] Missing libarrow.so and libarrow_python.so in wheel file
> -
>
> Key: ARROW-5980
> URL: https://issues.apache.org/jira/browse/ARROW-5980
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.14.0
>Reporter: Haowei Yu
>Priority: Major
>  Labels: wheel
>
> I have installed pyarrow 0.14.0, but it seems that by default symlinks for 
> libarrow.so and libarrow_python.so are not provided; only the .so files with a 
> version suffix are. Hence, I cannot use the output of pyarrow.get_libraries() 
> and pyarrow.get_library_dirs() to build my link options.
> If the symlinks were provided, I could pass the following to the linker to 
> specify the libraries to link, e.g. g++ -L/ -larrow -larrow_python
> However, right now the ld output complains about not being able to find 
> -larrow and -larrow_python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5916.
---
Resolution: Later

We didn't reach a conclusion on this so closing for now

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: test.arrow_ipc
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
> auto nodes = metadata_->nodes();
> // pop off a field
> if (field_index >= static_cast<int>(nodes->size())) {
>   return Status::Invalid("Ran out of field metadata, likely malformed");
> }
> const flatbuf::FieldNode* node = nodes->Get(field_index);
> *//out->length = node->length();*
> *out->length = metadata_->length();*
> out->null_count = node->null_count();
> out->offset = 0;
> return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116112#comment-17116112
 ] 

Wes McKinney commented on ARROW-5760:
-

I'd like to work on this next week if it's alright

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally- for 
> example when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice, first to determine how much character storage 
> will be needed then again into allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.
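
A sketch of the two-pass idea for taking strings, using an offsets buffer plus one character buffer the way Arrow's string arrays are laid out (plain std:: containers here, not the real builders): the first pass measures, the second copies into storage allocated exactly once.

{code:c++}
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct StringColumn {
  std::vector<int32_t> offsets;  // offsets.size() == length + 1
  std::string data;              // all characters back to back
};

StringColumn TakeStrings(const std::vector<std::string>& values,
                         const std::vector<int64_t>& indices) {
  // Pass 1: how much character storage will the output need?
  size_t total_chars = 0;
  for (int64_t i : indices) total_chars += values[i].size();

  // Pass 2: copy into exactly-sized buffers, with no incremental reallocation.
  StringColumn out;
  out.data.reserve(total_chars);
  out.offsets.reserve(indices.size() + 1);
  out.offsets.push_back(0);
  for (int64_t i : indices) {
    out.data.append(values[i]);
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}

int main() {
  StringColumn col = TakeStrings({"a", "bb", "ccc", "dddd"}, {3, 1, 1});
  for (size_t i = 0; i + 1 < col.offsets.size(); ++i) {
    std::cout << col.data.substr(col.offsets[i], col.offsets[i + 1] - col.offsets[i])
              << " ";  // dddd bb bb
  }
  std::cout << "\n";
}
{code}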



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5854) [Python] Expose compare kernels on Array class

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5854:

Fix Version/s: (was: 2.0.0)
   1.0.0

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5854) [Python] Expose compare kernels on Array class

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116113#comment-17116113
 ] 

Wes McKinney commented on ARROW-5854:
-

This should be fairly trivial now

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5760:

Fix Version/s: (was: 2.0.0)
   1.0.0

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally- for 
> example when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice, first to determine how much character storage 
> will be needed then again into allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5530) [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116111#comment-17116111
 ] 

Wes McKinney commented on ARROW-5530:
-

A HashOptions would also need to be introduced

> [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null 
> behavior
> 
>
> Key: ARROW-5530
> URL: https://issues.apache.org/jira/browse/ARROW-5530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5760) [C++] Optimize Take and Filter

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5760:
---

Assignee: Wes McKinney  (was: Ben Kietzman)

> [C++] Optimize Take and Filter
> --
>
> Key: ARROW-5760
> URL: https://issues.apache.org/jira/browse/ARROW-5760
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> There is some question of whether these kernels allocate optimally; for 
> example, when Filtering or Taking strings it might be more efficient to pass 
> over the filter/indices twice: first to determine how much character storage 
> will be needed, then again to copy into the allocated memory: 
> https://github.com/apache/arrow/pull/4531#discussion_r297160457
> Additionally, these kernels could probably make good use of scatter/gather 
> SIMD instructions.
> Furthermore, Filter's bitmap is currently lazily expanded into the indices of 
> elements to be appended to the output array. It would probably be more 
> efficient to expand to indices in batches, then gather using an index batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-5489) [C++] Normalize kernels and ChunkedArray behavior

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5489.
-
Fix Version/s: 1.0.0
 Assignee: Wes McKinney
   Resolution: Fixed

This is done in ARROW-8792

> [C++] Normalize kernels and ChunkedArray behavior
> -
>
> Key: ARROW-5489
> URL: https://issues.apache.org/jira/browse/ARROW-5489
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Some kernels (the wrappers, e.g. Unique) support ChunkedArray inputs, and 
> some don't. We should normalize this usage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5506) [C++] "Shredder" and "stitcher" functionality

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5506:

Fix Version/s: (was: 2.0.0)

> [C++] "Shredder" and "stitcher" functionality
> -
>
> Key: ARROW-5506
> URL: https://issues.apache.org/jira/browse/ARROW-5506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andrei Gudkov
>Priority: Major
>
> Discussion is here: [https://github.com/apache/arrow/pull/4066]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5506) [C++] "Shredder" and "stitcher" functionality

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5506.
---
Resolution: Won't Fix

> [C++] "Shredder" and "stitcher" functionality
> -
>
> Key: ARROW-5506
> URL: https://issues.apache.org/jira/browse/ARROW-5506
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Andrei Gudkov
>Priority: Major
>
> Discussion is here: [https://github.com/apache/arrow/pull/4066]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5193) [C++] Linker error with bundled zlib

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5193.
---
Resolution: Fixed

I believe this is fixed now

> [C++] Linker error with bundled zlib
> 
>
> Key: ARROW-5193
> URL: https://issues.apache.org/jira/browse/ARROW-5193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code}
> [98/146] Linking CXX executable debug/flight-test-integration-server
> FAILED: debug/flight-test-integration-server 
> : && /usr/bin/ccache /usr/lib/ccache/c++  -Wno-noexcept-type  
> -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
> -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2  -g  -rdynamic 
> src/arrow/flight/CMakeFiles/flight-test-integration-server.dir/test-integration-server.cc.o
>   -o debug/flight-test-integration-server  
> -Wl,-rpath,/home/antoine/arrow/bundledeps/cpp/build-test/debug 
> debug/libarrow_flight_testing.so.14.0.0 debug/libarrow_testing.so.14.0.0 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlienc-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlidec-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlicommon-static.a -ldl 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> /usr/lib/x86_64-linux-gnu/libboost_system.so 
> /usr/lib/x86_64-linux-gnu/libboost_filesystem.so 
> /usr/lib/x86_64-linux-gnu/libboost_regex.so 
> googletest_ep-prefix/src/googletest_ep/lib/libgtest_maind.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgtestd.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgmockd.a -ldl 
> ../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc++.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc.a 
> ../thirdparty/grpc_ep-install/lib/libgpr.a 
> ../thirdparty/cares_ep-install/lib/libcares.a 
> ../thirdparty/grpc_ep-install/lib/libaddress_sorting.a 
> gflags_ep-prefix/src/gflags_ep/lib/libgflags.a 
> googletest_ep-prefix/src/googletest_ep/lib/libgtestd.a 
> debug/libarrow_flight.so.14.0.0 
> ../thirdparty/protobuf_ep-install/lib/libprotobuf.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc++.a 
> ../thirdparty/grpc_ep-install/lib/libgrpc.a 
> ../thirdparty/grpc_ep-install/lib/libgpr.a 
> ../thirdparty/cares_ep-install/lib/libcares.a 
> ../thirdparty/grpc_ep-install/lib/libaddress_sorting.a 
> /usr/lib/x86_64-linux-gnu/libboost_system.so debug/libarrow.so.14.0.0 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlienc-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlidec-static.a 
> brotli_ep/src/brotli_ep-install/lib/libbrotlicommon-static.a -ldl 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -pthread -lrt 
> && :
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflateInit2_'
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflate'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflateInit2_'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflate'
> debug/libarrow_flight.so.14.0.0: undefined reference to `deflateEnd'
> debug/libarrow_flight.so.14.0.0: undefined reference to `inflateEnd'
> collect2: error: ld returned 1 exit status
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8933) [C++] Reduce generated code in vector_hash.cc

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8933:
---

 Summary: [C++] Reduce generated code in vector_hash.cc
 Key: ARROW-8933
 URL: https://issues.apache.org/jira/browse/ARROW-8933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Since hashing doesn't need to know about logical types, we can do the following:

* Use same generated code for both BinaryType and StringType
* Use same generated code for primitive types having the same byte width

These two changes should reduce binary size and improve compilation speed
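A rough sketch of the byte-width dispatch idea (invented names, not the actual vector_hash.cc code):

{code:c++}
#include <cstdint>

// One template instantiation per physical byte width, shared by every logical
// type of that width (e.g. int32, uint32, float32, date32 all use Width=4).
template <int Width>
void HashFixedWidth(const uint8_t* data, int64_t length, uint64_t* hashes) {
  for (int64_t i = 0; i < length; ++i) {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a, for illustration only
    const uint8_t* value = data + i * Width;
    for (int j = 0; j < Width; ++j) {
      h ^= value[j];
      h *= 1099511628211ULL;
    }
    hashes[i] = h;
  }
}

// The dispatcher looks only at the byte width, so adding a new 4-byte logical
// type does not generate any additional hashing code.
void HashValues(const uint8_t* data, int64_t length, int byte_width,
                uint64_t* hashes) {
  switch (byte_width) {
    case 1: HashFixedWidth<1>(data, length, hashes); break;
    case 2: HashFixedWidth<2>(data, length, hashes); break;
    case 4: HashFixedWidth<4>(data, length, hashes); break;
    case 8: HashFixedWidth<8>(data, length, hashes); break;
    default: break;  // variable-width types (Binary/String) would share one path as well
  }
}
{code}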



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5005) [C++] Implement support for using selection vectors in scalar aggregate function kernels

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5005:

Summary: [C++] Implement support for using selection vectors in scalar 
aggregate function kernels  (was: [C++] Add support for filter mask in 
AggregateFunction)

> [C++] Implement support for using selection vectors in scalar aggregate 
> function kernels
> 
>
> Key: ARROW-5005
> URL: https://issues.apache.org/jira/browse/ARROW-5005
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The aggregate kernels don't support a mask (the result of a filter). Add the 
> following method to `AggregateFunction`.
> {code:c++}
> virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* 
> state) const = 0;
> {code}
> The goal is to add support for AST similar to:
> {code:sql}
> SELECT AGG(x) FROM table WHERE pred;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5005) [C++] Implement support for using selection vectors in scalar aggregate function kernels

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116107#comment-17116107
 ] 

Wes McKinney commented on ARROW-5005:
-

I believe the best approach right now is to use selection vectors for this (see 
{{arrow::compute::ExecBatch::selection_vector}})
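As a rough picture of what a selection-vector-aware consume step does (standalone sketch; the actual ExecBatch/KernelContext plumbing is omitted):

{code:c++}
#include <cstdint>
#include <vector>

// Minimal sum-aggregate state that can consume either a whole batch or only
// the rows named by a selection vector (the physical analogue of WHERE pred).
struct SumState {
  int64_t sum = 0;

  void ConsumeAll(const std::vector<int64_t>& values) {
    for (int64_t v : values) sum += v;
  }

  void ConsumeSelected(const std::vector<int64_t>& values,
                       const std::vector<int32_t>& selection) {
    // Only the selected row positions contribute to the aggregate.
    for (int32_t row : selection) sum += values[row];
  }
};
{code}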

> [C++] Implement support for using selection vectors in scalar aggregate 
> function kernels
> 
>
> Key: ARROW-5005
> URL: https://issues.apache.org/jira/browse/ARROW-5005
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The aggregate kernels don't support a mask (the result of a filter). Add the 
> following method to `AggregateFunction`.
> {code:c++}
> virtual Status ConsumeWithFilter(const Array& input, const Array& mask, void* 
> state) const = 0;
> {code}
> The goal is to add support for AST similar to:
> {code:sql}
> SELECT AGG(x) FROM table WHERE pred;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116106#comment-17116106
 ] 

Wes McKinney edited comment on ARROW-5002 at 5/25/20, 3:10 PM:
---

I renamed the issue. I need to be able to execute hash aggregations in the next 
few months so I will be working to implement the appropriate machinery for this 
under arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)


was (Author: wesmckinn):
I renamed the issue. I need to be able to execute hash aggregations in the next 
few months so I will implement the appropriate machinery for this under 
arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116106#comment-17116106
 ] 

Wes McKinney commented on ARROW-5002:
-

I renamed the issue. I need to be able to execute hash aggregations in the next 
few months so I will implement the appropriate machinery for this under 
arrow/compute (since hash aggregations need to compose with array/kernel 
expressions)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5002:

Labels: query-engine  (was: )

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: query-engine
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5002) [C++] Implement Hash Aggregation query execution node

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5002:

Summary: [C++] Implement Hash Aggregation query execution node  (was: [C++] 
Implement GroupBy)

> [C++] Implement Hash Aggregation query execution node
> -
>
> Key: ARROW-5002
> URL: https://issues.apache.org/jira/browse/ARROW-5002
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Dear all,
> I wonder what the best way forward is for implementing GroupBy kernels. 
> Initially this was part of
> https://issues.apache.org/jira/browse/ARROW-4124
> but is not contained in the current implementation as far as I can tell.
> It seems that the part of group by that just returns indices could be 
> conveniently implemented with the HashKernel. That seems useful in any case. 
> Is that indeed the best way forward/should this be done?
> GroupBy + Aggregate could then either be implemented with that + the Take 
> kernel + aggregation involving more memory copies than necessary though or as 
> part of the aggregate kernel. Probably the latter is preferred, any thoughts 
> on that?
> Am I missing any other JIRAs related to this?
> Best, Philipp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-4798) [C++] Re-enable runtime/references cpplint check

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4798.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

The benchmark thing is enough of a nuisance that I won't bother with this. 
We've been pretty effective about catching mutable references in code reviews

> [C++] Re-enable runtime/references cpplint check
> 
>
> Key: ARROW-4798
> URL: https://issues.apache.org/jira/browse/ARROW-4798
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> This will help keep the codebase clean.
> We might consider defining some custom filters for cpplint warnings we want 
> to suppress, like it doesn't like {{benchmark::State&}} because of the 
> non-const reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4633:

Fix Version/s: 1.0.0

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Priority: Minor
>  Labels: dataset-parquet-read, newbie, parquet
> Fix For: 1.0.0
>
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {quote}if use_threads:
>     self.set_use_threads(use_threads)
> {quote}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok, as a call to 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {quote}import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame(\{'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> {quote}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4530) [C++] Review Aggregate kernel state allocation/ownership semantics

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116099#comment-17116099
 ] 

Wes McKinney commented on ARROW-4530:
-

You may have noticed that the aggregation API was iterated on in ARROW-8792. I 
think the current structure is adequate for non-hash aggregations, but we 
should think about how to implement aggregations that can also be used 
with hash aggregation (aka "GROUP BY")

> [C++] Review Aggregate kernel state allocation/ownership semantics
> --
>
> Key: ARROW-4530
> URL: https://issues.apache.org/jira/browse/ARROW-4530
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116096#comment-17116096
 ] 

Wes McKinney commented on ARROW-4333:
-

I partially addressed some of these questions in ARROW-8792, but there are 
other questions vis-à-vis memory reuse and dealing with ChunkedArrays. Perhaps 
it would be useful to go through these questions and discuss them in the 
context of the new generic kernel execution framework in arrow/compute

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4333:
---

Assignee: (was: Wes McKinney)

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4333:
---

Assignee: Wes McKinney

> [C++] Sketch out design for kernels and "query" execution in compute layer
> --
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4097) [C++] Add function to "conform" a dictionary array to a target new dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116095#comment-17116095
 ] 

Wes McKinney commented on ARROW-4097:
-

This can be implemented as a ScalarFunction, I think

> [C++] Add function to "conform" a dictionary array to a target new dictionary
> -
>
> Key: ARROW-4097
> URL: https://issues.apache.org/jira/browse/ARROW-4097
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Follow up work to ARROW-554. 
> Unifying multiple dictionary-encoded arrays is one use case. Another is 
> rewriting a DictionaryArray to be based on another dictionary. For example, 
> this would be used to implement Cast from one dictionary type to another.
> This will need to be able to insert nulls where there are values that are not 
> found in the target dictionary
> see also discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243025730
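Conceptually the operation is an index remap against the target dictionary, with nulls where a value is missing; a standalone sketch with invented names (not the proposed C++ API):

{code:c++}
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Remap dictionary indices from an old dictionary to a target dictionary.
// Values missing from the target dictionary become null (std::nullopt).
std::vector<std::optional<int32_t>> ConformIndices(
    const std::vector<std::string>& old_dict,
    const std::vector<std::string>& target_dict,
    const std::vector<int32_t>& old_indices) {
  std::unordered_map<std::string, int32_t> target_lookup;
  for (int32_t i = 0; i < static_cast<int32_t>(target_dict.size()); ++i) {
    target_lookup[target_dict[i]] = i;
  }

  std::vector<std::optional<int32_t>> new_indices;
  new_indices.reserve(old_indices.size());
  for (int32_t idx : old_indices) {
    auto it = target_lookup.find(old_dict[idx]);
    if (it == target_lookup.end()) {
      new_indices.push_back(std::nullopt);  // value not in the target dictionary
    } else {
      new_indices.push_back(it->second);
    }
  }
  return new_indices;
}
{code}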



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3978) [C++] Implement hashing, dictionary-encoding for StructArray

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3978:

Labels: query-engine  (was: )

> [C++] Implement hashing, dictionary-encoding for StructArray
> 
>
> Key: ARROW-3978
> URL: https://issues.apache.org/jira/browse/ARROW-3978
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: query-engine
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY section form a struct, which can be 
> incrementally hashed to produce dictionary codes suitable for computing 
> aggregates or any other purpose. 
> There are a few subtasks related to this, such as efficiently constructing a 
> record (that can be hashed quickly) to identify each "row" in the struct. 
> Maybe we should start with that first
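As a toy illustration of mapping key "rows" to dictionary codes (standalone sketch; a real implementation would hash compact key records rather than building strings):

{code:c++}
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns a dense code to each distinct (key1, key2) combination, which is the
// kind of state a hash aggregation needs for GROUP BY key1, key2.
std::vector<int32_t> EncodeGroupKeys(const std::vector<std::string>& key1,
                                     const std::vector<int64_t>& key2) {
  std::unordered_map<std::string, int32_t> seen;
  std::vector<int32_t> codes(key1.size());
  for (int64_t i = 0; i < static_cast<int64_t>(key1.size()); ++i) {
    // Build a flattened "record" identifying this row of the key struct.
    std::string record = key1[i] + '\0' + std::to_string(key2[i]);
    auto result = seen.emplace(record, static_cast<int32_t>(seen.size()));
    codes[i] = result.first->second;  // existing or freshly assigned group id
  }
  return codes;
}
{code}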



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on row groups with chunked columns

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3822:

Fix Version/s: 1.0.0

> [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on 
> row groups with chunked columns
> ---
>
> Key: ARROW-3822
> URL: https://issues.apache.org/jira/browse/ARROW-3822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If a BinaryArray / StringArray overflows a single column when reading a row 
> group, the resulting table will have a ChunkedArray. Using TableBatchReader 
> in 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> will therefore only return a part of the row group, discarding the rest



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8937) [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8937:
---

 Summary: [C++] Add "parse_strptime" function for string to 
timestamp conversions using the kernels framework
 Key: ARROW-8937
 URL: https://issues.apache.org/jira/browse/ARROW-8937
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This should be relatively straightforward to implement using the new kernels 
framework
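Per value, the kernel would essentially do a strptime-style parse and yield a timestamp or a null; a standalone sketch of that inner step (not the proposed function signature):

{code:c++}
#include <ctime>
#include <iomanip>
#include <optional>
#include <sstream>
#include <string>

// Parse one string with a strptime-style format; return seconds since epoch,
// or nullopt (a null slot in the output array) if parsing fails.
std::optional<int64_t> ParseOne(const std::string& s, const std::string& format) {
  std::tm tm = {};
  std::istringstream in(s);
  in >> std::get_time(&tm, format.c_str());
  if (in.fail()) {
    return std::nullopt;
  }
  // std::mktime interprets tm as local time; the real kernel would do its own
  // UTC epoch arithmetic and timezone handling.
  std::time_t t = std::mktime(&tm);
  if (t == static_cast<std::time_t>(-1)) {
    return std::nullopt;
  }
  return static_cast<int64_t>(t);
}
{code}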



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-3372) [C++] Introduce SlicedBuffer class

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3372.
---
Resolution: Won't Fix

> [C++] Introduce SlicedBuffer class
> --
>
> Key: ARROW-3372
> URL: https://issues.apache.org/jira/browse/ARROW-3372
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The purpose of this class will be to forward certain function calls to the 
> parent buffer, like a request for the device (CPU, GPU, etc.).
> As a result of this, we can remove the {{parent_}} member from {{Buffer}} as 
> that member is only there to support slices. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1846) [C++] Implement "any" reduction kernel for boolean data

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116060#comment-17116060
 ] 

Wes McKinney commented on ARROW-1846:
-

With fresh eyes and ARROW-8792 in the rear view mirror, I believe Any should be 
implemented as a ScalarAggregateFunction, with some way for aggregate functions 
to communicate to the KernelContext that they have short-circuited
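In isolation, the short-circuiting consume step might look like this (a sketch, not the ScalarAggregateFunction interface):

{code:c++}
#include <vector>

// "Any" reduction over boolean values with an explicit validity mask.
// The loop stops as soon as a true value is seen; a real kernel would also
// need a way to tell the executor that remaining chunks can be skipped.
struct AnyState {
  bool seen_true = false;

  void Consume(const std::vector<bool>& values, const std::vector<bool>& valid) {
    for (size_t i = 0; i < values.size() && !seen_true; ++i) {
      if (valid[i] && values[i]) {
        seen_true = true;  // short-circuit: no further values can change the result
      }
    }
  }

  void MergeFrom(const AnyState& other) { seen_true = seen_true || other.seen_true; }
};
{code}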

> [C++] Implement "any" reduction kernel for boolean data
> ---
>
> Key: ARROW-1846
> URL: https://issues.apache.org/jira/browse/ARROW-1846
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics, dataframe
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116043#comment-17116043
 ] 

Wes McKinney edited comment on ARROW-971 at 5/25/20, 2:03 PM:
--

The correct way to implement is now as {{arrow::compute::ScalarFunction}}


was (Author: wesmckinn):
The correct way to implement is as {{arrow::compute::ScalarFunction}}

> [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).
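A standalone sketch of the isnull path (invert the validity bytes, then clear the trailing bits past the array length):

{code:c++}
#include <cstdint>
#include <vector>

// Invert a validity bitmap to get an "is null" bitmap, making sure the bits
// past `length` in the final byte stay zero (they are not real values).
std::vector<uint8_t> InvertValidity(const std::vector<uint8_t>& validity,
                                    int64_t length) {
  std::vector<uint8_t> is_null(validity.size());
  for (size_t i = 0; i < validity.size(); ++i) {
    is_null[i] = static_cast<uint8_t>(~validity[i]);
  }
  // Zero out the unused trailing bits of the last byte (LSB bit order).
  int trailing = static_cast<int>(length % 8);
  if (trailing != 0 && !is_null.empty()) {
    uint8_t keep_mask = static_cast<uint8_t>((1u << trailing) - 1);
    is_null.back() &= keep_mask;
  }
  return is_null;
}
{code}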



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116062#comment-17116062
 ] 

Wes McKinney commented on ARROW-1888:
-

This should be implemented in scalar_cast_nested.cc

> [C++] Implement casts from one struct type to another (with same field names 
> and number of fields)
> --
>
> Key: ARROW-1888
> URL: https://issues.apache.org/jira/browse/ARROW-1888
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116043#comment-17116043
 ] 

Wes McKinney commented on ARROW-971:


The correct way to implement is as {{arrow::compute::ScalarFunction}}

> [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1570.
-
Fix Version/s: 1.0.0
 Assignee: Wes McKinney
   Resolution: Fixed

This was basically achieved in ARROW-8792. Further work can be done with 
specific follow ups

> [C++] Define API for creating a kernel instance from function of scalar input 
> and output with a particular signature
> 
>
> Key: ARROW-1570
> URL: https://issues.apache.org/jira/browse/ARROW-1570
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 1.0.0
>
>
> This could include an {{std::function}} instance (but these cannot be inlined 
> by the C++ compiler), but should also permit use with inline-able functions 
> or functors



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1574) [C++] Implement kernel function that converts a dense array to dictionary given known dictionary

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116054#comment-17116054
 ] 

Wes McKinney commented on ARROW-1574:
-

This would be a useful expansion of the functions in vector_hash.cc. We must 
introduce a {{HashOptions}} to be able to supply the known dictionary when 
invoking the functions

> [C++] Implement kernel function that converts a dense array to dictionary 
> given known dictionary
> 
>
> Key: ARROW-1574
> URL: https://issues.apache.org/jira/browse/ARROW-1574
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This may simply be a special case of cast using a dictionary type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1568) [C++] Implement "drop null" kernels that return array without nulls

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116050#comment-17116050
 ] 

Wes McKinney commented on ARROW-1568:
-

This can be implemented as an {{arrow::compute::VectorFunction}} because the 
size of the output array differs from the input, so this function is not valid 
in a SQL-like context

> [C++] Implement "drop null" kernels that return array without nulls
> ---
>
> Key: ARROW-1568
> URL: https://issues.apache.org/jira/browse/ARROW-1568
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1761) [C++] Multi argument operator kernel behavior for decimal columns

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116057#comment-17116057
 ] 

Wes McKinney commented on ARROW-1761:
-

This will need to be resolved once adding Decimal support to 
compute/kernels/scalar_arithmetic.cc

> [C++] Multi argument operator kernel behavior for decimal columns
> -
>
> Key: ARROW-1761
> URL: https://issues.apache.org/jira/browse/ARROW-1761
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Priority: Major
> Fix For: 2.0.0
>
>
> This is a JIRA to discuss the behavior of operator kernels that require more 
> than one decimal column input where the column types have a different 
> {{scale}} parameter.
> For example:
> {code}
> a: decimal(12, 2)
> b: decimal(10, 3)
> c = a + b
> {code}
> Arithmetic is the primary use case, but anything that needs to efficiently 
> operate on decimal columns with different scales would require this 
> functionality.
> I imagine that [~jnadeau] and folks at Dremio have thought about and solved 
> the problem in Java. If so, we should consider implementing this behavior in 
> C++. Otherwise, I'll do a bit of reading and digging to see how existing 
> systems efficiently handle this problem.
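For concreteness, the usual approach is to rescale the narrower-scale operand to the wider scale before adding; a tiny worked sketch using the example types above (plain int64 here; real decimal128 arithmetic must also track precision and overflow):

{code:c++}
#include <cstdint>

// a is decimal(12, 2), b is decimal(10, 3). Rescale a to scale 3, then add;
// the result has scale 3 (and a widened precision in a real implementation).
int64_t AddDecimals() {
  int64_t a_unscaled = 1234;             // represents 12.34  (scale 2)
  int64_t b_unscaled = 1234;             // represents 1.234  (scale 3)
  int64_t a_rescaled = a_unscaled * 10;  // 12340 -> 12.340 at scale 3
  return a_rescaled + b_unscaled;        // 13574 -> 13.574 at scale 3
}
{code}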



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1699) [C++] Forward, backward fill kernel functions

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116055#comment-17116055
 ] 

Wes McKinney commented on ARROW-1699:
-

These are VECTOR functions
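As a reference point, here is the shape of a forward-fill (ffill with limit) pass over a values/validity pair; this is a standalone sketch, not the eventual kernel signature:

{code:c++}
#include <cstdint>
#include <vector>

// Forward-fill nulls with the last valid value, filling at most `limit`
// consecutive nulls after each valid value (limit < 0 means no limit).
void ForwardFill(std::vector<double>* values, std::vector<bool>* valid,
                 int64_t limit) {
  bool have_prev = false;
  double prev = 0.0;
  int64_t filled_run = 0;
  for (size_t i = 0; i < values->size(); ++i) {
    if ((*valid)[i]) {
      prev = (*values)[i];
      have_prev = true;
      filled_run = 0;
    } else if (have_prev && (limit < 0 || filled_run < limit)) {
      (*values)[i] = prev;
      (*valid)[i] = true;
      ++filled_run;
    }
  }
}
{code}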

> [C++] Forward, backward fill kernel functions
> -
>
> Key: ARROW-1699
> URL: https://issues.apache.org/jira/browse/ARROW-1699
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
> Fix For: 2.0.0
>
>
> Like ffill / bfill in pandas (with limit)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1569) [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116051#comment-17116051
 ] 

Wes McKinney commented on ARROW-1569:
-

This can be implemented as a {{ScalarAggregateFunction}}. We should consider 
how to enable aggregate functions to short-circuit

> [C++] Kernel functions for determining monotonicity (ascending or descending) 
> for well-ordered types
> 
>
> Key: ARROW-1569
> URL: https://issues.apache.org/jira/browse/ARROW-1569
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> These kernels must offer some stateful variant so that monotonicity can be 
> determined across chunked arrays



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8936) [C++] Parallelize execution of arrow::compute::ScalarFunction

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8936:
---

 Summary: [C++] Parallelize execution of 
arrow::compute::ScalarFunction
 Key: ARROW-8936
 URL: https://issues.apache.org/jira/browse/ARROW-8936
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3120) [C++] Parallelize execution of ScalarAggregateFunction

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3120:

Description: After ARROW-8972, we have a generic chunk-based executor for 
aggregate functions. It should be relatively straightforward now to parallelize 
the execution loop of the executor.   (was: The general API for aggregation 
should be something like

{code}
Aggregator* aggregator = ...;
const Array& chunk = ...;

RETURN_NOT_OK(aggregator->Update(chunk));
{code})
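The parallel loop can be one local state per task plus a final merge, which matches the Consume/Merge/Finalize shape aggregates already have; a standalone sketch with a simple sum (not the actual executor code):

{code:c++}
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

// Each task consumes a subset of the chunks into its own state; the partial
// states are then merged into the final result.
int64_t ParallelSum(const std::vector<std::vector<int64_t>>& chunks) {
  std::vector<std::future<int64_t>> partials;
  for (const auto& chunk : chunks) {
    partials.push_back(std::async(std::launch::async, [&chunk]() {
      return std::accumulate(chunk.begin(), chunk.end(), int64_t{0});
    }));
  }
  int64_t total = 0;
  for (auto& f : partials) total += f.get();  // merge step
  return total;
}
{code}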

> [C++] Parallelize execution of ScalarAggregateFunction
> --
>
> Key: ARROW-3120
> URL: https://issues.apache.org/jira/browse/ARROW-3120
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
> Fix For: 2.0.0
>
>
> After ARROW-8972, we have a generic chunk-based executor for aggregate 
> functions. It should be relatively straightforward now to parallelize the 
> execution loop of the executor. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3120) [C++] Parallelize execution of ScalarAggregateFunction

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3120:

Summary: [C++] Parallelize execution of ScalarAggregateFunction  (was: 
[C++] Thread-safe parallel aggregator)

> [C++] Parallelize execution of ScalarAggregateFunction
> --
>
> Key: ARROW-3120
> URL: https://issues.apache.org/jira/browse/ARROW-3120
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
> Fix For: 2.0.0
>
>
> The general API for aggregation should be something like
> {code}
> Aggregator* aggregator = ...;
> const Array& chunk = ...;
> RETURN_NOT_OK(aggregator->Update(chunk));
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-2685) [C++] Implement kernels for in-place sorting of fixed-width contiguous arrays

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2685.
---
Fix Version/s: (was: 2.0.0)
   Resolution: Won't Fix

Since we have SortToIndices and Arrow arrays have immutable semantics, this 
should be solved in a different way

> [C++] Implement kernels for in-place sorting of fixed-width contiguous arrays
> -
>
> Key: ARROW-2685
> URL: https://issues.apache.org/jira/browse/ARROW-2685
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: analytics
>
> See discussion in https://github.com/apache/arrow/issues/2112. A kernel may 
> want to throw if the memory being sorted is shared (in which case the user 
> should copy, then sort)
> Sorting of chunked data is a more complex topic so that's out of scope for 
> this task



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-3079) [C++] Create initial collection of "ill-behaved CSVs" in apache/arrow-testing

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3079.
---
Resolution: Later

Let's address such issues incrementally as they arise

> [C++] Create initial collection of "ill-behaved CSVs" in apache/arrow-testing
> -
>
> Key: ARROW-3079
> URL: https://issues.apache.org/jira/browse/ARROW-3079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1567) [C++] Implement "fill null" kernels that replace null values with some scalar replacement value

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116049#comment-17116049
 ] 

Wes McKinney commented on ARROW-1567:
-

We can implement a new function with kernels of the form {{(array[T], scalar[T]) -> 
array[T]}}

> [C++] Implement "fill null" kernels that replace null values with some scalar 
> replacement value
> ---
>
> Key: ARROW-1567
> URL: https://issues.apache.org/jira/browse/ARROW-1567
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1565) [C++] Implement TopK/BottomK streaming execution nodes

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116048#comment-17116048
 ] 

Wes McKinney commented on ARROW-1565:
-

I reframed this issue as a query processing task. We need to be able to compute 
TopK/BottomK with chunked data

> [C++] Implement TopK/BottomK streaming execution nodes
> --
>
> Key: ARROW-1565
> URL: https://issues.apache.org/jira/browse/ARROW-1565
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 2.0.0
>
>
> Heap-based topk can compute these indices in O(n log k) time
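For reference, a standalone sketch of the heap approach for the largest-k indices (standard library only; not the eventual execution-node code):

{code:c++}
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Return the indices of the k largest values using a min-heap of size k,
// giving O(n log k) time overall.
std::vector<int64_t> TopKIndices(const std::vector<double>& values, size_t k) {
  using Entry = std::pair<double, int64_t>;  // (value, index), smallest on top
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (int64_t i = 0; i < static_cast<int64_t>(values.size()); ++i) {
    if (heap.size() < k) {
      heap.emplace(values[i], i);
    } else if (values[i] > heap.top().first) {
      heap.pop();
      heap.emplace(values[i], i);
    }
  }
  std::vector<int64_t> indices;
  while (!heap.empty()) {
    indices.push_back(heap.top().second);
    heap.pop();
  }
  return indices;  // ascending by value; reverse for descending order
}
{code}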



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1489) [C++] Add casting option to set unsafe casts to null rather than some garbage value

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116047#comment-17116047
 ] 

Wes McKinney commented on ARROW-1489:
-

This might yield some code bloat but would still be useful to have in our 
repertoire 

> [C++] Add casting option to set unsafe casts to null rather than some garbage 
> value
> ---
>
> Key: ARROW-1489
> URL: https://issues.apache.org/jira/browse/ARROW-1489
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Null is the obvious choice when certain casts fail, like string to number, 
> but in other kinds of unsafe casts there may be more ambiguity. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1329) [C++] Define "virtual table" interface

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116046#comment-17116046
 ] 

Wes McKinney commented on ARROW-1329:
-

Some time has passed but I plan to make some progress on this in June and July

> [C++] Define "virtual table" interface
> --
>
> Key: ARROW-1329
> URL: https://issues.apache.org/jira/browse/ARROW-1329
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> The idea is that a virtual table may reference Arrow data that is not yet 
> available in memory. The implementation will define the semantics of how 
> columns are loaded into memory. 
> A virtual column interface will need to accompany this. For example:
> {code:language=c++}
> std::shared_ptr<VirtualTable> vtable = ...;
> std::shared_ptr<VirtualColumn> vcolumn = vtable->column(i);
> std::shared_ptr<Column> column = vcolumn->Materialize();
> std::shared_ptr<Table> table = vtable->Materialize();
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-1565) [C++] Implement TopK/BottomK streaming execution nodes

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1565:

Summary: [C++] Implement TopK/BottomK streaming execution nodes  (was: 
[C++] "argtopk" and "argbottomk" functions for computing indices of largest or 
smallest elements)

> [C++] Implement TopK/BottomK streaming execution nodes
> --
>
> Key: ARROW-1565
> URL: https://issues.apache.org/jira/browse/ARROW-1565
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 2.0.0
>
>
> Heap-based topk can compute these indices in O(n log k) time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-1565) [C++] Implement TopK/BottomK streaming execution nodes

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1565:

Labels: Analytics query-engine  (was: Analytics)

> [C++] Implement TopK/BottomK streaming execution nodes
> --
>
> Key: ARROW-1565
> URL: https://issues.apache.org/jira/browse/ARROW-1565
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics, query-engine
> Fix For: 2.0.0
>
>
> Heap-based topk can compute these indices in O(n log k) time



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-1133) [C++] Convert all non-accessor function names to PascalCase

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1133.
---
Resolution: Not A Problem

We can go about fixing any mis-named APIs as we run across them

> [C++] Convert all non-accessor function names to PascalCase
> ---
>
> Key: ARROW-1133
> URL: https://issues.apache.org/jira/browse/ARROW-1133
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It seems Google has taken the "cheap functions can be lower case" out of 
> their style guide. I've been asked enough about "which style to use" that I 
> like the idea of UsePascalCaseForEverything 
> https://github.com/google/styleguide/commit/db0a26320f3e930c6ea7225ed53539b4fb31310c#diff-26120df7bca3279afbf749017c778545R4277



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8935) [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116080#comment-17116080
 ] 

Wes McKinney commented on ARROW-8935:
-

cc @uwe 

> [Python] Add necessary plumbing to enable Numba-generated functions to be 
> registered as functions in the global C++ function/kernels registry
> -
>
> Key: ARROW-8935
> URL: https://issues.apache.org/jira/browse/ARROW-8935
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8935) [Python] Add necessary plumbing to enable Numba-generated functions to be registered as functions in the global C++ function/kernels registry

2020-05-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-8935:
---

 Summary: [Python] Add necessary plumbing to enable Numba-generated 
functions to be registered as functions in the global C++ function/kernels 
registry
 Key: ARROW-8935
 URL: https://issues.apache.org/jira/browse/ARROW-8935
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2665) [Python/C++] Add index() method to find first occurrence of Python scalar

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116079#comment-17116079
 ] 

Wes McKinney commented on ARROW-2665:
-

I suggest implementing this as a short-circuiting aggregate kernel

> [Python/C++] Add index() method to find first occurrence of Python scalar
> 
>
> Key: ARROW-2665
> URL: https://issues.apache.org/jira/browse/ARROW-2665
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe Korn
>Priority: Major
>  Labels: Analytics, beginner
> Fix For: 2.0.0
>
>
> Python lists have an {{index(x, start, end)}} method to find the first 
> occurrence of an element. We should add a method with the same interface 
> supporting Python scalars on the typical triplet 
> {{Array/ChunkedArray/Columns}}.
> See also 
> https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type

2020-05-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116039#comment-17116039
 ] 

Wes McKinney commented on ARROW-488:


This could be implemented as a standalone function in the new kernels framework
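The conversion itself is just "NaN becomes null, everything else is cast"; a standalone sketch of that inner loop (invented names, not the eventual function):

{code:c++}
#include <cmath>
#include <cstdint>
#include <vector>

// Convert doubles holding integer data (with NaN for missing) into an int64
// values buffer plus a validity vector, as an Arrow integer array would need.
void FloatToIntWithNulls(const std::vector<double>& input,
                         std::vector<int64_t>* values,
                         std::vector<bool>* valid) {
  values->resize(input.size());
  valid->resize(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    if (std::isnan(input[i])) {
      (*values)[i] = 0;  // placeholder behind a null slot
      (*valid)[i] = false;
    } else {
      // Exact only while |value| stays within the ~2^53 integer range of double.
      (*values)[i] = static_cast<int64_t>(input[i]);
      (*valid)[i] = true;
    }
  }
}
{code}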

> [Python] Implement conversion between integer coded as floating points with 
> NaN to an Arrow integer type
> 
>
> Key: ARROW-488
> URL: https://issues.apache.org/jira/browse/ARROW-488
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> For example: if pandas has cast integer data to float, this would enable 
> the integer data to be recovered (so long as the values fall in the ~2^53 
> floating point range for exact integer representation)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions

2020-05-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-971:
---
Summary: [C++/Python] Implement Array.isvalid/notnull/isnull as scalar 
functions  (was: [C++/Python] Implement Array.isvalid/notnull/isnull)

> [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

