from:"Wes McKinney \(Jira\)"

[jira] [Created] (ARROW-17296) [Python] Doctest failure in pyarrow.parquet.read_metadata after 10.0.0 dev version update

2022-08-03 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17296:


 Summary: [Python] Doctest failure in pyarrow.parquet.read_metadata 
after 10.0.0 dev version update
 Key: ARROW-17296
 URL: https://issues.apache.org/jira/browse/ARROW-17296
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 10.0.0


The version update caused the doctest in this function to fail



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (ARROW-17259) [C++] Use shared_ptr less throughout arrow/compute

2022-07-29 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17259:


 Summary: [C++] Use shared_ptr less throughout 
arrow/compute
 Key: ARROW-17259
 URL: https://issues.apache.org/jira/browse/ARROW-17259
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 10.0.0


It turns out we generate a ton of code just copying and manipulating 
{{shared_ptr}} throughout arrow/compute, and especially in the 
configuration of the function/kernels registry. One function, 
{{RegisterScalarArithmetic}}, generates around 300 KB of code, which on 
inspecting the disassembly contains a significant amount of inlined 
{{shared_ptr}} template code. I attempted to refactor things to use 
{{const DataType*}} in function signatures, which removes quite a bit of 
code bloat and puts us on a path to using fewer {{shared_ptr}}s in general




[jira] [Created] (ARROW-17135) [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc

2022-07-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17135:


 Summary: [C++] Reduce code size in 
arrow/compute/kernels/scalar_compare.cc
 Key: ARROW-17135
 URL: https://issues.apache.org/jira/browse/ARROW-17135
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney


I had noticed the large symbol sizes in scalar_compare.cc when looking at the 
shared library. I had a quick hack on the plane to try to reduce the code size




[jira] [Created] (ARROW-17129) [C++][Compute] Improve memory efficiency in Grouper

2022-07-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17129:


 Summary: [C++][Compute] Improve memory efficiency in Grouper
 Key: ARROW-17129
 URL: https://issues.apache.org/jira/browse/ARROW-17129
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


There are APIs in arrow::compute::Grouper (GetUniques, Consume) which may be 
able to be refactored to write into preallocated memory or otherwise have a 
mode that does less mandatory allocation. We can investigate at some point




[jira] [Created] (ARROW-17100) [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353

2022-07-17 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17100:


 Summary: [C++][Parquet] Fix backwards compatibility for ParquetV2 
data pages written prior to 3.0.0 per ARROW-10353
 Key: ARROW-17100
 URL: https://issues.apache.org/jira/browse/ARROW-17100
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Parquet
Reporter: Wes McKinney
 Fix For: 9.0.0


As described in 
https://lists.apache.org/thread/xkrhgfpk9sr1mj74d4chz3r5yp3szt6c, the commit 

https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69

caused some files written prior to 3.0.0 to be unreadable. Given that the patch 
was small, this will hopefully not be too difficult to fix




[jira] [Created] (ARROW-17099) [Python] pyarrow build does not support RELWITHDEBINFO build type

2022-07-17 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-17099:


 Summary: [Python] pyarrow build does not support RELWITHDEBINFO 
build type
 Key: ARROW-17099
 URL: https://issues.apache.org/jira/browse/ARROW-17099
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I ran into this trying to bisect a Parquet regression that occurred between 
2.0.0 and 3.0.0 -- because CMAKE_BUILD_TYPE=debug adds -Werror, this can cause 
builds to fail, but we need debug symbols to use gdb




[jira] [Created] (ARROW-16929) [C++] Remove ExecBatchIterator

2022-06-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16929:


 Summary: [C++] Remove ExecBatchIterator
 Key: ARROW-16929
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Unassigned
 Created: 28/Jun/22 19:48
Priority: Major
 Fix For: 9.0.0


The only place left using it is in GroupBy in arrow/compute/exec/aggregate.cc. 
This can be refactored to use ExecSpan. As part of this removal, we should adapt 
the benchmarks for ExecSpanIterator to demonstrate the performance improvement 
there

[jira] [Created] (ARROW-16852) [C++] Migrate SCALAR_AGGREGATE, HASH_AGGREGATE functions to use ExecSpan

2022-06-17 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16852:


 Summary: [C++] Migrate SCALAR_AGGREGATE, HASH_AGGREGATE functions 
to use ExecSpan
 Key: ARROW-16852
 URL: https://issues.apache.org/jira/browse/ARROW-16852
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 9.0.0


Following work in ARROW-16824




[jira] [Created] (ARROW-16847) [C++] Rename or fix compute/kernels/aggregate_{mode, quantile}.cc modules to actually be aggregate functions

2022-06-16 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16847:


 Summary: [C++] Rename or fix compute/kernels/aggregate_{mode, 
quantile}.cc modules to actually be aggregate functions
 Key: ARROW-16847
 URL: https://issues.apache.org/jira/browse/ARROW-16847
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


These modules import VectorFunctions even though their file names state 
otherwise. Either they should implement aggregate functions or the files should 
be renamed to indicate that they are vector functions




[jira] [Created] (ARROW-16845) [C++] ArraySpan::IsNull/IsValid implementations are incorrect for union types

2022-06-16 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16845:


 Summary: [C++] ArraySpan::IsNull/IsValid implementations are 
incorrect for union types
 Key: ARROW-16845
 URL: https://issues.apache.org/jira/browse/ARROW-16845
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


The implementations are incorrect for union types because the first buffer of a 
union array is not a validity bitmap. Follow-up work from ARROW-16756
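
For illustration, a minimal Python sketch of how nullness can be derived for 
union values. It assumes, as a simplification, that each type id equals the 
child index and that children are plain Python lists with None marking nulls. 
The helper names are hypothetical and this is not the ArraySpan code; it only 
reflects that a union has no validity bitmap of its own, so the selected child 
decides whether a slot is null.

```python
from typing import Any, List, Sequence


def sparse_union_is_null(type_ids: Sequence[int],
                         children: Sequence[List[Any]],
                         i: int) -> bool:
    """Sparse union: every child has the same length as the union itself."""
    child = children[type_ids[i]]
    return child[i] is None


def dense_union_is_null(type_ids: Sequence[int],
                        offsets: Sequence[int],
                        children: Sequence[List[Any]],
                        i: int) -> bool:
    """Dense union: offsets[i] is the position inside the selected child."""
    child = children[type_ids[i]]
    return child[offsets[i]] is None
```

In both cases the first buffer holds type ids, which is why reading it as a 
bitmap gives wrong answers.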




[jira] [Created] (ARROW-16837) [C++] Investigate performance regressions observed in Unique, VisitArraySpanInline

2022-06-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16837:


 Summary: [C++] Investigate performance regressions observed in 
Unique, VisitArraySpanInline
 Key: ARROW-16837
 URL: https://issues.apache.org/jira/browse/ARROW-16837
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


See discussion in https://github.com/apache/arrow/pull/13364




[jira] [Created] (ARROW-16827) [C++] Refactor internal array sorting code to use ArraySpan

2022-06-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16827:


 Summary: [C++] Refactor internal array sorting code to use 
ArraySpan
 Key: ARROW-16827
 URL: https://issues.apache.org/jira/browse/ARROW-16827
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


I won't be tackling this in ARROW-16824 since this code will require more work 
to port




[jira] [Created] (ARROW-16824) [C++] Migrate non-ScalarKernel implementations to use ExecSpan, ArraySpan

2022-06-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16824:


 Summary: [C++] Migrate non-ScalarKernel implementations to use 
ExecSpan, ArraySpan
 Key: ARROW-16824
 URL: https://issues.apache.org/jira/browse/ARROW-16824
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 9.0.0


ARROW-16756 handles the scalar kernels. Migrate the rest of the kernels and 
remove the old ExecBatch-based exec API




[jira] [Created] (ARROW-16819) [C++] arrow::compute::CallFunction needs a batch length for nullary functions

2022-06-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16819:


 Summary: [C++] arrow::compute::CallFunction needs a batch length 
for nullary functions
 Key: ARROW-16819
 URL: https://issues.apache.org/jira/browse/ARROW-16819
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 9.0.0


This is a design deficiency in {{CallFunction}}. If a function is nullary, the 
execution machinery has no way to determine the output length from an empty 
vector of datums. We should change {{CallFunction}} to have variants based on 
{{ExecBatch}} and {{ExecSpan}} (from ARROW-16755)
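
To make the deficiency concrete, here is a toy Python sketch. The registry, 
the function names and the {{length}} parameter are all hypothetical and do 
not mirror the real {{CallFunction}} signature. It shows that the output 
length can be inferred from the inputs only when there is at least one input, 
so a nullary function needs the length passed explicitly.

```python
import random
from typing import Callable, Dict, Optional, Sequence

# Hypothetical registry: each function receives its inputs and the batch length.
_FUNCTIONS: Dict[str, Callable[[Sequence[list], int], list]] = {
    "negate": lambda args, length: [-x for x in args[0]],
    "random": lambda args, length: [random.random() for _ in range(length)],
}


def call_function(name: str, args: Sequence[list],
                  length: Optional[int] = None) -> list:
    """Dispatch to a registered function, inferring the length when possible."""
    if length is None:
        if not args:
            # Nothing to infer the output length from.
            raise ValueError(
                f"{name!r} is nullary: an explicit batch length is required")
        length = len(args[0])
    return _FUNCTIONS[name](args, length)
```

Calling {{call_function("random", [], length=4)}} returns four values, while 
omitting {{length}} raises an error.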




[jira] [Created] (ARROW-16758) [C++] Rewrite ExecuteScalarExpression to not use ScalarExecutor

2022-06-06 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16758:


 Summary: [C++] Rewrite ExecuteScalarExpression to not use 
ScalarExecutor
 Key: ARROW-16758
 URL: https://issues.apache.org/jira/browse/ARROW-16758
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


{{ExecuteScalarExpression}} sets up and tears down {{ScalarExecutor}} from 
exec.cc for each node in the expression tree. This adds a ton of overhead from 
moving around non-trivial objects. After ARROW-16756, we should write a new 
ScalarExpressionExecutor which is careful to construct ArraySpans and execute 
the expression tree in a much more lightweight / less bloated fashion. 

Follow-on work in a subsequent Jira will add a pool/stack of allocated 
temporary buffers to reuse during execution 




[jira] [Created] (ARROW-16757) [C++] Remove "scalar" output modality from array kernels

2022-06-06 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16757:


 Summary: [C++] Remove "scalar" output modality from array kernels
 Key: ARROW-16757
 URL: https://issues.apache.org/jira/browse/ARROW-16757
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Supporting scalar outputs from array kernels (where all the inputs are scalars) 
introduces needless complexity into the kernel implementations, causing 
duplication of effort and excess code generation for paltry benefit. In the 
scenario where all inputs are scalars, it would be better to promote them all 
to arrays of length 1 (either by creating the arrays or constructing an 
appropriate ArraySpan per ARROW-16756) and invoking the array code path. This 
would enable us to delete thousands of lines of code and ease the ongoing 
development and maintenance of the array kernels codebase
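
A rough Python sketch of the promotion idea, using plain lists in place of 
arrays. {{add_kernel}} and {{call_with_scalars}} are hypothetical names, and 
only the all-scalar case described above is covered.

```python
from typing import Any, Callable, List


def add_kernel(left: List[float], right: List[float]) -> List[float]:
    """Array-only kernel: element-wise addition of equal-length arrays."""
    return [a + b for a, b in zip(left, right)]


def call_with_scalars(kernel: Callable[..., list], *scalars: Any) -> Any:
    """Run an array-only kernel on all-scalar inputs.

    Each scalar is promoted to a length-1 array, the array code path runs,
    and the single output element is unwrapped again.
    """
    promoted = [[value] for value in scalars]
    (result,) = kernel(*promoted)
    return result
```

The kernel itself never needs a scalar branch; the wrapping and unwrapping 
live in one place.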




[jira] [Created] (ARROW-16756) [C++] Introduce initial ArraySpan, ExecSpan non-owning / shared_ptr-free data structures for kernel execution, refactor scalar kernels

2022-06-06 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16756:


 Summary: [C++] Introduce initial ArraySpan, ExecSpan non-owning / 
shared_ptr-free data structures for kernel execution, refactor scalar kernels
 Key: ARROW-16756
 URL: https://issues.apache.org/jira/browse/ARROW-16756
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 9.0.0


This is essential to reduce microperformance overhead, as has been discussed and 
investigated in many other places. This first stage of work is to remove the use 
of {{Datum}} and {{ExecBatch}} from the input side of only scalar kernels, so 
that we can work toward using span/view data structures as the inputs (and 
eventually outputs) of all kernels. 




[jira] [Created] (ARROW-16755) [C++] Improve array expression and kernel evaluation performance on small inputs

2022-06-06 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16755:


 Summary: [C++] Improve array expression and kernel evaluation 
performance on small inputs
 Key: ARROW-16755
 URL: https://issues.apache.org/jira/browse/ARROW-16755
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This is an umbrella issue for a variety of follow-up Jiras to refactor and 
improve the array kernels / function machinery to have less overhead and work 
more efficiently for parallel processing as well as small inputs (down to ~1000 
elements per kernel invocation)




[jira] [Created] (ARROW-16643) [C++] Fix -Werror CHECKIN build with clang-14

2022-05-24 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-16643:


 Summary: [C++] Fix -Werror CHECKIN build with clang-14
 Key: ARROW-16643
 URL: https://issues.apache.org/jira/browse/ARROW-16643
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 9.0.0


With clang-14, the C++ build fails on a handful of new warnings including 
{{-Wreturn-stack-address}}. Will submit patch




[jira] [Created] (ARROW-15111) [C++] Implement ODBC driver "wrapper" using FlightSQL

2021-12-14 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-15111:


 Summary: [C++] Implement ODBC driver "wrapper" using FlightSQL
 Key: ARROW-15111
 URL: https://issues.apache.org/jira/browse/ARROW-15111
 Project: Apache Arrow
  Issue Type: New Feature
  Components: FlightRPC
Reporter: Wes McKinney


The ODBC analogue to ARROW-7744




[jira] [Created] (ARROW-14303) [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata and serialized ARROW:schema value

2021-10-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-14303:


 Summary: [C++][Parquet] Do not duplicate Schema metadata in 
Parquet schema metadata and serialized ARROW:schema value
 Key: ARROW-14303
 URL: https://issues.apache.org/jira/browse/ARROW-14303
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 6.0.0


Metadata values are being duplicated in the Parquet file footer — we should 
either only store them in ARROW:schema or the Parquet schema metadata. Removing 
them from the Parquet schema metadata may break applications that are expecting 
that metadata to be there when serialized from Arrow, so dropping the keys from 
ARROW:schema is probably a safer choice




[jira] [Created] (ARROW-13469) [C++] Suppress -Wmissing-field-initializers in DayMilliseconds arrow/type.h

2021-07-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-13469:


 Summary: [C++] Suppress -Wmissing-field-initializers in 
DayMilliseconds arrow/type.h
 Key: ARROW-13469
 URL: https://issues.apache.org/jira/browse/ARROW-13469
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 6.0.0


The absence of default values for {{days}} and {{milliseconds}} triggers a 
compiler warning in some compilers. This could be resolved by setting the 
struct member default values to 0




[jira] [Created] (ARROW-13023) [Go] Upgrade "text" dependency to mitigate CVE

2021-06-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-13023:


 Summary: [Go] Upgrade "text" dependency to mitigate CVE
 Key: ARROW-13023
 URL: https://issues.apache.org/jira/browse/ARROW-13023
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Wes McKinney


See automated report https://github.com/apache/arrow/issues/10392




[jira] [Created] (ARROW-13021) [C++] Add/improve documentation about employing Arrow in downstream CMake projects

2021-06-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-13021:


 Summary: [C++] Add/improve documentation about employing Arrow in 
downstream CMake projects
 Key: ARROW-13021
 URL: https://issues.apache.org/jira/browse/ARROW-13021
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


In our C++ documentation, it may be useful to create a section about how we 
recommend introducing Arrow as a build / runtime dependency of downstream 
projects, particularly other CMake-based build systems. This would be at this 
level:

https://arrow.apache.org/docs/cpp/index.html

We have the "Minimal Build" example in the codebase which helps, but it may not 
cover all the various ways that people need to be able to depend on the 
project. 




[jira] [Created] (ARROW-12884) [Flight] Data checksumming support

2021-05-26 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12884:


 Summary: [Flight] Data checksumming support
 Key: ARROW-12884
 URL: https://issues.apache.org/jira/browse/ARROW-12884
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


Currently, there is not a built-in mechanism to allow for data integrity checks 
for FlightData messages. This issue is to discuss and see if there may be a way 
to add this to Flight without making things more complicated for the 
non-checksummed use case




[jira] [Created] (ARROW-12849) [C++] Implement scalar kernel function that computes "isin" for each element in a List array

2021-05-21 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12849:


 Summary: [C++] Implement scalar kernel function that computes 
"isin" for each element in a List array
 Key: ARROW-12849
 URL: https://issues.apache.org/jira/browse/ARROW-12849
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


The type signature would look like this:

{code}
(Array<List<T>>, Scalar<T>) -> Array<Boolean>
{code}
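
As an illustration of the intended semantics, a small Python sketch over 
plain lists. The function name is hypothetical, and the null handling (a null 
list yields a null result) is an assumption rather than something this issue 
specifies.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def list_contains(lists: Sequence[Optional[Sequence[T]]],
                  value: T) -> List[Optional[bool]]:
    """For each list element, report whether it contains `value`."""
    return [None if item is None else value in item for item in lists]
```

For example, {{list_contains([[1, 2], [3], None, []], 3)}} gives 
{{[False, True, None, False]}}.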




[jira] [Created] (ARROW-12530) [C++] Remove Buffer::mutable_data_ member and use const_cast on data_ only if is_mutable_ is true

2021-04-24 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12530:


 Summary: [C++] Remove Buffer::mutable_data_ member and use 
const_cast on data_ only if is_mutable_ is true
 Key: ARROW-12530
 URL: https://issues.apache.org/jira/browse/ARROW-12530
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 5.0.0


Proposed new implementation of mutable_data() by [~apitrou]

{code}
  uint8_t* mutable_data() {
    return is_mutable() ? const_cast<uint8_t*>(data()) : nullptr;
  }
{code}

This will help avoid various classes of bugs (initializing Buffer subclasses 
incorrectly) and make the object smaller on the heap




[jira] [Created] (ARROW-12495) [C++][Python] NumPy buffer sets is_mutable_ to true but does not set mutable_data_ when the NumPy array is writable

2021-04-21 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12495:


 Summary: [C++][Python] NumPy buffer sets is_mutable_ to true but 
does not set mutable_data_ when the NumPy array is writable
 Key: ARROW-12495
 URL: https://issues.apache.org/jira/browse/ARROW-12495
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney
 Fix For: 4.0.0


Bug is evident

{code}
NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) {
  PyAcquireGIL lock;
  arr_ = ao;
  Py_INCREF(ao);

  if (PyArray_Check(ao)) {
    PyArrayObject* ndarray = reinterpret_cast<PyArrayObject*>(ao);
    data_ = reinterpret_cast<const uint8_t*>(PyArray_DATA(ndarray));
    size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize;
    capacity_ = size_;

    if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) {
      // is_mutable_ is set here, but mutable_data_ is never assigned
      is_mutable_ = true;
    }
  }
}
{code}




[jira] [Created] (ARROW-12404) [C++] Implement "random" nullary function that generates uniform random between 0 and 1

2021-04-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12404:


 Summary: [C++] Implement "random" nullary function that generates 
uniform random between 0 and 1
 Key: ARROW-12404
 URL: https://issues.apache.org/jira/browse/ARROW-12404
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


This is similar to PostgreSQL's random() 

https://www.postgresql.org/docs/8.2/functions-math.html




[jira] [Created] (ARROW-12280) [Developer] Remove @-mentions from commit messages in merge tool

2021-04-07 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-12280:


 Summary: [Developer] Remove @-mentions from commit messages in 
merge tool
 Key: ARROW-12280
 URL: https://issues.apache.org/jira/browse/ARROW-12280
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 4.0.0


When someone @-mentions someone in their PR description, it triggers spam 
e-mails in GitHub's system to all the mentioned people each time someone 
synchronizes their fork. For example, this commit triggered an e-mail to me

https://github.com/bkietz/arrow/commit/b2fa55db273d44b14814d45dae8525b065e01a91

It would be fairly easy to sanitize @-mentions by simply stripping the @-symbol, 
with the right regular expression of course (the characters after the @ symbol 
can include hyphens or underscores, but are otherwise ASCII alphanumeric 
characters)
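
A possible shape for that regular expression, sketched in Python. The 
function name is hypothetical, and the rule that a mention must not directly 
follow a word character or another @ (so e-mail addresses are left alone) is 
an assumption on my part.

```python
import re

# '@' followed by an ASCII alphanumeric, then alphanumerics, hyphens or
# underscores. The lookbehind skips e-mail addresses such as dev@example.org.
_MENTION = re.compile(r"(?<![\w@])@([A-Za-z0-9][A-Za-z0-9_-]*)")


def strip_mentions(message: str) -> str:
    """Drop the '@' from @-mentions, keeping the user name itself."""
    return _MENTION.sub(r"\1", message)
```

With this, {{"cc @wesm"}} becomes {{"cc wesm"}} and nobody is notified when a 
fork is synchronized.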




[jira] [Created] (ARROW-11661) [C++] Compilation failure in arrow/scalar.cc on Xcode 8.3.3

2021-02-16 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-11661:


 Summary: [C++] Compilation failure in arrow/scalar.cc on Xcode 
8.3.3 
 Key: ARROW-11661
 URL: https://issues.apache.org/jira/browse/ARROW-11661
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


See https://gist.github.com/wesm/e3b52381de1556f2af669c7e2458afd0

It seems that this template construct is not supported so robustly across older 
compilers:

{code}
// timestamp to string
Status CastImpl(const TimestampScalar& from, StringScalar* to) {
  to->value = FormatToBuffer(internal::StringFormatter<Int64Type>{}, from);
  return Status::OK();
}

// date to string
template <typename D>
Status CastImpl(const DateScalar<D>& from, StringScalar* to) {
  TimestampScalar ts({}, timestamp(TimeUnit::MILLI));
  RETURN_NOT_OK(CastImpl(from, &ts));
  return CastImpl(ts, to);
}

// string to any
template <typename ScalarType>
Status CastImpl(const StringScalar& from, ScalarType* to) {
  ARROW_ASSIGN_OR_RAISE(auto out,
                        Scalar::Parse(to->type, util::string_view(*from.value)));
  to->value = std::move(checked_cast<ScalarType&>(*out).value);
  return Status::OK();
}

// binary to string
Status CastImpl(const BinaryScalar& from, StringScalar* to) {
  to->value = from.value;
  return Status::OK();
}

// formattable to string
template <typename ScalarType, typename T = typename ScalarType::TypeClass,
          typename Formatter = internal::StringFormatter<T>,
          // note: Value unused but necessary to trigger SFINAE if Formatter is
          // undefined
          typename Value = typename Formatter::value_type>
Status CastImpl(const ScalarType& from, StringScalar* to) {
  to->value = FormatToBuffer(Formatter{from.type}, from);
  return Status::OK();
}
{code}




[jira] [Created] (ARROW-11643) [C++] protobuf_ep failure on Xcode 8.3.3 / Apple LLVM 8.1

2021-02-16 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-11643:


 Summary: [C++] protobuf_ep failure on Xcode 8.3.3 / Apple LLVM 8.1
 Key: ARROW-11643
 URL: https://issues.apache.org/jira/browse/ARROW-11643
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


I randomly decided to see if we can still build and run on a pre-SSE4.2 machine 
(2009-era MacBook), but protobuf_ep fails with

{code}
FAILED: 
CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o
 
/Applications/Xcode8.3.3.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
  -DGOOGLE_PROTOBUF_CMAKE_BUILD -DHAVE_PTHREAD -DHAVE_ZLIB -I. 
-I/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src 
-Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -O3 -DNDEBUG -fPIC  
-Qunused-arguments -fcolor-diagnostics -O3 -DNDEBUG -O3 -DNDEBUG -fPIC   
-std=c++11 -MD -MT 
CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o
 -MF 
CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o.d
 -o 
CMakeFiles/libprotobuf.dir/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc.o
 -c 
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc
In file included from 
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/dynamic_message.cc:80:
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/map_field.h:332:37:
 error: constexpr constructor never produces a constant expression 
[-Winvalid-constexpr]
  explicit PROTOBUF_MAYBE_CONSTEXPR MapFieldBase(ConstantInitialized)
^
/Users/wesm/code/arrow/cpp/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/map_field.h:335:9:
 note: non-literal type 'internal::WrappedMutex' cannot be used in a constant 
expression
mutex_(GOOGLE_PROTOBUF_LINKER_INITIALIZED),
^
1 error generated.
{code}

Since this appears to be a warning, perhaps it can be suppressed




[jira] [Created] (ARROW-11391) [C++] HdfsOutputStream::Write unsafely truncates integers exceeding INT32_MAX

2021-01-26 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-11391:


 Summary: [C++] HdfsOutputStream::Write unsafely truncates integers 
exceeding INT32_MAX
 Key: ARROW-11391
 URL: https://issues.apache.org/jira/browse/ARROW-11391
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 4.0.0


Originally reported on user@, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.cc#L277

{{tSize}} is a 32-bit integer
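
To show the effect, a short Python sketch. It uses {{ctypes.c_int32}} to mimic 
what a narrowing cast typically does on two's-complement platforms, and then 
splits a large write into pieces that each fit in 32 bits. The helper names 
are hypothetical, and chunking is one possible remedy, not necessarily the fix 
that will be applied in hdfs.cc.

```python
import ctypes
from typing import Iterator

INT32_MAX = 2**31 - 1


def truncate_to_int32(nbytes: int) -> int:
    """Mimic narrowing a 64-bit length to a 32-bit signed integer."""
    return ctypes.c_int32(nbytes).value


def chunk_lengths(nbytes: int, max_chunk: int = INT32_MAX) -> Iterator[int]:
    """Yield successive write sizes, none larger than max_chunk."""
    remaining = nbytes
    while remaining > 0:
        n = min(remaining, max_chunk)
        yield n
        remaining -= n
```

A request for 2**31 bytes turns into a negative length after truncation, 
whereas the chunked sizes always add up to the requested total.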




[jira] [Created] (ARROW-11120) [Python][R] Prove out plumbing to pass data between Python and R using rpy2

2021-01-03 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-11120:


 Summary: [Python][R] Prove out plumbing to pass data between 
Python and R using rpy2
 Key: ARROW-11120
 URL: https://issues.apache.org/jira/browse/ARROW-11120
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python, R
Reporter: Wes McKinney


Per discussion on the mailing list, we should see what is required (if 
anything) to be able to pass data structures using the C interface between 
Python and R from the perspective of the Python user using rpy2. rpy2 is sort 
of the Python version of reticulate. Unit tests will then validate that it's 
working




[jira] [Created] (ARROW-11009) [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc

2020-12-22 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-11009:


 Summary: [Python] Add environment variable to elect default usage 
of system memory allocator instead of jemalloc/mimalloc
 Key: ARROW-11009
 URL: https://issues.apache.org/jira/browse/ARROW-11009
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 3.0.0


We routinely get reports like ARROW-11007 where there is suspicion of a memory 
leak (which may or may not be valid) — having an easy way (requiring no changes 
to application code) to toggle usage of the non-system memory allocator would 
help with determining whether the memory usage patterns are the result of the 
allocator being used. 
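
A sketch of the selection logic in Python. The variable name 
{{EXAMPLE_DEFAULT_MEMORY_POOL}}, the helper name and the fallback default are 
placeholders chosen for illustration only; nothing here is an existing 
pyarrow API.

```python
import os
from typing import Mapping

# Hypothetical variable name, used only for this sketch.
ENV_VAR = "EXAMPLE_DEFAULT_MEMORY_POOL"
BACKENDS = ("system", "jemalloc", "mimalloc")


def choose_allocator(environ: Mapping[str, str] = os.environ,
                     default: str = "jemalloc") -> str:
    """Return the allocator requested via the environment, else the default."""
    requested = environ.get(ENV_VAR, "").strip().lower()
    if not requested:
        return default
    if requested not in BACKENDS:
        raise ValueError(f"unsupported allocator {requested!r}")
    return requested
```

A user could then rerun an unchanged application with the variable set to 
{{system}} and compare memory usage.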




[jira] [Created] (ARROW-10658) [Python] Wheel builds for Apple Silicon

2020-11-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10658:


 Summary: [Python] Wheel builds for Apple Silicon
 Key: ARROW-10658
 URL: https://issues.apache.org/jira/browse/ARROW-10658
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Wes McKinney


We are only able to create Intel builds at the moment




[jira] [Created] (ARROW-10657) [CI] Continuous integration on Apple M1 architecture

2020-11-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10657:


 Summary: [CI] Continuous integration on Apple M1 architecture
 Key: ARROW-10657
 URL: https://issues.apache.org/jira/browse/ARROW-10657
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Developer Tools
Reporter: Wes McKinney
 Fix For: 3.0.0


It would be nice if we had some confidence that our next major release runs on 
Apple Silicon. I am looking at hooking up an M1 Mac Mini to Buildkite so that 
we are able to run CI jobs on one. If anyone else would like to contribute a 
machine to the build cluster, please be our guest




[jira] [Created] (ARROW-10648) [Java] Prepare Java codebase for source release without requiring any git tags to be created or pushed

2020-11-18 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10648:


 Summary: [Java] Prepare Java codebase for source release without 
requiring any git tags to be created or pushed
 Key: ARROW-10648
 URL: https://issues.apache.org/jira/browse/ARROW-10648
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools, Java
Reporter: Wes McKinney
 Fix For: 3.0.0


This makes the release process a lot more complex and makes it hard for us to 
create nightly source RCs




[jira] [Created] (ARROW-10598) [C++] Improve performance of GenerateBitsUnrolled

2020-11-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10598:


 Summary: [C++] Improve performance of GenerateBitsUnrolled 
 Key: ARROW-10598
 URL: https://issues.apache.org/jira/browse/ARROW-10598
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 3.0.0


internal::GenerateBitsUnrolled doesn't vectorize very well; there are some 
improvements we can make to get better code generation




[jira] [Created] (ARROW-10569) [C++][Python] Poor Table filtering performance

2020-11-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10569:


 Summary: [C++][Python] Poor Table filtering performance
 Key: ARROW-10569
 URL: https://issues.apache.org/jira/browse/ARROW-10569
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney
 Fix For: 3.0.0


From the mailing list

 
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)

df = pd.DataFrame({'data{}'.format(i): data
                   for i in range(100)})

df['key'] = np.random.randint(0, 100, size=num_rows)

rb = pa.record_batch(df)
t = pa.table(df)

I found that the performance of filtering a record batch is very similar:

In [22]: timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Whereas the performance of filtering a table is absolutely abysmal (no
idea what's going on here)

In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 {code}
 

[https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10567) [C++] Add options to help increase precision of arrow-flight-benchmark

2020-11-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10567:


 Summary: [C++] Add options to help increase precision of 
arrow-flight-benchmark
 Key: ARROW-10567
 URL: https://issues.apache.org/jira/browse/ARROW-10567
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10474) [C++] Implement vector kernel that transforms a boolean mask into selection indices

2020-11-02 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10474:


 Summary: [C++] Implement vector kernel that transforms a boolean 
mask into selection indices
 Key: ARROW-10474
 URL: https://issues.apache.org/jira/browse/ARROW-10474
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


See discussion in ARROW-10423
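
For illustration only, here is a minimal pure-Python sketch of what such a kernel computes. The name {{mask_to_indices}} is hypothetical, and dropping null mask entries is an assumption, not something decided in this issue.

```python
def mask_to_indices(mask):
    """Return the positions at which the boolean mask is True.

    Null (None) entries are skipped here; that choice is an assumption
    made for this sketch, not settled behavior.
    """
    return [i for i, keep in enumerate(mask) if keep is True]


indices = mask_to_indices([True, False, None, True])
```

The resulting indices could then be fed to a take-style selection instead of re-scanning the mask for every column.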



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10465) [C++] Faster PEXT for AMD CPUs

2020-11-02 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10465:


 Summary: [C++] Faster PEXT for AMD CPUs
 Key: ARROW-10465
 URL: https://issues.apache.org/jira/browse/ARROW-10465
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 3.0.0


See [https://twitter.com/InstLatX64/status/1322503571288559617]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10380) [C++] Running tests with ASAN, UBSAN using conda-forge compiler toolchain on macOS

2020-10-23 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10380:


 Summary: [C++] Running tests with ASAN, UBSAN using conda-forge 
compiler toolchain on macOS 
 Key: ARROW-10380
 URL: https://issues.apache.org/jira/browse/ARROW-10380
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


I tried running the test suite with ASAN/UBSAN enabled using the conda-forge 
toolchain (following the instructions in the Python documentation) and found 
that it's horribly broken, at least with the way that I'm running it. I would 
guess there is some additional configuration necessary or perhaps the compiler 
flags are wrong.

see for example

https://gist.github.com/wesm/88aa66f90a642fd0a051c4a7960de350

Here is what the compiler flags look like in the CMake output:

{code}
-- CMAKE_C_FLAGS: -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC 
-fPIE -fstack-protector-strong -O2 -pipe -isystem 
/Users/wesm/miniconda/envs/pyarrow-dev/include -Qunused-arguments -ggdb -O0  
-Wall -Wextra -Wdocumentation -Wno-missing-braces -Wno-unused-parameter 
-Wno-unknown-warning-option -Wno-constant-logical-operand -Werror 
-Wno-unknown-warning-option -Wno-pass-failed -march=haswell -mavx2 
-- CMAKE_CXX_FLAGS:  -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC 
-fPIE -fstack-protector-strong -O2 -pipe -stdlib=libc++ 
-fvisibility-inlines-hidden -std=c++14 -fmessage-length=0 -isystem 
/Users/wesm/miniconda/envs/pyarrow-dev/include -Qunused-arguments 
-fcolor-diagnostics -ggdb -O0  -Wall -Wextra -Wdocumentation 
-Wno-missing-braces -Wno-unused-parameter -Wno-unknown-warning-option 
-Wno-constant-logical-operand -Werror -Wno-unknown-warning-option 
-Wno-pass-failed -march=haswell -mavx2  -fsanitize=address -DADDRESS_SANITIZER 
-fsanitize=undefined -fno-sanitize=alignment,vptr,function,float-divide-by-zero 
-fno-sanitize-recover=all 
-fsanitize-blacklist=/Users/wesm/code/arrow/cpp/build-support/sanitizer-disallowed-entries.txt
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10351) [C++][Flight] See if reading/writing to gRPC get/put streams asynchronously helps performance

2020-10-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10351:


 Summary: [C++][Flight] See if reading/writing to gRPC get/put 
streams asynchronously helps performance
 Key: ARROW-10351
 URL: https://issues.apache.org/jira/browse/ARROW-10351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


We don't use any asynchronous concepts in the way that Flight is implemented 
now, i.e. IPC deconstruction/reconstruction (which may include compression!) is 
not performed concurrently with moving FlightData objects through the gRPC 
machinery, which may yield suboptimal performance. 

It might be better to apply an actor-type approach where a dedicated thread 
retrieves and prepares the next raw IPC message (within a Future) while the 
current IPC message is being processed -- that way reading/writing to/from the 
gRPC stream is not blocked on the IPC code doing its thing. 
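
A rough sketch of the prefetching idea, written in Python rather than C++ and not tied to any Flight or gRPC API. {{read_next}} is a hypothetical callable that returns the next raw message, or None at the end of the stream.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching(read_next):
    """Yield messages from read_next(), fetching one message ahead.

    read_next is a hypothetical callable returning the next raw message,
    or None once the stream is exhausted.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_next)
        while True:
            item = pending.result()
            if item is None:
                return
            # Start fetching the next message before handing this one
            # to the caller, so reading and processing overlap.
            pending = pool.submit(read_next)
            yield item


messages = iter([b"m1", b"m2", b"m3"])
received = list(prefetching(lambda: next(messages, None)))
```

Only one read is ever in flight, so the underlying stream is still consumed strictly in order.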



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10147) [Python] Constructing pandas metadata fails if an Index name is not JSON-serializable by default

2020-09-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10147:


 Summary: [Python] Constructing pandas metadata fails if an Index 
name is not JSON-serializable by default
 Key: ARROW-10147
 URL: https://issues.apache.org/jira/browse/ARROW-10147
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 2.0.0


originally reported in https://github.com/apache/arrow/issues/8270

here's a minimal reproduction:

{code}
In [24]: idx = pd.RangeIndex(0, 4, name=np.int64(6))
   

In [25]: df = pd.DataFrame(index=idx)   
   

In [26]: pa.table(df)   
   
---
TypeError Traceback (most recent call last)
 in 
> 1 pa.table(df)

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.table()

~/code/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/code/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, 
preserve_index, nthreads, columns, safe)
604 pandas_metadata = construct_metadata(df, column_names, 
index_columns,
605  index_descriptors, 
preserve_index,
--> 606  types)
607 metadata = deepcopy(schema.metadata) if schema.metadata else dict()
608 metadata.update(pandas_metadata)

~/code/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, 
column_names, index_levels, index_descriptors, preserve_index, types)
243 'version': pa.__version__
244 },
--> 245 'pandas_version': _pandas_api.version
246 }).encode('utf8')
247 }

~/miniconda/envs/arrow-3.7/lib/python3.7/json/__init__.py in dumps(obj, 
skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, 
default, sort_keys, **kw)
229 cls is None and indent is None and separators is None and
230 default is None and not sort_keys and not kw):
--> 231 return _default_encoder.encode(obj)
232 if cls is None:
233 cls = JSONEncoder

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in encode(self, o)
197 # exceptions aren't as detailed.  The list call should be 
roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in iterencode(self, o, 
_one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258 
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

~/miniconda/envs/arrow-3.7/lib/python3.7/json/encoder.py in default(self, o)
177 
178 """
--> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
180 f'is not JSON serializable')
181 

TypeError: Object of type int64 is not JSON serializable
{code}
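
One possible direction, sketched with the standard library only: pass a {{default}} hook to {{json.dumps}} that converts scalar-like objects to builtins. {{FakeInt64}} is a stand-in for {{np.int64}} (which also exposes {{.item()}}), and {{to_builtin}} is a hypothetical helper, not the fix that was applied.

```python
import json


class FakeInt64:
    """Stand-in for a NumPy scalar such as np.int64 (exposes .item())."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_builtin(obj):
    """json.dumps `default` hook: unwrap objects that expose .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(
        f"Object of type {obj.__class__.__name__} is not JSON serializable")


encoded = json.dumps({"name": FakeInt64(6)}, default=to_builtin)
```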



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10121) [C++][Python] Variable dictionaries do not survive roundtrip to IPC stream

2020-09-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10121:


 Summary: [C++][Python] Variable dictionaries do not survive 
roundtrip to IPC stream
 Key: ARROW-10121
 URL: https://issues.apache.org/jira/browse/ARROW-10121
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney
 Fix For: 2.0.0


Failing test case (from dev@ 
https://lists.apache.org/thread.html/r338942b4e9f9316b48e87aab41ac49c7ffedd45733d4a6349523b7eb%40%3Cdev.arrow.apache.org%3E)

{code}
import pyarrow as pa
from io import BytesIO

pa.__version__

schema = pa.schema([pa.field('foo', pa.int32()),
                    pa.field('bar', pa.dictionary(pa.int32(), pa.string()))])
r1 = pa.record_batch(
    [
        [1, 2, 3, 4, 5],
        pa.array(["a", "b", "c", "d", "e"]).dictionary_encode()
    ],
    schema
)

r1.validate()
r2 = pa.record_batch(
    [
        [1, 2, 3, 4, 5],
        pa.array(["c", "c", "e", "f", "g"]).dictionary_encode()
    ],
    schema
)

r2.validate()

assert r1.column(1).dictionary != r2.column(1).dictionary


sink =  pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, schema)

writer.write(r1)
writer.write(r2)

serialized = BytesIO(sink.getvalue().to_pybytes())
stream = pa.ipc.open_stream(serialized)

deserialized = []

while True:
    try:
        deserialized.append(stream.read_next_batch())
    except StopIteration:
        break

assert deserialized[1][1].to_pylist() == r2[1].to_pylist()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10117) [C++] Implement work-stealing scheduler / multiple queues in ThreadPool

2020-09-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10117:


 Summary: [C++] Implement work-stealing scheduler / multiple queues 
in ThreadPool
 Key: ARROW-10117
 URL: https://issues.apache.org/jira/browse/ARROW-10117
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This involves a change from a single task queue shared amongst all threads to a 
per-thread task queue and the ability for idle threads to take tasks from other 
threads' queues (work stealing). 

As part of this, the task submission API would need to be evolved in some 
fashion to allow for tasks related to a particular workload to end up in the 
same task queue
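
To make the queue discipline concrete, here is a small single-threaded Python sketch. The class and method names are hypothetical, there is no locking, and the owner-pops-newest / thief-takes-oldest policy is a common convention assumed here, not a design decision from this issue.

```python
from collections import deque


class WorkStealingQueues:
    """Per-worker task queues with stealing (illustrative, not thread-safe)."""

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]

    def submit(self, task, worker):
        # Related tasks can be routed to the same worker's queue.
        self.queues[worker].append(task)

    def next_task(self, worker):
        own = self.queues[worker]
        if own:
            return own.pop()  # owner takes its newest task
        for other, queue in enumerate(self.queues):
            if other != worker and queue:
                return queue.popleft()  # idle worker steals the oldest task
        return None


pool = WorkStealingQueues(num_workers=2)
pool.submit("a", worker=0)
pool.submit("b", worker=0)
stolen = pool.next_task(worker=1)
```

The explicit {{worker}} argument to {{submit}} stands in for whatever API change would let tasks of one workload land in the same queue.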



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10097) [C++] Persist SetLookupState in between usages of IsIn when filtering dataset batches

2020-09-25 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-10097:


 Summary: [C++] Persist SetLookupState in between usages of IsIn 
when filtering dataset batches
 Key: ARROW-10097
 URL: https://issues.apache.org/jira/browse/ARROW-10097
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 1.0.1
Reporter: Wes McKinney
 Fix For: 2.0.0


Building a large hash table has a non-trivial cost. 

See mailing list discussion

https://lists.apache.org/thread.html/rb85519cc21ffb09a836a9107919e07b076165ff81c22fb88b59a8296%40%3Cuser.arrow.apache.org%3E
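
The idea in miniature, as a Python sketch: build the lookup structure once and reuse it for every batch. {{IsInFilter}} is a hypothetical name and a {{frozenset}} stands in for the kernel's hash table.

```python
class IsInFilter:
    """Membership filter that builds its lookup table once."""

    def __init__(self, value_set):
        # The expensive step: done a single time, not once per batch.
        self._lookup = frozenset(value_set)

    def __call__(self, batch):
        return [value in self._lookup for value in batch]


is_in = IsInFilter([1, 3])
masks = [is_in(batch) for batch in ([1, 2, 3], [4], [3, 3])]
```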



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9983) [C++][Dataset] Use larger default batch size than 32K for Datasets API

2020-09-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9983:
---

 Summary: [C++][Dataset] Use larger default batch size than 32K for 
Datasets API
 Key: ARROW-9983
 URL: https://issues.apache.org/jira/browse/ARROW-9983
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0


Dremio uses 64K batch sizes. We could probably get away with even larger batch 
sizes (e.g. 256K or 1M) and allow memory-constrained users to elect a smaller 
batch size. 

See example of some performance issues related to this in ARROW-9924



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9924) [Python] Performance regression reading individual Parquet files using Dataset interface

2020-09-06 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9924:
---

 Summary: [Python] Performance regression reading individual 
Parquet files using Dataset interface
 Key: ARROW-9924
 URL: https://issues.apache.org/jira/browse/ARROW-9924
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 2.0.0


I haven't investigated very deeply but this seems symptomatic of a problem:

{code}
In [27]: df = pd.DataFrame({'A': np.random.randn(1000)})

  

In [28]: pq.write_table(pa.table(df), 'test.parquet')   

  

In [29]: timeit pq.read_table('test.parquet')   

  
79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)  

  
66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9843) [C++] Implement Between trinary kernel

2020-08-24 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9843:
---

 Summary: [C++] Implement Between trinary kernel
 Key: ARROW-9843
 URL: https://issues.apache.org/jira/browse/ARROW-9843
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


A specialized {{between(arr, left_bound, right_bound)}} kernel would avoid 
multiple scans and an AND operation
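
A pure-Python sketch of the difference, for illustration. Inclusive bounds on both sides are an assumption; the issue does not specify them.

```python
def between(values, left, right):
    """Single pass: test both bounds per element (inclusive bounds assumed)."""
    return [left <= v <= right for v in values]


def between_two_pass(values, left, right):
    """Equivalent result built from two comparisons plus an AND."""
    ge = [v >= left for v in values]
    le = [v <= right for v in values]
    return [a and b for a, b in zip(ge, le)]


data = [1, 5, 10, 12]
single = between(data, 5, 10)
```

The two-pass form materializes two temporary boolean arrays and reads the input twice, which is the overhead the specialized kernel removes.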



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9842) [C++] Explore alternative strategy for Compare kernel implementation for better performance

2020-08-24 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9842:
---

 Summary: [C++] Explore alternative strategy for Compare kernel 
implementation for better performance
 Key: ARROW-9842
 URL: https://issues.apache.org/jira/browse/ARROW-9842
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0


The compiler may be able to vectorize comparison operations if the bitpacking of 
results is deferred until the end (or in chunks). Instead, a temporary bytemap 
can be populated on a chunk-by-chunk basis and then the bytemaps can be 
bitpacked into the output buffer. This may also reduce the code size of the 
compare kernels (which are actually quite large at the moment)
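
A sketch of the two-stage approach in Python, only to show the data flow. The function name and chunk size are made up, and the packing uses least-significant-bit-first order as in Arrow bitmaps.

```python
def compare_greater_bitpacked(values, threshold, chunk_size=8):
    """Compare chunk by chunk into a bytemap, then pack the bytemap into bits."""
    out = bytearray((len(values) + 7) // 8)
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        # Stage 1: one byte per result; a simple loop with no bit twiddling.
        bytemap = bytes(v > threshold for v in chunk)
        # Stage 2: pack the temporary bytemap into the output bitmap.
        for offset, flag in enumerate(bytemap):
            if flag:
                pos = start + offset
                out[pos // 8] |= 1 << (pos % 8)
    return bytes(out)


packed = compare_greater_bitpacked([1, 5, 2, 7, 0, 9, 3, 8, 10], 4)
```

In C++ the first stage is the part a compiler has a chance of vectorizing, since it has no cross-iteration dependency on a shared output byte.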



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9761) [C++] Add experimental pull-based iterator structures to C interface implementation

2020-08-16 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9761:
---

 Summary: [C++] Add experimental pull-based iterator structures to 
C interface implementation
 Key: ARROW-9761
 URL: https://issues.apache.org/jira/browse/ARROW-9761
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0


The purpose of this would be to validate some initial use cases / workflows 
prior to potentially formalizing the interface in the C ABI



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9740) [C++] Add minimal build option to use ExternalProject to build Arrow from CMake

2020-08-14 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9740:
---

 Summary: [C++] Add minimal build option to use ExternalProject to 
build Arrow from CMake
 Key: ARROW-9740
 URL: https://issues.apache.org/jira/browse/ARROW-9740
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9705) [C++] Validate that intraday time is zeroed out in Date64 data

2020-08-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9705:
---

 Summary: [C++] Validate that intraday time is zeroed out in Date64 
data
 Key: ARROW-9705
 URL: https://issues.apache.org/jira/browse/ARROW-9705
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0
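
Date64 values are milliseconds since the UNIX epoch and are expected to be whole days, so the check amounts to a divisibility test. A minimal Python sketch (the function name is hypothetical, and nulls are represented as None):

```python
MS_PER_DAY = 86_400_000


def invalid_date64_positions(values):
    """Return positions whose millisecond value is not a whole number of days."""
    return [i for i, v in enumerate(values)
            if v is not None and v % MS_PER_DAY != 0]


bad = invalid_date64_positions([0, 86_400_000, 86_400_001, None])
```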






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9634) [C++][Python] Restore non-UTC time zones when reading Parquet file that was previously Arrow

2020-08-03 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9634:
---

 Summary: [C++][Python] Restore non-UTC time zones when reading 
Parquet file that was previously Arrow
 Key: ARROW-9634
 URL: https://issues.apache.org/jira/browse/ARROW-9634
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney
 Fix For: 2.0.0


This was reported on the mailing list

{code}
In [20]: df = pd.DataFrame({'a': pd.Series(np.arange(0, 1, 
1000)).astype(pd.DatetimeTZDtype('ns', 'America/Los_Angeles'
...: ))})   
   

In [21]: t = pa.table(df)   
   

In [22]: t  
   
Out[22]: 
pyarrow.Table
a: timestamp[ns, tz=America/Los_Angeles]

In [23]: pq.write_table(t, 'test.parquet')  
   

In [24]: pq.read_table('test.parquet')  
   
Out[24]: 
pyarrow.Table
a: timestamp[us, tz=UTC]

In [25]: pq.read_table('test.parquet')[0]   
   
Out[25]: 

[
  [
1970-01-01 00:00:00.00,
1970-01-01 00:00:00.01,
1970-01-01 00:00:00.02,
1970-01-01 00:00:00.03,
1970-01-01 00:00:00.04,
1970-01-01 00:00:00.05,
1970-01-01 00:00:00.06,
1970-01-01 00:00:00.07,
1970-01-01 00:00:00.08,
1970-01-01 00:00:00.09
  ]
]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9633) [C++] Do not toggle memory mapping globally in LocalFileSystem

2020-08-03 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9633:
---

 Summary: [C++] Do not toggle memory mapping globally in 
LocalFileSystem
 Key: ARROW-9633
 URL: https://issues.apache.org/jira/browse/ARROW-9633
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 2.0.0


In the context of the Datasets API, some file formats benefit greatly from 
memory mapping (like Arrow IPC files) while others benefit less. Additionally, in 
some scenarios, memory mapping could fail when used on network-attached storage 
devices. A filesystem may be used to read different kinds of files, some with 
memory mapping and some without, and the Datasets API should be able to fall 
back on non-memory-mapped reads if the attempt to memory map fails. It would 
therefore make sense to have a non-global option for this:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h

I would suggest adding a new filesystem API with something like 
{{OpenMappedInputFile}} with some options to control the behavior when memory 
mapping is not possible. These options may be among:

* Falling back on a normal RandomAccessFile
* Reading the entire file into memory (or even tmpfs?) and then wrapping it in 
a BufferReader
* Failing
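
The three options above could look roughly like this, sketched with Python's standard library rather than the Arrow filesystem API. {{open_mapped_input_file}} and the {{on_failure}} values are hypothetical names.

```python
import io
import mmap


def open_mapped_input_file(path, on_failure="fallback"):
    """Try to memory-map path; on failure act according to on_failure.

    on_failure: "fallback" (return a normal file object), "read_all"
    (read the whole file into an in-memory buffer) or "fail" (re-raise).
    """
    f = open(path, "rb")
    try:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (OSError, ValueError):
        if on_failure == "fallback":
            return f
        if on_failure == "read_all":
            with f:
                return io.BytesIO(f.read())
        f.close()
        raise
    f.close()  # the mapping keeps its own duplicate of the descriptor
    return mapped
```

Because the behavior is chosen per call, one filesystem object can serve both mapped and non-mapped reads.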



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9612) [Python] Automatically fall back on larger IO block size when JSON parsing fails

2020-07-31 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9612:
---

 Summary: [Python] Automatically fall back on larger IO block size 
when JSON parsing fails
 Key: ARROW-9612
 URL: https://issues.apache.org/jira/browse/ARROW-9612
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 2.0.0


From GitHub issue

https://github.com/apache/arrow/issues/7835

This seems like a less than ideal failure mode. Perhaps when this occurs it 
could automatically change to processing the file as a single block?
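
The retry could be as simple as the following sketch. {{parse_lines}} is a toy stand-in for a block-based reader (it is not the pyarrow JSON reader) that fails when a single line exceeds the block size.

```python
import json


def parse_lines(data, block_size):
    """Toy block-based parser: fails if one JSON line exceeds block_size."""
    rows = []
    for line in data.splitlines():
        if len(line) > block_size:
            raise ValueError("JSON object does not fit in one block")
        rows.append(json.loads(line))
    return rows


def parse_with_fallback(data, block_size):
    """Parse with the given block size; on failure retry as a single block."""
    try:
        return parse_lines(data, block_size)
    except ValueError:
        return parse_lines(data, len(data))


rows = parse_with_fallback(b'{"a": 1}\n{"a": 22222}\n', block_size=8)
```

A genuinely malformed file still fails on the second attempt, so the fallback only hides the block-size limitation.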



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9610) [Python] Struct types are unhandled in AppendNdarrayItem when converting from a Python sequence to Arrow array in python_to_arrow.cc

2020-07-31 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9610:
---

 Summary: [Python] Struct types are unhandled in AppendNdarrayItem 
when converting from a Python sequence to Arrow array in python_to_arrow.cc
 Key: ARROW-9610
 URL: https://issues.apache.org/jira/browse/ARROW-9610
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 2.0.0


See 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/python_to_arrow.cc#L759

and mailing list discussion

https://lists.apache.org/thread.html/r0a42d315df94997b7a01488d8309a0ad8f3b63997b8b29fdfb23932e%40%3Cuser.arrow.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9601) [C++][Flight] IpcWriteOptions do not appear to be propagated in DoGet requests

2020-07-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9601:
---

 Summary: [C++][Flight] IpcWriteOptions do not appear to be 
propagated in DoGet requests
 Key: ARROW-9601
 URL: https://issues.apache.org/jira/browse/ARROW-9601
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, FlightRPC
Reporter: Wes McKinney
 Fix For: 2.0.0


I haven't fully investigated this yet, but I have found that while compression 
(e.g. ZSTD) is respected in DoPut requests on the client side, it does not 
appear to propagate through DoGet requests. This may be a bug or by design, but 
I think it should be possible for the client to request that compression be 
employed when serving a DoGet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9525) [Python] Array.__str__ shows misleading output for timestamp types with time zone set

2020-07-19 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9525:
---

 Summary: [Python] Array.__str__ shows misleading output for 
timestamp types with time zone set
 Key: ARROW-9525
 URL: https://issues.apache.org/jira/browse/ARROW-9525
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney


The output is being shown with UTC interpretation

{code}
In [13]: arr = pa.array([0, 1, 2], type=pa.timestamp('ns', 
'America/Los_Angeles')) 

In [14]: arr.view('int64')  
   
Out[14]: 

[
  0,
  1,
  2
]

In [15]: arr.type   
   
Out[15]: TimestampType(timestamp[ns, tz=America/Los_Angeles])

In [16]: arr
   
Out[16]: 

[
  1970-01-01 00:00:00.0,
  1970-01-01 00:00:00.1,
  1970-01-01 00:00:00.2
]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9518) [Python] Deprecate Union-based serialization implemented by pyarrow.serialization

2020-07-17 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9518:
---

 Summary: [Python] Deprecate Union-based serialization implemented 
by pyarrow.serialization
 Key: ARROW-9518
 URL: https://issues.apache.org/jira/browse/ARROW-9518
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 2.0.0


Per mailing list discussion



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9500) [C++] Fix segfault with std::to_string in -O3 builds on gcc 7.5.0

2020-07-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9500:
---

 Summary: [C++] Fix segfault with std::to_string in -O3 builds on 
gcc 7.5.0
 Key: ARROW-9500
 URL: https://issues.apache.org/jira/browse/ARROW-9500
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


There seems to be a gcc bug related to {{std::to_string}} that only appears in 
{{-O3}} builds. It can be seen in something innocuous like

{code}
return Status::Invalid("Float value ", std::to_string(val), " was truncated 
converting to",
   *output.type());
{code}

where {{val}} is NaN. I haven't found a canonical reference, but using something 
other than {{std::to_string}} for the formatting (here just letting 
{{std::ostringstream}} take care of it) makes the problem go away.

I wasn't able to reproduce the issue with gcc-8.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9498) [C++][Parquet] Consider revamping RleDecoder based on "upstream" changes in Apache Impala

2020-07-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9498:
---

 Summary: [C++][Parquet] Consider revamping RleDecoder based on 
"upstream" changes in Apache Impala
 Key: ARROW-9498
 URL: https://issues.apache.org/jira/browse/ARROW-9498
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Since the initial code import in 2016, Impala made some improvements to 
RleDecoder that we might examine to see if they are beneficial for us

See https://github.com/apache/impala/blob/master/be/src/util/rle-encoding.h and 
history thereof



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9497) [C++][Parquet] Fix failure caused by malformed repetition/definition levels

2020-07-15 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9497:
---

 Summary: [C++][Parquet] Fix failure caused by malformed 
repetition/definition levels
 Key: ARROW-9497
 URL: https://issues.apache.org/jira/browse/ARROW-9497
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney


Fix a case discovered by OSS-Fuzz



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9451) [Python] Unsigned integer types will accept string values in pyarrow.array

2020-07-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9451:
---

 Summary: [Python] Unsigned integer types will accept string values 
in pyarrow.array
 Key: ARROW-9451
 URL: https://issues.apache.org/jira/browse/ARROW-9451
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


See

{code}
In [12]: pa.array(['5'], type='uint32') 

  
Out[12]: 

[
  5
]
{code}

Also:

{code}
In [9]: pa.scalar('5', type='uint8')

  
Out[9]: 

In [10]: pa.scalar('5', type='uint16')  

  
Out[10]: 

In [11]: pa.scalar('5', type='uint32')  

  
Out[11]: 
{code}

But:

{code}
In [13]: pa.array(['5'], type='int32')  

  
---
TypeError Traceback (most recent call last)
 in 
> 1 pa.array(['5'], type='int32')

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
267 else:
268 # ConvertPySequence does strict conversion if type is 
explicitly passed
--> 269 return _sequence_to_array(obj, mask, size, type, pool, 
c_from_pandas)
270 
271 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
 36 
 37 with nogil:
---> 38 check_status(ConvertPySequence(sequence, mask, options, ))
 39 
 40 if out.get().num_chunks() == 1:

TypeError: an integer is required (got type str)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9450) [Python] "pytest pyarrow" takes over 10 seconds to collect tests and start executing

2020-07-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9450:
---

 Summary: [Python] "pytest pyarrow" takes over 10 seconds to 
collect tests and start executing
 Key: ARROW-9450
 URL: https://issues.apache.org/jira/browse/ARROW-9450
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This seems to be a new development



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9446) [C++] Export compiler information in BuildInfo

2020-07-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9446:
---

 Summary: [C++] Export compiler information in BuildInfo
 Key: ARROW-9446
 URL: https://issues.apache.org/jira/browse/ARROW-9446
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


This may help improve debugging and reporting



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9442) [Python] Add pyarrow_wrap_table_no_validate API

2020-07-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9442:
---

 Summary: [Python] Add pyarrow_wrap_table_no_validate API
 Key: ARROW-9442
 URL: https://issues.apache.org/jira/browse/ARROW-9442
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I have discovered that the forced validation check in pyarrow_wrap_table can 
add 20-30% time to a call to {{RecordBatchStreamReader.read_all}}, whose result 
should be expected to be already valid. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9441) [C++] Optimize RecordBatchReader::ReadAll

2020-07-13 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9441:
---

 Summary: [C++] Optimize RecordBatchReader::ReadAll
 Key: ARROW-9441
 URL: https://issues.apache.org/jira/browse/ARROW-9441
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Based on perf reports, more time is spent manipulating C++ data structures than 
reconstructing record batches from IPC messages, which strikes me as not what 
we want.

Here is a perf report based on the following Python code:

{code}
for i in range(100):
    pa.ipc.open_stream('nyctaxi.arrow').read_all()
{code}

{code}
-   50.40% 0.06%  python   libarrow.so.100.0.0  [.] 
arrow::RecordBatchReader::ReadAll
   - 50.34% arrow::RecordBatchReader::ReadAll 
  - 25.86% arrow::Table::FromRecordBatches
 - 18.41% arrow::SimpleRecordBatch::column
- 16.00% arrow::MakeArray
   - 10.49% 
arrow::VisitTypeInline  
7.71% arrow::PrimitiveArray::SetData   
1.87% arrow::StringArray::StringArray  
   1.54% __pthread_mutex_lock  
   0.88% __pthread_mutex_unlock
   0.67% std::_Hash_bytes  
   0.60% arrow::ChunkedArray::ChunkedArray 
  - 22.30% arrow::RecordBatchReader::ReadAll   
 - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
- 15.91% arrow::ipc::ReadRecordBatchInternal
   - 15.15% arrow::ipc::LoadRecordBatch
  - 14.45% arrow::ipc::ArrayLoader::Load
 + 13.15% arrow::VisitTypeInline
+ 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage 
1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
{code}

Perhaps {{ChunkedArray}} internally should be changed to contain a vector of 
{{ArrayData}} instead of boxed Arrays. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9424) [C++][Parquet] Disable writing files with LZ4 codec

2020-07-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9424:
---

 Summary: [C++][Parquet] Disable writing files with LZ4 codec
 Key: ARROW-9424
 URL: https://issues.apache.org/jira/browse/ARROW-9424
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-9419) [C++] Test that "fill_null" function works with sliced inputs, expand tests

2020-07-12 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9419:
---

 Summary: [C++] Test that "fill_null" function works with sliced 
inputs, expand tests
 Key: ARROW-9419
 URL: https://issues.apache.org/jira/browse/ARROW-9419
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


I observed some bugs in the implementation that I wrote yesterday, so I am 
adding tests to cover them.




[jira] [Created] (ARROW-9412) [C++] Add non-BUNDLED dependencies to exported INSTALL_INTERFACE_LIBS of arrow_static and test that it works

2020-07-10 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9412:
---

 Summary: [C++] Add non-BUNDLED dependencies to exported 
INSTALL_INTERFACE_LIBS of arrow_static and test that it works
 Key: ARROW-9412
 URL: https://issues.apache.org/jira/browse/ARROW-9412
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


As a companion project to ARROW-7605, we must document and test a workflow for 
statically linking with external static dependencies.

When a dependency is not built as BUNDLED, it can be added to 
"ARROW_STATIC_INSTALL_INTERFACE_LIBS" so that it's included in 
ArrowTargets-*.cmake. The third-party project must of course configure the 
dependent CMake targets.

Prior to the patch for ARROW-7605, toolchain libraries were added 
unconditionally to ARROW_STATIC_INSTALL_INTERFACE_LIBS whether BUNDLED or not 
(including our private jemalloc), creating a broken CMake "arrow_static" 
target. This patch partially reverts these changes to enable static 
linking with external toolchain libraries without breaking the BUNDLED static 
builds. Finally, this must be tested in a manner similar to 
cpp/examples/minimal_build/run_static.sh so that we can verify that each of the 
build/link scenarios is working correctly.




[jira] [Created] (ARROW-9400) [Python] Do not depend on conda-forge static libraries in Windows wheel builds

2020-07-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9400:
---

 Summary: [Python] Do not depend on conda-forge static libraries in 
Windows wheel builds
 Key: ARROW-9400
 URL: https://issues.apache.org/jira/browse/ARROW-9400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Based on 
https://github.com/conda-forge/cfep/blob/e9bb3f58eca79107baede71cb9b05311705a10f2/cfep-18.md
 it appears that, in the future, static libraries may not be included in many 
of the packages that we use for building the Windows Python wheels. We should 
change the build to use BUNDLED builds so we don't have this issue.




[jira] [Created] (ARROW-9399) [C++] Add forward compatibility checks for unrecognized future MetadataVersion

2020-07-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9399:
---

 Summary: [C++] Add forward compatibility checks for unrecognized 
future MetadataVersion
 Key: ARROW-9399
 URL: https://issues.apache.org/jira/browse/ARROW-9399
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


We should have no need of these checks in theory, but they provide a safeguard 
should it become necessary, some years in the future, to increment the 
MetadataVersion. 
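
A minimal sketch of what such a check could look like, written in Python for brevity. The enum values, `MAX_SUPPORTED`, and `check_metadata_version` are hypothetical names for illustration and are not Arrow's API. The idea is only that a reader refuses any version newer than the newest one it knows about, rather than misinterpreting the data.

```python
from enum import IntEnum


class MetadataVersion(IntEnum):
    # Illustrative numbering only; not taken from Arrow's format files.
    V1 = 0
    V2 = 1
    V3 = 2
    V4 = 3
    V5 = 4


# Newest version this hypothetical reader understands.
MAX_SUPPORTED = MetadataVersion.V5


def check_metadata_version(raw_version: int) -> MetadataVersion:
    """Return the parsed version, or raise if it is newer than we support."""
    if raw_version > MAX_SUPPORTED:
        raise ValueError(
            f"Unrecognized MetadataVersion {raw_version}; "
            f"this reader supports up to {MAX_SUPPORTED.name}"
        )
    # Values below the known range are rejected by the enum constructor.
    return MetadataVersion(raw_version)
```

Failing early with a clear message is the point of the safeguard: an old reader that silently accepted a future version could produce wrong results instead of an error.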




[jira] [Created] (ARROW-9396) [Python] Expose CpuInfo for informational / debugging purposes

2020-07-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9396:
---

 Summary: [Python] Expose CpuInfo for informational / debugging 
purposes
 Key: ARROW-9396
 URL: https://issues.apache.org/jira/browse/ARROW-9396
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This would help to see what CpuInfo says about the current processor




[jira] [Created] (ARROW-9395) [Python] Provide configurable MetadataVersion in IPC API and environment variable to set default to V4 when needed

2020-07-09 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9395:
---

 Summary: [Python] Provide configurable MetadataVersion in IPC API 
and environment variable to set default to V4 when needed
 Key: ARROW-9395
 URL: https://issues.apache.org/jira/browse/ARROW-9395
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


This is a follow-up to ARROW-9265 and must be implemented in order to release 
1.0.0.




[jira] [Created] (ARROW-9379) [Rust] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9379:
---

 Summary: [Rust] Support unsigned dictionary indices
 Key: ARROW-9379
 URL: https://issues.apache.org/jira/browse/ARROW-9379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Wes McKinney







[jira] [Created] (ARROW-9378) [Go] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9378:
---

 Summary: [Go] Support unsigned dictionary indices
 Key: ARROW-9378
 URL: https://issues.apache.org/jira/browse/ARROW-9378
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Wes McKinney







[jira] [Created] (ARROW-9377) [Java] Support unsigned dictionary indices

2020-07-08 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9377:
---

 Summary: [Java] Support unsigned dictionary indices
 Key: ARROW-9377
 URL: https://issues.apache.org/jira/browse/ARROW-9377
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Wes McKinney


child of ARROW-9259




[jira] [Created] (ARROW-9348) [C++] Replace usages of TestBase::MakeRandomArray in testing/gtest_util.h with RandomArrayGenerator

2020-07-07 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9348:
---

 Summary: [C++] Replace usages of TestBase::MakeRandomArray in 
testing/gtest_util.h with RandomArrayGenerator 
 Key: ARROW-9348
 URL: https://issues.apache.org/jira/browse/ARROW-9348
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This would improve code cleanliness.




[jira] [Created] (ARROW-9326) [Python] Setuptools 49.1.0 appears to break our Python 3.6 builds

2020-07-04 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9326:
---

 Summary: [Python] Setuptools 49.1.0 appears to break our Python 
3.6 builds
 Key: ARROW-9326
 URL: https://issues.apache.org/jira/browse/ARROW-9326
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


Not sure who thought it was a good idea to release setuptools on July 3, a 
holiday in the United States, but it appears to be breaking some of our builds

https://github.com/apache/arrow/pull/7539/checks?check_run_id=835994558

{code}
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 297, in run
    self.find_sources()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 304, in find_sources
    mm.run()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 535, in run
    self.add_defaults()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 571, in add_defaults
    sdist.add_defaults(self)
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/_distutils/command/sdist.py", line 228, in add_defaults
    self._add_defaults_ext()
  File "/opt/conda/envs/arrow/lib/python3.6/site-packages/setuptools/_distutils/command/sdist.py", line 312, in _add_defaults_ext
    self.filelist.extend(build_ext.get_source_files())
  File "/opt/conda/envs/arrow/lib/python3.6/distutils/command/build_ext.py", line 420, in get_source_files
    self.check_extensions_list(self.extensions)
  File "/opt/conda/envs/arrow/lib/python3.6/distutils/command/build_ext.py", line 362, in check_extensions_list
    "each element of 'ext_modules' option must be an "
distutils.errors.DistutilsSetupError: each element of 'ext_modules' option must be an Extension instance or 2-tuple
{code}




[jira] [Created] (ARROW-9305) [Python] Dependency load failure in Windows wheel build

2020-07-02 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9305:
---

 Summary: [Python] Dependency load failure in Windows wheel build
 Key: ARROW-9305
 URL: https://issues.apache.org/jira/browse/ARROW-9305
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 1.0.0


The Windows wheels are experiencing a DLL load failure probably due to one of 
the dependencies




[jira] [Created] (ARROW-9304) [C++] Add "AppendEmptyValue" builder APIs for use inside StructBuilder::AppendNull

2020-07-02 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9304:
---

 Summary: [C++] Add "AppendEmptyValue" builder APIs for use inside 
StructBuilder::AppendNull
 Key: ARROW-9304
 URL: https://issues.apache.org/jira/browse/ARROW-9304
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


StructBuilder should probably also add "UnsafeAppendNull" so that there is the 
option of using the Unsafe* operations on the children
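
To illustrate why a struct builder wants an "append empty value" operation on its children, here is a toy sketch in Python. The classes and method names are hypothetical stand-ins, not the Arrow C++ builders, and it assumes an "empty value" means a valid slot with placeholder content. When the struct appends a null row, each child still needs one more slot so that all lengths stay aligned, but the null itself is recorded only in the struct's own validity.

```python
class Int64Builder:
    """Toy child builder: parallel lists of values and validity flags."""

    def __init__(self):
        self.values = []
        self.validity = []

    def append(self, value):
        self.values.append(value)
        self.validity.append(True)

    def append_null(self):
        self.values.append(0)
        self.validity.append(False)

    def append_empty_value(self):
        # A valid slot with placeholder content; it only keeps lengths aligned.
        self.values.append(0)
        self.validity.append(True)


class StructBuilder:
    """Toy struct builder that owns one child builder per field."""

    def __init__(self, children):
        self.children = children
        self.validity = []

    def append(self, row):
        for child, value in zip(self.children, row):
            child.append(value)
        self.validity.append(True)

    def append_null(self):
        # The struct-level validity marks the row as null; the children
        # receive an empty slot rather than a null of their own.
        for child in self.children:
            child.append_empty_value()
        self.validity.append(False)
```

An "unsafe" variant, as suggested above, would do the same work but skip capacity checks on the children, on the assumption that the caller reserved space beforehand.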




[jira] [Created] (ARROW-9287) [C++] Implement support for unsigned dictionary indices

2020-06-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9287:
---

 Summary: [C++] Implement support for unsigned dictionary indices
 Key: ARROW-9287
 URL: https://issues.apache.org/jira/browse/ARROW-9287
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


Follow on work from ARROW-9259




[jira] [Created] (ARROW-9286) [C++] Add function "aliases" to compute::FunctionRegistry

2020-06-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9286:
---

 Summary: [C++] Add function "aliases" to compute::FunctionRegistry
 Key: ARROW-9286
 URL: https://issues.apache.org/jira/browse/ARROW-9286
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


The purpose of aliases would be to avoid breaking APIs when/if functions are 
renamed in between releases
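
A rough sketch of the alias idea, in Python rather than C++ and with hypothetical method names (`add_function`, `add_alias`, `get_function`) that are not claimed to match the actual registry. An alias is a second name that resolves to the same function object, so callers using the old name keep working after a rename.

```python
class FunctionRegistry:
    """Toy registry mapping names to callables, with optional aliases."""

    def __init__(self):
        self._functions = {}
        self._aliases = {}

    def _check_name_free(self, name):
        if name in self._functions or name in self._aliases:
            raise KeyError(f"Name already registered: {name}")

    def add_function(self, name, func):
        self._check_name_free(name)
        self._functions[name] = func

    def add_alias(self, alias, target):
        # An alias must point at an existing canonical name.
        if target not in self._functions:
            raise KeyError(f"No function registered with name: {target}")
        self._check_name_free(alias)
        self._aliases[alias] = target

    def get_function(self, name):
        # Resolve the alias (if any) to its canonical name first.
        canonical = self._aliases.get(name, name)
        if canonical not in self._functions:
            raise KeyError(f"No function registered with name: {name}")
        return self._functions[canonical]


registry = FunctionRegistry()
registry.add_function("add", lambda a, b: a + b)
# Suppose "add" used to be called "plus" in an earlier release.
registry.add_alias("plus", "add")
```

Keeping aliases in a separate map means the canonical name remains the single source of truth, and an alias can be dropped later without touching the function itself.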




[jira] [Created] (ARROW-9285) [C++] Detect unauthorized memory allocations in function kernels

2020-06-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9285:
---

 Summary: [C++] Detect unauthorized memory allocations in function 
kernels
 Key: ARROW-9285
 URL: https://issues.apache.org/jira/browse/ARROW-9285
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


If a function has been configured to preallocate space, then executing the 
kernel should not replace the preallocated buffer during execution -- this is 
an implementation error. Detecting this would be relatively easy and improve 
debugging for kernel implementers
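
One way the detection could work, sketched in Python with hypothetical names (`Output`, `execute_preallocated`); this is not the Arrow executor. The executor remembers which buffer it preallocated, runs the kernel, and then checks that the output still refers to that same buffer.

```python
class Output:
    """Holds the buffer a kernel is expected to write into."""

    def __init__(self, buffer):
        self.buffer = buffer


def execute_preallocated(kernel, values, out_size):
    out = Output(bytearray(out_size))
    preallocated = out.buffer  # remember the identity of the buffer
    kernel(values, out)
    if out.buffer is not preallocated:
        # The kernel allocated its own output instead of filling ours.
        raise RuntimeError("Kernel replaced its preallocated output buffer")
    return out.buffer


def good_kernel(values, out):
    # Writes in place into the buffer it was given.
    for i, v in enumerate(values):
        out.buffer[i] = v + 1


def bad_kernel(values, out):
    # Implementation error: swaps in a freshly allocated buffer.
    out.buffer = bytearray(v + 1 for v in values)
```

In C++ the analogous check would presumably compare buffer pointers before and after execution, possibly only in debug builds so that release builds pay nothing for it.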




[jira] [Created] (ARROW-9278) [C++] Implement Union validity bitmap changes from ARROW-9222

2020-06-30 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9278:
---

 Summary: [C++] Implement Union validity bitmap changes from 
ARROW-9222
 Key: ARROW-9278
 URL: https://issues.apache.org/jira/browse/ARROW-9278
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0







[jira] [Created] (ARROW-9270) [C++] Create Buffer to represent a slice of a ResizableBuffer that may yet reallocate

2020-06-29 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9270:
---

 Summary: [C++] Create Buffer to represent a slice of a 
ResizableBuffer that may yet reallocate
 Key: ARROW-9270
 URL: https://issues.apache.org/jira/browse/ARROW-9270
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney


One problem with slicing ResizableBuffer is that slices of a buffer that is 
still changing may be invalidated after a resize operation. I'd be interested 
in having a way to slice a ResizableBuffer where the slice is still usable 
after a resize. This would presume, of course, that the code responsible for 
the parent and child buffers behaves appropriately (e.g. if you call 
{{child->data()}} and then resize the parent, the pointer may become invalid).
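
The following Python sketch illustrates the idea under simplifying assumptions; `ResizableBuffer` and `ParentTrackingSlice` here are toy classes, not Arrow types. The slice stores a reference to its parent plus an offset and length, and it looks up the parent's current storage every time `data()` is called instead of caching a pointer.

```python
class ResizableBuffer:
    """Toy buffer whose storage is replaced on every resize."""

    def __init__(self, size):
        self._data = bytearray(size)

    def resize(self, new_size):
        # Swap in new storage to mimic a reallocation that moves the data.
        new_data = bytearray(new_size)
        n = min(new_size, len(self._data))
        new_data[:n] = self._data[:n]
        self._data = new_data

    def data(self):
        return self._data


class ParentTrackingSlice:
    """Slice that re-resolves its view from the parent on each access."""

    def __init__(self, parent, offset, length):
        self._parent = parent
        self.offset = offset
        self.length = length

    def data(self):
        storage = self._parent.data()
        if self.offset + self.length > len(storage):
            raise IndexError("parent was shrunk below the extent of this slice")
        return memoryview(storage)[self.offset:self.offset + self.length]


parent = ResizableBuffer(4)
parent.data()[:] = b"abcd"
child = ParentTrackingSlice(parent, 1, 2)
before = bytes(child.data())
parent.resize(1024)
after = bytes(child.data())  # still valid: resolved against the new storage
```

The caveat from the description still applies: a view obtained from `child.data()` before the resize refers to the old storage, so callers must fetch it again after any resize of the parent.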




[jira] [Created] (ARROW-9265) [C++] Add support for writing MetadataVersion::V4-compatible IPC messages for compatibility with library versions <= 0.17.1

2020-06-29 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9265:
---

 Summary: [C++] Add support for writing 
MetadataVersion::V4-compatible IPC messages for compatibility with library 
versions <= 0.17.1
 Key: ARROW-9265
 URL: https://issues.apache.org/jira/browse/ARROW-9265
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


While we need to increment the MetadataVersion, we should not strand old 
library versions since V4 is backward compatible with V5. 




[jira] [Created] (ARROW-9260) [CI] "ARM64v8 Ubuntu 20.04 C++" fails

2020-06-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9260:
---

 Summary: [CI] "ARM64v8 Ubuntu 20.04 C++" fails
 Key: ARROW-9260
 URL: https://issues.apache.org/jira/browse/ARROW-9260
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Wes McKinney
 Fix For: 1.0.0


This GHA build should be disabled until it is passing reliably, e.g. 
https://github.com/apache/arrow/runs/816007838. This seems to be similar to the 
Travis CI failure




[jira] [Created] (ARROW-9259) [Format] Permit unsigned dictionary indices in Columnar.rst

2020-06-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9259:
---

 Summary: [Format] Permit unsigned dictionary indices in 
Columnar.rst
 Key: ARROW-9259
 URL: https://issues.apache.org/jira/browse/ARROW-9259
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0







[jira] [Created] (ARROW-9258) [Format] Add V5 MetadataVersion

2020-06-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9258:
---

 Summary: [Format] Add V5 MetadataVersion
 Key: ARROW-9258
 URL: https://issues.apache.org/jira/browse/ARROW-9258
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
 Fix For: 1.0.0


Per mailing list discussion




[jira] [Created] (ARROW-9257) [CI] Fix Travis CI builds

2020-06-28 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9257:
---

 Summary: [CI] Fix Travis CI builds
 Key: ARROW-9257
 URL: https://issues.apache.org/jira/browse/ARROW-9257
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney


These are being allowed to fail on master. I am not sure what's wrong with them




[jira] [Created] (ARROW-9254) [C++] Factor out some integer casting internals so it can be reused with temporal casts

2020-06-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9254:
---

 Summary: [C++] Factor out some integer casting internals so it can 
be reused with temporal casts
 Key: ARROW-9254
 URL: https://issues.apache.org/jira/browse/ARROW-9254
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


The "CastNumberToNumberUnsafe" function can be shared outside of 
scalar_cast_numeric.cc




[jira] [Created] (ARROW-9253) [C++] Add vectorized "IntegersMultipleOf" to arrow/util/int_util.h

2020-06-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9253:
---

 Summary: [C++] Add vectorized "IntegersMultipleOf" to 
arrow/util/int_util.h
 Key: ARROW-9253
 URL: https://issues.apache.org/jira/browse/ARROW-9253
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


There are various places where we check whether the integers in an array are 
all multiples of another number (e.g. a multiple of 86400000 milliseconds per 
day). It would be better to factor out this data check into a reusable function 
similar to the {{CheckIntegersInRange}} function.
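
A minimal sketch of such a reusable check, in Python and with a hypothetical name (`check_integers_multiple_of`). It is a plain scalar loop over a list in which `None` stands for a null, so it shows only the contract of the check and not the vectorized implementation proposed in the title.

```python
MILLIS_PER_DAY = 86_400_000


def check_integers_multiple_of(values, factor):
    """Raise ValueError at the first non-null value that is not a multiple."""
    if factor == 0:
        raise ValueError("factor must be nonzero")
    for i, value in enumerate(values):
        # Nulls (None) carry no value, so they are skipped.
        if value is not None and value % factor != 0:
            raise ValueError(
                f"Value {value} at index {i} is not a multiple of {factor}"
            )


# Example: millisecond values that all fall on whole-day boundaries pass.
check_integers_multiple_of([0, 86_400_000, None, -172_800_000], MILLIS_PER_DAY)
```

Reporting the offending value and its index in the error keeps the message useful at every call site that shares the function.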




[jira] [Created] (ARROW-9252) [Integration] GitHub Actions integration test job does not test against "gold" 0.14.1 files in apache/arrow-testing

2020-06-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9252:
---

 Summary: [Integration] GitHub Actions integration test job does 
not test against "gold" 0.14.1 files in apache/arrow-testing
 Key: ARROW-9252
 URL: https://issues.apache.org/jira/browse/ARROW-9252
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 1.0.0


I'm not sure when and why this was dropped but it is critical that these tests 
from 

https://github.com/apache/arrow/commit/26d72f328b82bcff4e074109a5f905ebf069a416#diff-776ea3bf11df5829827f7afb43c37174

are restored




[jira] [Created] (ARROW-9251) [C++] Move JSON testing code for integration tests to libarrow_testing

2020-06-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9251:
---

 Summary: [C++] Move JSON testing code for integration tests to 
libarrow_testing
 Key: ARROW-9251
 URL: https://issues.apache.org/jira/browse/ARROW-9251
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


This code contributes over 700KB to release builds and is never used

{code}
-rw--- 1 wesm wesm  34104 Jun 27 11:14 dictionary.cc.o
-rw--- 1 wesm wesm 199592 Jun 27 11:14 feather.cc.o
-rw--- 1 wesm wesm  63448 Jun 27 11:14 json_integration.cc.o
-rw--- 1 wesm wesm 727336 Jun 27 11:14 json_internal.cc.o
-rw--- 1 wesm wesm 828056 Jun 27 11:14 json_simple.cc.o
-rw--- 1 wesm wesm 185344 Jun 27 11:14 message.cc.o
-rw--- 1 wesm wesm 223592 Jun 27 11:14 metadata_internal.cc.o
-rw--- 1 wesm wesm   3416 Jun 27 11:14 options.cc.o
-rw--- 1 wesm wesm 557960 Jun 27 11:14 reader.cc.o
-rw--- 1 wesm wesm 285744 Jun 27 11:14 writer.cc.o
{code}




[jira] [Created] (ARROW-9250) [C++] Compact generated code in compute/kernels/scalar_set_lookup.cc using same method as vector_hash.cc

2020-06-27 Thread Wes McKinney (Jira)

Wes McKinney created ARROW-9250:
---

 Summary: [C++] Compact generated code in 
compute/kernels/scalar_set_lookup.cc using same method as vector_hash.cc
 Key: ARROW-9250
 URL: https://issues.apache.org/jira/browse/ARROW-9250
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 1.0.0


This module can be made to compile smaller and faster by using common kernels 
for types having the same binary representation
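
The general technique can be sketched as follows, in Python and with hypothetical names; this is not the code in vector_hash.cc. Logical types are mapped to their byte width, and one kernel is instantiated per width rather than per logical type. Floating-point types are deliberately left out of this toy mapping, on the assumption that bitwise equality is not the right comparison for them (for example, 0.0 and -0.0 differ bitwise).

```python
import struct

# Logical type name -> byte width of its physical representation
# (an illustrative subset of integer-backed types).
PHYSICAL_WIDTH = {
    "int32": 4, "uint32": 4, "date32": 4, "time32": 4,
    "int64": 8, "uint64": 8, "date64": 8, "timestamp": 8,
}


def make_is_in_kernel(width):
    """Build a set-membership kernel that compares fixed-width byte chunks."""

    def is_in(raw_values, raw_value_set):
        lookup = {
            raw_value_set[i:i + width]
            for i in range(0, len(raw_value_set), width)
        }
        return [
            raw_values[i:i + width] in lookup
            for i in range(0, len(raw_values), width)
        ]

    return is_in


# One kernel per distinct width, shared by every type of that width.
KERNELS = {w: make_is_in_kernel(w) for w in set(PHYSICAL_WIDTH.values())}


def get_kernel(type_name):
    return KERNELS[PHYSICAL_WIDTH[type_name]]


data = struct.pack("<3i", 1, 2, 3)
value_set = struct.pack("<2i", 2, 5)
print(get_kernel("date32")(data, value_set))  # prints [False, True, False]
```

Here eight logical types share two kernels. In C++ the equivalent saving would come from instantiating a template once per physical type instead of once per logical type.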



