[jira] [Created] (ARROW-10129) [Rust] Cargo build is rebuilding dependencies on arrow changes

2020-09-29 Thread Jorge (Jira)
Jorge created ARROW-10129:
-

 Summary: [Rust] Cargo build is rebuilding dependencies on arrow 
changes
 Key: ARROW-10129
 URL: https://issues.apache.org/jira/browse/ARROW-10129
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Affects Versions: 1.0.1
Reporter: Jorge
 Fix For: 2.0.0


There is a potential issue in the dependencies that causes Rust to rebuild them
on changes to the arrow crate.

I was unable to fully grasp what is going on, but this seems to be a
resurfacing of ARROW-9600.





[jira] [Created] (ARROW-10128) [Rust] Dictionary-encoding is out of spec

2020-09-28 Thread Jorge (Jira)
Jorge created ARROW-10128:
-

 Summary: [Rust] Dictionary-encoding is out of spec
 Key: ARROW-10128
 URL: https://issues.apache.org/jira/browse/ARROW-10128
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Jorge


According to [the
spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout],
every array can be dictionary-encoded, whereby its values are represented as
indices into a unique set of values.

However, none of our arrays supports this encoding, and the physical memory
layout of this encoding is not fulfilled.

We have a DictionaryArray, but, AFAIK, it does not respect the physical memory
layout set out by the spec.
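
For reference, a toy illustration of the layout the spec describes (plain Rust,
no arrow APIs; the values are made up):

{code}
fn main() {
    // Logical array: ["foo", "bar", "foo", null, "bar"]
    // Spec layout: an indices ("keys") array pointing into a unique value set.
    let dictionary = vec!["foo", "bar"]; // unique values
    let keys: Vec<Option<i32>> = vec![Some(0), Some(1), Some(0), None, Some(1)];

    // Reconstructing the logical values from keys + dictionary:
    let logical: Vec<Option<&str>> = keys
        .iter()
        .map(|k| k.map(|i| dictionary[i as usize]))
        .collect();
    assert_eq!(
        logical,
        vec![Some("foo"), Some("bar"), Some("foo"), None, Some("bar")]
    );
}
{code}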





[jira] [Created] (ARROW-10113) [Rust] Implement conversion of array::Array to ArrowArray

2020-09-27 Thread Jorge (Jira)
Jorge created ARROW-10113:
-

 Summary: [Rust] Implement conversion of array::Array to ArrowArray
 Key: ARROW-10113
 URL: https://issues.apache.org/jira/browse/ARROW-10113
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Jorge








[jira] [Created] (ARROW-10112) [Rust] Implement conversion of ArrowArray to array::Array

2020-09-27 Thread Jorge (Jira)
Jorge created ARROW-10112:
-

 Summary: [Rust] Implement conversion of ArrowArray to array::Array
 Key: ARROW-10112
 URL: https://issues.apache.org/jira/browse/ARROW-10112
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Jorge








[jira] [Created] (ARROW-10111) [Rust] Add basics to test round-trip of consumption to a Rust struct

2020-09-27 Thread Jorge (Jira)
Jorge created ARROW-10111:
-

 Summary: [Rust] Add basics to test round-trip of consumption to a 
Rust struct
 Key: ARROW-10111
 URL: https://issues.apache.org/jira/browse/ARROW-10111
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Jorge








[jira] [Created] (ARROW-10110) [Rust] Add support to consume C Data Interface

2020-09-27 Thread Jorge (Jira)
Jorge created ARROW-10110:
-

 Summary: [Rust] Add support to consume C Data Interface
 Key: ARROW-10110
 URL: https://issues.apache.org/jira/browse/ARROW-10110
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Jorge
Assignee: Jorge
 Fix For: 3.0.0








[jira] [Created] (ARROW-10109) [Rust] Add support to produce and consume a C Data interface

2020-09-27 Thread Jorge (Jira)
Jorge created ARROW-10109:
-

 Summary: [Rust] Add support to produce and consume a C Data 
interface
 Key: ARROW-10109
 URL: https://issues.apache.org/jira/browse/ARROW-10109
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge
Assignee: Jorge
 Fix For: 3.0.0


The goal of this issue is to support importing C Data arrays into Rust via FFI.

 

The use-case that motivated this issue was the possibility of running
DataFusion from Python and supporting moving arrays from DataFusion to
Python/pyarrow and vice-versa.

In particular, so that users can write Python UDFs that expect arrow arrays and
return arrow arrays, in the same spirit as pandas UDFs work in Spark.

The brute-force way of writing these arrays is by converting element by element
from and to native types. The efficient way of doing it is to pass the memory
address from and to each implementation, which is zero-copy.

To support the latter, we need an FFI implementation in Rust that produces and
consumes the [C Data
Interface|https://arrow.apache.org/docs/format/CDataInterface.html].
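
For reference, the C ABI that such an FFI layer has to mirror, transcribed from
the spec linked above into a Rust sketch (this is not existing arrow code):

{code}
use std::os::raw::c_void;

// Rust transcription of the spec's `struct ArrowArray`.
#[repr(C)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    // The producer keeps ownership; the consumer calls `release` when done.
    pub release: Option<unsafe extern "C" fn(array: *mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}
{code}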





[jira] [Created] (ARROW-10096) [Rust] [DataFusion] Remove unused code

2020-09-25 Thread Jorge (Jira)
Jorge created ARROW-10096:
-

 Summary: [Rust] [DataFusion] Remove unused code
 Key: ARROW-10096
 URL: https://issues.apache.org/jira/browse/ARROW-10096
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


The CsvBatchIterator on datafusion::datasources::csv is unused, and seems to be 
duplicated code from physical_plan::csv.





[jira] [Created] (ARROW-10086) [Rust] Migrate min_large_string -> min_string kernels

2020-09-24 Thread Jorge (Jira)
Jorge created ARROW-10086:
-

 Summary: [Rust] Migrate min_large_string -> min_string kernels
 Key: ARROW-10086
 URL: https://issues.apache.org/jira/browse/ARROW-10086
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Currently, the kernels are named 

`min_string`, `max_string`, `max_large_string`, `min_large_string`, but this is
no longer needed, as both string types now share the same generic struct.
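
A minimal sketch of the unified kernel, assuming the two string arrays share a
generic {{GenericStringArray<OffsetSize>}} (trait and type names approximate):

{code}
use arrow::array::{Array, GenericStringArray, StringOffsetSizeTrait};

// One generic kernel replaces min_string/min_large_string (max is analogous).
fn min_string<O: StringOffsetSizeTrait>(array: &GenericStringArray<O>) -> Option<&str> {
    (0..array.len())
        .filter(|&i| !array.is_null(i))
        .map(|i| array.value(i))
        .min()
}
{code}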





[jira] [Created] (ARROW-10084) [Rust] [DataFusion] Add length of large string array

2020-09-24 Thread Jorge (Jira)
Jorge created ARROW-10084:
-

 Summary: [Rust] [DataFusion] Add length of large string array
 Key: ARROW-10084
 URL: https://issues.apache.org/jira/browse/ARROW-10084
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-10065) [Rust] DRY downcasted Arrays

2020-09-22 Thread Jorge (Jira)
Jorge created ARROW-10065:
-

 Summary: [Rust] DRY downcasted Arrays
 Key: ARROW-10065
 URL: https://issues.apache.org/jira/browse/ARROW-10065
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge


Currently, we have a significant amount of repeated code to cover slight 
variations of downcasted arrays:
 * BinaryArray
 * LargeBinaryArray
 * ListArray
 * StringArray
 * LargeStringArray

The goal of this issue is to clean up the repeated code around these (one
possible shape is sketched below).
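
A hypothetical macro (not existing code) that generates the repetitive
"downcast, then run" boilerplate once, usable for every array type listed
above:

{code}
use arrow::array::{Array, StringArray};

macro_rules! downcast_apply {
    ($array:expr, $ty:ty, $f:expr) => {{
        let typed = $array
            .as_any()
            .downcast_ref::<$ty>()
            .expect("unexpected array type");
        $f(typed)
    }};
}

fn first_value(array: &dyn Array) -> String {
    downcast_apply!(array, StringArray, |a: &StringArray| a.value(0).to_string())
}
{code}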





[jira] [Created] (ARROW-10060) [Rust] MergeExec currently discards partitions with errors

2020-09-21 Thread Jorge (Jira)
Jorge created ARROW-10060:
-

 Summary: [Rust] MergeExec currently discards partitions with errors
 Key: ARROW-10060
 URL: https://issues.apache.org/jira/browse/ARROW-10060
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jorge
Assignee: Jorge


IMO they should be accounted for.





[jira] [Created] (ARROW-10048) [Rust] Error in aggregate of min/max for strings

2020-09-20 Thread Jorge (Jira)
Jorge created ARROW-10048:
-

 Summary: [Rust] Error in aggregate of min/max for strings
 Key: ARROW-10048
 URL: https://issues.apache.org/jira/browse/ARROW-10048
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


There is an error in computing the min/max of strings with nulls.





[jira] [Created] (ARROW-10046) [Rust] [DataFusion] Made `*Iterator` implement Iterator

2020-09-20 Thread Jorge (Jira)
Jorge created ARROW-10046:
-

 Summary: [Rust] [DataFusion] Made `*Iterator` implement Iterator
 Key: ARROW-10046
 URL: https://issues.apache.org/jira/browse/ARROW-10046
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Currently, we use {{next_batch() -> Result<Option<RecordBatch>>}} to iterate
over batches. However, this is very similar to the iterator pattern that Rust
offers.

This issue concerns migrating our code to {{next() ->
Option<Result<RecordBatch>>}} on the trait Iterator, so that we can leverage
the rich Rust iterator API.
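
As a toy sketch of the target shape (stand-in types, not the actual arrow
traits):

{code}
// Stand-ins for this sketch only.
struct RecordBatch;
type Result<T> = std::result::Result<T, String>;

struct Batches {
    remaining: usize,
}

impl Iterator for Batches {
    type Item = Result<RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            None // end of stream
        } else {
            self.remaining -= 1;
            Some(Ok(RecordBatch)) // errors surface as Some(Err(..))
        }
    }
}

fn main() {
    // The rich iterator API (map, filter, count, collect, ...) comes for free.
    assert_eq!(Batches { remaining: 3 }.count(), 3);
}
{code}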

 





[jira] [Created] (ARROW-10044) [Rust] Improve README

2020-09-19 Thread Jorge (Jira)
Jorge created ARROW-10044:
-

 Summary: [Rust] Improve README
 Key: ARROW-10044
 URL: https://issues.apache.org/jira/browse/ARROW-10044
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-10042) [Rust] Buffer equalities are incorrect

2020-09-19 Thread Jorge (Jira)
Jorge created ARROW-10042:
-

 Summary: [Rust] Buffer equalities are incorrect
 Key: ARROW-10042
 URL: https://issues.apache.org/jira/browse/ARROW-10042
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 1.0.1
Reporter: Jorge


Two (byte) buffers are equal if their contents are equal.

However, currently, {{BufferData}} comparison ({{PartialEq}}) uses its 
{{capacity}} as part of the comparison. Therefore, two slices are considered 
different even when their content is the same but the corresponding 
{{BufferData}} has a different capacity.

Since this equality is used by {{Buffer}}'s equality, which is used by all our 
arrays, currently arrays are different when their contents are the same but the 
underlying capacity is different. I am not sure that this is what we want.
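
A toy sketch of the proposed semantics (plain Rust types, not arrow's):
equality looks at the first {{len}} bytes only and ignores the capacity:

{code}
struct ToyBufferData {
    data: Vec<u8>, // the allocation; may be larger than `len`
    len: usize,
}

impl PartialEq for ToyBufferData {
    fn eq(&self, other: &Self) -> bool {
        // Compare contents only; capacity is intentionally ignored.
        self.data[..self.len] == other.data[..other.len]
    }
}

fn main() {
    let a = ToyBufferData { data: vec![1, 2, 3, 0, 0], len: 3 };
    let b = ToyBufferData { data: vec![1, 2, 3], len: 3 };
    assert!(a == b); // same contents, different capacities
}
{code}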





[jira] [Created] (ARROW-10039) [Rust] Do not require memory alignment of buffers

2020-09-18 Thread Jorge (Jira)
Jorge created ARROW-10039:
-

 Summary: [Rust] Do not require memory alignment of buffers
 Key: ARROW-10039
 URL: https://issues.apache.org/jira/browse/ARROW-10039
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Currently, all buffers in Rust are created aligned. We also panic when a buffer
is not aligned. However, [the
spec|https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding]
only recommends, not requires, memory alignment.

This issue aims at following the spec more closely, by removing the requirement 
(not panicking) when memory passed to a buffer is not aligned.





[jira] [Created] (ARROW-10036) [Rust] [DataFusion] Test that the final schema is expected in integration tests

2020-09-17 Thread Jorge (Jira)
Jorge created ARROW-10036:
-

 Summary: [Rust] [DataFusion] Test that the final schema is 
expected in integration tests
 Key: ARROW-10036
 URL: https://issues.apache.org/jira/browse/ARROW-10036
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Currently, our integration tests convert a RecordBatch to a string, which we
use for testing, but they do not test that the final schema matches our
expectations.

We should add a test for this, which includes:
 # field name
 # field type
 # field nullability

for every field in the schema.
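
A hedged sketch of such an assertion (the field names and types below are made
up for illustration):

{code}
use arrow::datatypes::{DataType, Field, Schema};

fn assert_final_schema(actual: &Schema) {
    // name, type and nullability are all checked by Schema/Field equality.
    let expected = Schema::new(vec![
        Field::new("c1", DataType::Utf8, false),
        Field::new("c2", DataType::Int64, true),
    ]);
    assert_eq!(actual, &expected);
}
{code}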





[jira] [Created] (ARROW-10030) [Rust] Support fromIter and toIter

2020-09-17 Thread Jorge (Jira)
Jorge created ARROW-10030:
-

 Summary: [Rust] Support fromIter and toIter
 Key: ARROW-10030
 URL: https://issues.apache.org/jira/browse/ARROW-10030
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge


Proposal for comments: 
https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing

 

(dump of the proposal:)

Rust Arrow supports two main computational models:
 # Batch Operations, that leverage some form of vectorization
 # Element-by-element operations, that emerge in more complex operations

This document concerns element-by-element operations, which are the most common
operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers
accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # Return ArrayRef from ArrayData

 

We can split this process into 3 parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, that users can initialize, matching the downcasted
array
 # Iterate:
 ## Use for i in (0..array.len())
 ## Use Array::value(i) and Array::is_valid(i)/is_null(i)
 ## Use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # builders are usually slow due to the checks that they need to perform
 # The API is not intuitive

h2. Proposal

This proposal aims at improving this API in 2 specific ways:
 * Implement IntoIterator (via {{iter()}}) with {{Item=Option<T>}}
 * Implement {{FromIterator}} with {{Item=Option<T>}}

so that users can write:

 
{code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}
 

This results in an API that is:
 # Efficient, as it is our responsibility to provide {{FromIterator}}
implementations that efficiently populate the buffers/children etc. from an
iterator (see the toy sketch below)
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about Builders, buffers, etc., only
native Rust.
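
As a toy illustration of the {{FromIterator}} half (plain Rust, not arrow's
internals): collect {{Option<i32>}} items directly into a values vector and a
validity vector, the way an efficient implementation would populate its buffers
and null bitmap:

{code}
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl FromIterator<Option<i32>> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = Option<i32>>>(iter: I) -> Self {
        let mut values = Vec::new();
        let mut validity = Vec::new();
        for item in iter {
            validity.push(item.is_some());
            values.push(item.unwrap_or(0)); // arbitrary value behind a null slot
        }
        ToyInt32Array { values, validity }
    }
}

fn main() {
    let array: ToyInt32Array = vec![Some(1), None, Some(3)].into_iter().collect();
    assert_eq!(array.values, vec![1, 0, 3]);
    assert_eq!(array.validity, vec![true, false, true]);
}
{code}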





[jira] [Created] (ARROW-10028) [Rust] Simplify macro def_numeric_from_vec

2020-09-16 Thread Jorge (Jira)
Jorge created ARROW-10028:
-

 Summary: [Rust] Simplify macro def_numeric_from_vec
 Key: ARROW-10028
 URL: https://issues.apache.org/jira/browse/ARROW-10028
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Currently we need to pass too many parameters to it, when they can be inferred.





[jira] [Created] (ARROW-10020) [Rust] Implement math kernels

2020-09-15 Thread Jorge (Jira)
Jorge created ARROW-10020:
-

 Summary: [Rust] Implement math kernels
 Key: ARROW-10020
 URL: https://issues.apache.org/jira/browse/ARROW-10020
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge


Some of these kernels are in DataFusion, and could be migrated to the arrow
crate.

Note that packed_simd has SIMD implementations for some of these operations,
which we could leverage.

 





[jira] [Created] (ARROW-10019) [Rust] Add substring kernel

2020-09-15 Thread Jorge (Jira)
Jorge created ARROW-10019:
-

 Summary: [Rust] Add substring kernel
 Key: ARROW-10019
 URL: https://issues.apache.org/jira/browse/ARROW-10019
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge
Assignee: Jorge


substring returns a substring of a StringArray starting at a given index, and
with a given optional length:

{{fn substring(array: &StringArray, start: i32, length: &Option<u32>) -> Result<StringArray>}}

This operation is common on strings, and it is useful for string-based
transformations.
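
A hedged sketch of the intended semantics on plain Rust strings (hypothetical
helper with ASCII-only byte slicing; the real kernel would operate on the
array's offsets and null bitmap):

{code}
fn substring(values: &[Option<&str>], start: usize, length: Option<usize>) -> Vec<Option<String>> {
    values
        .iter()
        .map(|v| {
            v.map(|s| {
                let begin = start.min(s.len());
                let end = match length {
                    Some(l) => (begin + l).min(s.len()),
                    None => s.len(),
                };
                s[begin..end].to_string()
            })
        })
        .collect()
}

fn main() {
    let out = substring(&[Some("hello"), None], 1, Some(3));
    assert_eq!(out, vec![Some("ell".to_string()), None]);
}
{code}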





[jira] [Created] (ARROW-10016) [Rust] [DataFusion] Implement IsNull and IsNotNull

2020-09-15 Thread Jorge (Jira)
Jorge created ARROW-10016:
-

 Summary: [Rust] [DataFusion] Implement IsNull and IsNotNull
 Key: ARROW-10016
 URL: https://issues.apache.org/jira/browse/ARROW-10016
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Jorge


Currently, DataFusion has the logical operators `IsNull` and `IsNotNull`, but
they have no physical implementation. Consequently, these operators cannot be
used.

The goal of this improvement is to add support for these operators on the
physical plan.

Note that they only care about the null bitmap, and thus should be
implementable for all types supported by Arrow.

Both operators should return a non-null `BooleanArray`.
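
A minimal sketch of the physical operation, using only the {{Array}} trait's
null accessors:

{code}
use arrow::array::{Array, BooleanArray};

// One boolean per slot; the result itself carries no nulls.
fn is_null(array: &dyn Array) -> BooleanArray {
    BooleanArray::from((0..array.len()).map(|i| array.is_null(i)).collect::<Vec<bool>>())
}

fn is_not_null(array: &dyn Array) -> BooleanArray {
    BooleanArray::from((0..array.len()).map(|i| array.is_valid(i)).collect::<Vec<bool>>())
}
{code}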

 





[jira] [Created] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum

2020-09-15 Thread Jorge (Jira)
Jorge created ARROW-10015:
-

 Summary: [Rust] Implement SIMD for aggregate kernel sum
 Key: ARROW-10015
 URL: https://issues.apache.org/jira/browse/ARROW-10015
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge


Currently, our aggregations are made in a simple loop. However, as described
[here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html],
horizontal operations can also be SIMDed, with reports of 2.7x speedups.

The goal of this improvement is to support SIMD for the "sum", for primitive 
types.

The code to modify is
[here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs].
A good indication that this issue is completed is when the script

{{cargo bench --bench aggregate_kernels && cargo bench --bench 
aggregate_kernels --features simd}}

yields a speed-up.
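
For illustration, a hedged sketch of the vertical-then-horizontal pattern from
the linked guide, on a plain slice with the packed_simd crate (lane width
chosen arbitrarily):

{code}
use packed_simd::f32x8;

fn simd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(8);
    let remainder = chunks.remainder();

    // Vertical adds across all chunks...
    let mut acc = f32x8::splat(0.0);
    for chunk in chunks {
        acc += f32x8::from_slice_unaligned(chunk);
    }

    // ...and a single horizontal reduction at the end.
    acc.sum() + remainder.iter().sum::<f32>()
}
{code}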

 





[jira] [Created] (ARROW-10010) [Rust] Speedup arithmetic

2020-09-14 Thread Jorge (Jira)
Jorge created ARROW-10010:
-

 Summary: [Rust] Speedup arithmetic
 Key: ARROW-10010
 URL: https://issues.apache.org/jira/browse/ARROW-10010
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


There are some optimizations possible in the arithmetic kernels.

PR to follow.





[jira] [Created] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README

2020-09-14 Thread Jorge (Jira)
Jorge created ARROW-10001:
-

 Summary: [Rust] [DataFusion] Add developer guide to README
 Key: ARROW-10001
 URL: https://issues.apache.org/jira/browse/ARROW-10001
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9990) [Rust] [DataFusion] NOT is not plannable

2020-09-13 Thread Jorge (Jira)
Jorge created ARROW-9990:


 Summary: [Rust] [DataFusion] NOT is not plannable
 Key: ARROW-9990
 URL: https://issues.apache.org/jira/browse/ARROW-9990
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jorge
Assignee: Jorge


We have the physical operator, but it is not usable in the logical planning.





[jira] [Created] (ARROW-9988) [Rust] [DataFusion] Added std::ops to logical expressions

2020-09-13 Thread Jorge (Jira)
Jorge created ARROW-9988:


 Summary: [Rust] [DataFusion] Added std::ops to logical expressions
 Key: ARROW-9988
 URL: https://issues.apache.org/jira/browse/ARROW-9988
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


So that we can write {{col("a") + col("b")}}





[jira] [Created] (ARROW-9987) [Rust] [DataFusion] Improve docs of `Expr`.

2020-09-13 Thread Jorge (Jira)
Jorge created ARROW-9987:


 Summary: [Rust] [DataFusion] Improve docs of `Expr`.
 Key: ARROW-9987
 URL: https://issues.apache.org/jira/browse/ARROW-9987
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9984) [Rust] [DataFusion] DRY of function to string

2020-09-12 Thread Jorge (Jira)
Jorge created ARROW-9984:


 Summary: [Rust] [DataFusion] DRY of function to string
 Key: ARROW-9984
 URL: https://issues.apache.org/jira/browse/ARROW-9984
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9977) [Rust] Add min/max for [Large]String

2020-09-11 Thread Jorge (Jira)
Jorge created ARROW-9977:


 Summary: [Rust] Add min/max for [Large]String
 Key: ARROW-9977
 URL: https://issues.apache.org/jira/browse/ARROW-9977
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Strings are ordered, and thus we can apply min/max to them as we do for other
types.





[jira] [Created] (ARROW-9971) [Rust] Speedup take

2020-09-10 Thread Jorge (Jira)
Jorge created ARROW-9971:


 Summary: [Rust] Speedup take
 Key: ARROW-9971
 URL: https://issues.apache.org/jira/browse/ARROW-9971
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9966) [Rust] Speedup aggregate kernels

2020-09-10 Thread Jorge (Jira)
Jorge created ARROW-9966:


 Summary: [Rust] Speedup aggregate kernels
 Key: ARROW-9966
 URL: https://issues.apache.org/jira/browse/ARROW-9966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning

2020-09-09 Thread Jorge (Jira)
Jorge created ARROW-9954:


 Summary: [Rust] [DataFusion] Simplify code of aggregate planning
 Key: ARROW-9954
 URL: https://issues.apache.org/jira/browse/ARROW-9954
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry

2020-09-08 Thread Jorge (Jira)
Jorge created ARROW-9950:


 Summary: [Rust] [DataFusion] Allow UDF usage without registry
 Key: ARROW-9950
 URL: https://issues.apache.org/jira/browse/ARROW-9950
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


This is a functionality relevant only for the DataFrame API.

Sometimes a UDF declaration happens during planning, and the API becomes much
more expressive when the user does not have to access the registry at all to
plan the UDF.






[jira] [Created] (ARROW-9937) [Rust] [DataFusion] Average is not correct

2020-09-08 Thread Jorge (Jira)
Jorge created ARROW-9937:


 Summary: [Rust] [DataFusion] Average is not correct
 Key: ARROW-9937
 URL: https://issues.apache.org/jira/browse/ARROW-9937
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge


The current design of aggregates makes the calculation of the average incorrect.
It also makes it impossible to compute the [geometric 
mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other 
operations. 

The central issue is that Accumulator returns a `ScalarValue` during partial 
aggregations via {{get_value}}, but very often a `ScalarValue` is not 
sufficient information to perform the full aggregation.

A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are
distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current
calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then 
reduces them using another average, i.e.

{{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}

which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}.

I believe that our Accumulators need to pass more information from the partial 
aggregations to the final aggregation.

We could consider adopting an API equivalent to
[Spark's|https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html],
i.e. have an `update`, a `merge` and an `evaluate`.
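
A hedged sketch of that shape for the average (hypothetical methods, not
DataFusion's current Accumulator trait): the partial state is (sum, count), so
merging partials loses no information:

{code}
#[derive(Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }

    // Merging partial states keeps the full (sum, count) information.
    fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
{code}
With the batches above, the merged state is sum = x1+x2+x3+x4+x5 and count = 5,
so evaluate returns the exact mean.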





[jira] [Created] (ARROW-9922) [Rust] Add `try_from(Vec>)` to StructArray

2020-09-06 Thread Jorge (Jira)
Jorge created ARROW-9922:


 Summary: [Rust] Add `try_from(Vec>)` to 
StructArray
 Key: ARROW-9922
 URL: https://issues.apache.org/jira/browse/ARROW-9922
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9921) [Rust] Add `from(Vec<Option<&str>>)` to [Large]StringArray

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9921:


 Summary: [Rust] Add `from(Vec<Option<&str>>)` to [Large]StringArray
 Key: ARROW-9921
 URL: https://issues.apache.org/jira/browse/ARROW-9921
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


and deprecate TryFrom, which currently uses a builder.

This should yield some performance improvement, and it simplifies the
interface.





[jira] [Created] (ARROW-9919) [Rust] [DataFusion] Math functions

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9919:


 Summary: [Rust] [DataFusion] Math functions
 Key: ARROW-9919
 URL: https://issues.apache.org/jira/browse/ARROW-9919
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Jorge
Assignee: Jorge


See main issue.





[jira] [Created] (ARROW-9918) [Rust] [DataFusion] Move away from builders to create arrays

2020-09-05 Thread Jorge (Jira)
Jorge created ARROW-9918:


 Summary: [Rust] [DataFusion] Move away from builders to create 
arrays
 Key: ARROW-9918
 URL: https://issues.apache.org/jira/browse/ARROW-9918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


[~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908
that there is a performance hit when using array builders.

Indeed, I get about a 15% speedup by using {{From}} instead of a builder in
DataFusion's math operations, and this number _increases with batch size_,
which is detrimental since larger batch sizes leverage SIMD better.

Below, in {{sqrt_X_Y}}, X is the log2 of the array's size and Y is the log2 of
the batch size:

{code:java}
sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms]
change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms]
change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
Performance has improved.

sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms]
change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
Performance has improved.
{code}
See how there is no difference in the improvement between {{sqrt_20_12}} and
{{sqrt_22_12}} (same batch size), but a measurable difference between
{{sqrt_22_12}} and {{sqrt_22_14}} (different batch size).

This issue proposes that we migrate our DataFusion and Rust operations from
builder-based array construction to non-builder-based construction, whenever
possible.

Most code can easily be migrated to use the {{From}} trait; a migration is
along the lines of

 
{code:java}
let mut builder = Float64Builder::new(array.len());
for i in 0..array.len() {
    if array.is_null(i) {
        builder.append_null()?;
    } else {
        builder.append_value(array.value(i).$FUNC())?;
    }
}
Ok(Arc::new(builder.finish()))
{code}
 to

 
{code:java}
let result: Float64Array = (0..array.len())
    .map(|i| {
        if array.is_null(i) {
            None
        } else {
            Some(array.value(i).$FUNC())
        }
    })
    .collect::<Vec<Option<f64>>>()
    .into();
Ok(Arc::new(result))
{code}
and this is even simpler, as we do not need to use an extra API (Builder).

 





[jira] [Created] (ARROW-9910) [Rust] [DataFusion] Type coercion of Variadic is wrong

2020-09-03 Thread Jorge (Jira)
Jorge created ARROW-9910:


 Summary: [Rust] [DataFusion] Type coercion of Variadic is wrong
 Key: ARROW-9910
 URL: https://issues.apache.org/jira/browse/ARROW-9910
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Specifically, the case

 
{code:java}
// common type is u64
case(
    vec![DataType::UInt32, DataType::UInt64],
    Signature::Variadic(vec![DataType::UInt32, DataType::UInt64]),
    vec![DataType::UInt64, DataType::UInt64],
)?,
{code}
fails.

 





[jira] [Created] (ARROW-9902) [Rust] [DataFusion] Add support for array()

2020-09-02 Thread Jorge (Jira)
Jorge created ARROW-9902:


 Summary: [Rust] [DataFusion] Add support for array()
 Key: ARROW-9902
 URL: https://issues.apache.org/jira/browse/ARROW-9902
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


`array` is a function that takes an arbitrary number of columns and returns a
fixed-size list with their values.

This function is notoriously difficult to implement because it receives an
arbitrary number of arguments of arbitrary but common types, but it is also
useful, e.g. for time-series data.





[jira] [Created] (ARROW-9892) [Rust] [DataFusion] Add support for concat

2020-09-01 Thread Jorge (Jira)
Jorge created ARROW-9892:


 Summary: [Rust] [DataFusion] Add support for concat
 Key: ARROW-9892
 URL: https://issues.apache.org/jira/browse/ARROW-9892
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


So that we can concatenate strings together:

{{pub fn concat(args: Vec<Expr>) -> Expr}}

 





[jira] [Created] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32

2020-09-01 Thread Jorge (Jira)
Jorge created ARROW-9891:


 Summary: [Rust] [DataFusion] Make math functions support f32
 Key: ARROW-9891
 URL: https://issues.apache.org/jira/browse/ARROW-9891
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)).

The goal of this issue is to make the operation be cast(g(f32) AS f64) instead.

Since computations on f32 are faster than on f64, this is a simple optimization.

 





[jira] [Created] (ARROW-9887) [Rust] [DataFusion] Add support for complex return types of built-in functions

2020-08-30 Thread Jorge (Jira)
Jorge created ARROW-9887:


 Summary: [Rust] [DataFusion] Add support for complex return types 
of built-in functions
 Key: ARROW-9887
 URL: https://issues.apache.org/jira/browse/ARROW-9887
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9886) [Rust] [DataFusion] Simplify code to test cast

2020-08-29 Thread Jorge (Jira)
Jorge created ARROW-9886:


 Summary: [Rust] [DataFusion] Simplify code to test cast
 Key: ARROW-9886
 URL: https://issues.apache.org/jira/browse/ARROW-9886
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


We have 3 tests with similar functionality that vary only in the types they
test. Let's create a macro to apply to all of them, so that the tests are
equivalent and DRY.





[jira] [Created] (ARROW-9885) Simplify type coercion for binary types

2020-08-28 Thread Jorge (Jira)
Jorge created ARROW-9885:


 Summary: Simplify type coercion for binary types
 Key: ARROW-9885
 URL: https://issues.apache.org/jira/browse/ARROW-9885
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


The function `numerical_coercion` only uses the operator `op` for its error
formatting, but the function's intent can simply be generalized to "coerce two
types to numerically equivalent types".





[jira] [Created] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field

2020-08-25 Thread Jorge (Jira)
Jorge created ARROW-9849:


 Summary: [Rust] [DataFusion] Make UDFs not need a Field
 Key: ARROW-9849
 URL: https://issues.apache.org/jira/browse/ARROW-9849
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


[https://github.com/apache/arrow/pull/7967] shows that it is possible to not
require users to pass a `Field` to UDF declarations and instead just pass a
`DataType`.

Let's deprecate Field from them, and instead just use `DataType`.





[jira] [Created] (ARROW-9836) [Rust] [DataFusion] Improve API for usage of UDFs

2020-08-23 Thread Jorge (Jira)
Jorge created ARROW-9836:


 Summary: [Rust] [DataFusion] Improve API for usage of UDFs
 Key: ARROW-9836
 URL: https://issues.apache.org/jira/browse/ARROW-9836
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge


TL;DR: currently, users call UDFs through

{{df.select(scalar_functions("sqrt", vec![col("a")], DataType::Float64))}}

Proposal:

{{let udf = df.registry()?;}}
{{df.select(udf("sqrt", vec![col("a")])?)}}

so that they do not have to remember the UDF's return type when using it.

This API will in the future allow declaring the UDF as part of the planning,
like Spark, instead of having to register it in the registry before using it
(we just need to check whether the UDF is registered before doing so).
See the complete proposal here:
[https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing]

 





[jira] [Created] (ARROW-9835) [Rust] [DataFusion] Remove FunctionMeta

2020-08-23 Thread Jorge (Jira)
Jorge created ARROW-9835:


 Summary: [Rust] [DataFusion] Remove FunctionMeta
 Key: ARROW-9835
 URL: https://issues.apache.org/jira/browse/ARROW-9835
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge


Currently, our code works as follows:

1. Declare a UDF via {{udf::ScalarFunction}}
 2. Register the UDF
 3. call {{scalar_function("name", vec![...], type)}} to use it during planning.

However, during planning, we:

1. Get the ScalarFunction by name
 2. convert it to FunctionMetadata
 3. get the ScalarFunction associated with FunctionMetadata's name

I.e. {{FunctionMetadata}} does not seem to be needed.

Goal: remove {{FunctionMetadata}} and just pass {{udf::ScalarFunction}} 
directly from the registry to the physical plan.





[jira] [Created] (ARROW-9831) [Rust] [DataFusion] Fix compilation error

2020-08-22 Thread Jorge (Jira)
Jorge created ARROW-9831:


 Summary: [Rust] [DataFusion] Fix compilation error
 Key: ARROW-9831
 URL: https://issues.apache.org/jira/browse/ARROW-9831
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9830) [Rust] [DataFusion] Verify that projection push down does not remove aliases columns

2020-08-22 Thread Jorge (Jira)
Jorge created ARROW-9830:


 Summary: [Rust] [DataFusion] Verify that projection push down does 
not remove aliases columns
 Key: ARROW-9830
 URL: https://issues.apache.org/jira/browse/ARROW-9830
 Project: Apache Arrow
  Issue Type: Test
  Components: Rust, Rust - DataFusion
Reporter: Jorge


See comments: https://github.com/apache/arrow/pull/7919#issuecomment-678662753






[jira] [Created] (ARROW-9815) [Rust] [DataFusion] Deadlock in creation of physical plan with two udfs

2020-08-20 Thread Jorge (Jira)
Jorge created ARROW-9815:


 Summary: [Rust] [DataFusion] Deadlock in creation of physical plan 
with two udfs
 Key: ARROW-9815
 URL: https://issues.apache.org/jira/browse/ARROW-9815
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


This one took me some time to understand, but I finally have a reproducible 
example: when two udfs are called, one after the other, we cause a deadlock 
when creating the physical plan.

Example test

{code}
#[test]
fn csv_query_sqrt_sqrt() -> Result<()> {
    let mut ctx = create_ctx()?;
    register_aggregate_csv(&mut ctx)?;
    let sql = "SELECT sqrt(sqrt(c12)) FROM aggregate_test_100 LIMIT 1";
    let actual = execute(&mut ctx, sql);
    // sqrt(sqrt(c12=0.9294097332465232)) = 0.9818650561397431
    let expected = "0.9818650561397431".to_string();
    assert_eq!(actual.join("\n"), expected);
    Ok(())
}
{code}

I believe that this is due to the recursive nature of the physical planner,
which locks scalar_functions within a match, blocking the whole thing.





[jira] [Created] (ARROW-9809) [Rust] [DataFusion] logical schema = physical schema is not true

2020-08-20 Thread Jorge (Jira)
Jorge created ARROW-9809:


 Summary: [Rust] [DataFusion] logical schema = physical schema is 
not true
 Key: ARROW-9809
 URL: https://issues.apache.org/jira/browse/ARROW-9809
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge


In tests/sql.rs, we test that the physical and the optimized schema must match. 
However, this is not necessarily true for all our queries. An example:
{code:java}
#[test]
fn csv_query_sum_cast() {
    let mut ctx = ExecutionContext::new();
    register_aggregate_csv_by_sql(&mut ctx);
    // c8 = i32; c9 = i64
    let sql = "SELECT c8 + c9 FROM aggregate_test_100";
    // check that the physical and logical schemas are equal
    execute(&mut ctx, sql);
}
{code}
The physical expression (and schema) of this operation, after optimization, is 
{{CAST(c8 as Int64) Plus c9}} (this test fails).

AFAIK, the invariant of the optimizer is that the output types and nullability 
are the same.

Also, note that the reason the optimized logical schema equals the logical 
schema is that our type coercer does not change the output names of the schema, 
even though it re-writes logical expressions. I.e. after the optimization, 
`.to_field()` of an expression may no longer match the field name nor type in 
the Plan's schema. IMO this is currently by (implicit?) design, as we do not 
want our logical schema's column names to change during optimizations, or all 
column references may point to non-existent columns. This is something that was
brought up on the mailing list about polymorphism.





[jira] [Created] (ARROW-9793) [Rust] [DataFusion] Tests failing in master

2020-08-18 Thread Jorge (Jira)
Jorge created ARROW-9793:


 Summary: [Rust] [DataFusion] Tests failing in master
 Key: ARROW-9793
 URL: https://issues.apache.org/jira/browse/ARROW-9793
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge








[jira] [Created] (ARROW-9788) Handle naming inconsistencies between SQL, DataFrame API and struct names

2020-08-18 Thread Jorge (Jira)
Jorge created ARROW-9788:


 Summary: Handle naming inconsistencies between SQL, DataFrame API 
and struct names
 Key: ARROW-9788
 URL: https://issues.apache.org/jira/browse/ARROW-9788
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Andy Grove


Currently, we have naming inconsistencies between the different APIs that make
it a bit confusing. The typical example atm is that
`df.where().to_plan?.explain()` shows a "Selection" in the plan, even though
"selection" in SQL and many other query languages refers to a projection, not
a filter.

Other examples:

```
name: Selection
SQL: WHERE
DF: filter
```

```
name: Aggregation
SQL: GROUP BY
DF: aggregate
```

```
name: Projection
SQL: SELECT
DF: select,select_columns
```

I suggest that we align them with a common notation, preferably aligned with 
other more common query languages.

I am assigning this to you [~andygrove] as you are probably the only person 
that can take a decision on this.





[jira] [Created] (ARROW-9779) [Rust] [DataFusion] Increase stability of average accumulator

2020-08-18 Thread Jorge (Jira)
Jorge created ARROW-9779:


 Summary: [Rust] [DataFusion] Increase stability of average 
accumulator
 Key: ARROW-9779
 URL: https://issues.apache.org/jira/browse/ARROW-9779
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Currently, our method to compute the average is based on:

1. compute the sum of all terms
2. compute the count of all terms
3. compute sum / count

however, the sum may overflow.

There is a typical solution to this, based on an online formula described e.g.
[here|http://www.heikohoffmann.de/htmlthesis/node134.html], that keeps the
numbers small.
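
A sketch of that incremental formula, mean_n = mean_(n-1) + (x_n - mean_(n-1)) / n,
which never materializes the full sum:

{code}
fn online_mean(values: &[f64]) -> f64 {
    let mut mean = 0.0;
    for (i, x) in values.iter().enumerate() {
        // incremental update keeps the accumulator close to the data's scale
        mean += (x - mean) / (i as f64 + 1.0);
    }
    mean
}

fn main() {
    assert!((online_mean(&[1.0, 2.0, 3.0, 4.0]) - 2.5).abs() < 1e-12);
}
{code}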






[jira] [Created] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests

2020-08-17 Thread Jorge (Jira)
Jorge created ARROW-9778:


 Summary: [Rust] [DataFusion] Logical and physical schemas' 
nullability does not match in 8 out of 20 end-to-end tests
 Key: ARROW-9778
 URL: https://issues.apache.org/jira/browse/ARROW-9778
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust, Rust - DataFusion
Reporter: Jorge


In `tests/sql.rs`, if we re-write the `execute` function to test the end
schemas, as

```
/// Execute query and return result set as tab delimited string
fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<String> {
    let plan = ctx.create_logical_plan(sql).unwrap();
    let plan = ctx.optimize(&plan).unwrap();
    let physical_plan = ctx.create_physical_plan(&plan).unwrap();
    let results = ctx.collect(physical_plan.as_ref()).unwrap();
    if results.len() > 0 {
        // results must match the logical schema
        assert_eq!(plan.schema().as_ref(), results[0].schema().as_ref());
    }
    result_str(&results)
}
```

we end up with 8 tests failing, which indicates that our physical and logical 
plans are not aligned. In all cases, the issue is nullability: our logical plan 
assumes nullability = true, while our physical plan may change the nullability 
field.

If we do not plan to track nullability on the logical level, we could consider 
replacing Schema by a type that does not track nullability.





[jira] [Created] (ARROW-9756) [Rust] [DataFusion] Add support for generic return types of scalar UDFs

2020-08-16 Thread Jorge (Jira)
Jorge created ARROW-9756:


 Summary: [Rust] [DataFusion] Add support for generic return types 
of scalar UDFs
 Key: ARROW-9756
 URL: https://issues.apache.org/jira/browse/ARROW-9756
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Jorge
Assignee: Jorge


Currently, we have math functions declared as UDFs that, while they can be
evaluated against float32 and float64, have a fixed return type (float64).

This issue is about extending the UDF interface to support generic return types
(a function), so that developers (us included) can register UDFs with variable
return types.

There is a small overhead in this, since evaluating types now requires a
function call. However, IMO this is still very small compared to everything
else, as we are talking about plans. We can always apply some caching if
needed.






[jira] [Created] (ARROW-9752) [Rust] [DataFusion] Add support for Aggregate UDFs

2020-08-15 Thread Jorge (Jira)
Jorge created ARROW-9752:


 Summary: [Rust] [DataFusion] Add support for Aggregate UDFs
 Key: ARROW-9752
 URL: https://issues.apache.org/jira/browse/ARROW-9752
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


This will make it easier to extend the existing offering of aggregate
functions.

The existing functions shall be migrated to this interface.





[jira] [Created] (ARROW-9751) [Rust] [DataFusion] Extend UDFs to accept more than one type per argument

2020-08-15 Thread Jorge (Jira)
Jorge created ARROW-9751:


 Summary: [Rust] [DataFusion] Extend UDFs to accept more than one 
type per argument
 Key: ARROW-9751
 URL: https://issues.apache.org/jira/browse/ARROW-9751
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge








[jira] [Created] (ARROW-9678) [Rust] [DataFusion] Improve projection push down to remove unused columns

2020-08-09 Thread Jorge (Jira)
Jorge created ARROW-9678:


 Summary: [Rust] [DataFusion] Improve projection push down to 
remove unused columns
 Key: ARROW-9678
 URL: https://issues.apache.org/jira/browse/ARROW-9678
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust, Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Currently, the projection push down only removes columns that are never 
referenced in the plan. However, sometimes a projection declares columns that 
themselves are never used.

This issue is about improving the projection push-down to remove any column 
that is not logically required by the plan.

Failing unit-test with the idea:

{code:java}
#[test]
fn table_unused_column() -> Result<()> {
    let table_scan = test_table_scan()?;
    assert_eq!(3, table_scan.schema().fields().len());
    assert_fields_eq(&table_scan, vec!["a", "b", "c"]);

    // we never use "b" in the first projection => remove it
    let plan = LogicalPlanBuilder::from(&table_scan)
        .project(vec![col("c"), col("a"), col("b")])?
        .filter(col("c").gt(lit(1)))?
        .project(vec![col("c"), col("a")])?
        .build()?;

    assert_fields_eq(&plan, vec!["c", "a"]);

    let expected = "\
Projection: #c, #a\
\n  Selection: #c Gt Int32(1)\
\nProjection: #c, #a\
\n  TableScan: test projection=Some([0, 2])";

    assert_optimized_plan_eq(&plan, expected);

    Ok(())
}
{code}

This issue was first identified by [~andygrove]
[here|https://github.com/ballista-compute/ballista/issues/320].






[jira] [Created] (ARROW-9619) [Rust] [DataFusion] Add predicate push-down

2020-08-02 Thread Jorge (Jira)
Jorge created ARROW-9619:


 Summary: [Rust] [DataFusion] Add predicate push-down
 Key: ARROW-9619
 URL: https://issues.apache.org/jira/browse/ARROW-9619
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Like the title says: add an optimizer to push filters as far down the plan as
logically possible.





[jira] [Created] (ARROW-9618) [Rust] [DataFusion] Make it easier to write optimizers

2020-08-02 Thread Jorge (Jira)
Jorge created ARROW-9618:


 Summary: [Rust] [DataFusion] Make it easier to write optimizers
 Key: ARROW-9618
 URL: https://issues.apache.org/jira/browse/ARROW-9618
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Currently, it is a bit painful to write optimizers, as we need to cover all
branches of a match over logical plans.

There could be some utility functions to remove boilerplate code around these.






[jira] [Created] (ARROW-9617) [Rust] [DataFusion] Add length of string array

2020-08-01 Thread Jorge (Jira)
Jorge created ARROW-9617:


 Summary: [Rust] [DataFusion] Add length of string array
 Key: ARROW-9617
 URL: https://issues.apache.org/jira/browse/ARROW-9617
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


With ARROW-9615, we can port the operator to DataFusion.
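
For reference, a hedged sketch of what the ARROW-9615 kernel computes
(simplified helper; the real kernel can work directly on the value offsets):

{code}
use arrow::array::{Array, Int32Array, StringArray};

fn length(array: &StringArray) -> Int32Array {
    let values: Vec<Option<i32>> = (0..array.len())
        .map(|i| {
            if array.is_null(i) {
                None
            } else {
                Some(array.value(i).len() as i32) // length in bytes
            }
        })
        .collect();
    Int32Array::from(values)
}
{code}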





[jira] [Created] (ARROW-9615) Add kernel to compute length of string array

2020-08-01 Thread Jorge (Jira)
Jorge created ARROW-9615:


 Summary: Add kernel to compute length of string array
 Key: ARROW-9615
 URL: https://issues.apache.org/jira/browse/ARROW-9615
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9559) [Rust] [DataFusion] Revert privatization of exprlist_to_fields

2020-07-25 Thread Jorge (Jira)
Jorge created ARROW-9559:


 Summary: [Rust] [DataFusion] Revert privatization of 
exprlist_to_fields
 Key: ARROW-9559
 URL: https://issues.apache.org/jira/browse/ARROW-9559
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


This function was privatized by mistake by cd503c3f583dab.






[jira] [Created] (ARROW-9555) Add inner (hash) join physical plan

2020-07-24 Thread Jorge (Jira)
Jorge created ARROW-9555:


 Summary: Add inner (hash) join physical plan
 Key: ARROW-9555
 URL: https://issues.apache.org/jira/browse/ARROW-9555
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge








[jira] [Created] (ARROW-9549) [Rust] Parquet no longer builds

2020-07-24 Thread Jorge (Jira)
Jorge created ARROW-9549:


 Summary: [Rust] Parquet no longer builds
 Key: ARROW-9549
 URL: https://issues.apache.org/jira/browse/ARROW-9549
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Andy Grove


I believe that there is a typo in {{rust/parquet/Cargo.toml}}:

it reads {{arrow = { path = "../arrow", version = "1.0.0-SNAPSHOT", optional = 
true }}}, but it should read {{arrow = { path = "../arrow", version = 
"1.1.0-SNAPSHOT", optional = true }}}, or the project does not compile.







[jira] [Created] (ARROW-9527) [Rust] Remove un-needed dev-dependencies

2020-07-20 Thread Jorge (Jira)
Jorge created ARROW-9527:


 Summary: [Rust] Remove un-needed dev-dependencies
 Key: ARROW-9527
 URL: https://issues.apache.org/jira/browse/ARROW-9527
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


We currently have some dev-dependencies that are not needed. Let's remove them.





[jira] [Created] (ARROW-9520) [Rust] [DataFusion] Can't alias an aggregate expression

2020-07-18 Thread Jorge (Jira)
Jorge created ARROW-9520:


 Summary: [Rust] [DataFusion] Can't alias an aggregate expression
 Key: ARROW-9520
 URL: https://issues.apache.org/jira/browse/ARROW-9520
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust - DataFusion
Reporter: Jorge


The following test (on execute) fails:

{code}
#[test]
fn aggregate_with_alias() -> Result<()> {
    let results = execute("SELECT c1, COUNT(c2) AS count FROM test GROUP BY c1", 4)?;

    assert_eq!(field_names(batch), vec!["c1", "count"]);
    let expected = vec!["0,10", "1,10", "2,10", "3,10"];
    let mut rows = test::format_batch(&batch);
    rows.sort();
    assert_eq!(rows, expected);
    Ok(())
}
{code}

The root cause is that, in {{sql::planner}}, we interpret {{COUNT(c2) AS
count}} as an {{Expr::Alias}}, which fails the {{is_aggregate_expr}} condition
and is thus interpreted as a grouped expression instead of an aggregated
expression. This raises the error

{{General("Projection references non-aggregate values")}}

The planner could interpret the statement above as two steps: an aggregation
followed by a projection. Alternatively, we can allow aliases to be valid
aggregation expressions.





[jira] [Created] (ARROW-9519) [Rust] Improve error message when getting a field by name from schema

2020-07-18 Thread Jorge (Jira)
Jorge created ARROW-9519:


 Summary: [Rust] Improve error message when getting a field by name 
from schema
 Key: ARROW-9519
 URL: https://issues.apache.org/jira/browse/ARROW-9519
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Currently, when a name that does not exist on the Schema is passed to
{{Schema::index_of}}, the error message is just the name itself. This makes it
a bit difficult to debug.

I propose that we change that error message to something like

{{Unable to get field named "nickname". Valid fields: ["first_name",
"last_name", "address"]}}

to make it easier to understand.
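
One hedged way to produce such a message (illustrative snippet, not the actual
{{Schema::index_of}} body):

{code}
fn index_of(field_names: &[String], name: &str) -> Result<usize, String> {
    field_names.iter().position(|f| f.as_str() == name).ok_or_else(|| {
        format!(
            "Unable to get field named \"{}\". Valid fields: {:?}",
            name, field_names
        )
    })
}
{code}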





[jira] [Created] (ARROW-9516) [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

2020-07-17 Thread Jorge (Jira)
Jorge created ARROW-9516:


 Summary: [Rust][DataFusion] Refactor physical expressions to not 
care about their names nor indexes
 Key: ARROW-9516
 URL: https://issues.apache.org/jira/browse/ARROW-9516
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


This issue covers three main topics that IMO are addressed as a whole in a
refactor of the physical plans and expressions in DataFusion. The underlying
issues that justify this particular ticket:
h3. We currently assign poor names to the output schema.

Specifically, most names are given based on the last expression's name. 
Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the fields 
names "c, SUM, SUM".
h3. We currently derive the column names from physical expressions, not logical 
expressions

This implies that logical expressions that perform multiple operations (e.g. a
grouped aggregation that performs partitioned aggregations + merge + final
aggregation) have their name derived from their physical declaration, not 
logical. IMO a physical plan is an execution plan and is thus not concerned 
with naming. It is the logical plan that should be concerned with naming. 
Conceptually, a given logical plan can have more than one physical plan, e.g. 
depending on the execution environment (e.g. locally vs distributed).
h3. We currently carry the index of a column read throughout the plans, making 
it cumbersome to write optimizers.

More details here. In summary, it is possible to remove one of the optimizers 
and significantly simplify the other if columns do not carry indexing 
information.
h2. Proposal

I propose that we:
h3. drop {{physical_plan::expressions::Column::index}}

This is a major simplification of the code: it allows us to ignore the position
of the statement in the schema and instead focus on its name, treating columns
based solely on their names and not on their position in the schema. Since SQL
does not care about the position of the column in the table anyway (we
currently already take the first column with that name), this seems natural.

I already prototyped this 
[here|https://github.com/jorgecarleitao/arrow/tree/column_names].

The main conclusion of this prototype is that this is feasible as long as all
our expressions get assigned a unique name, which is against what we currently
offer (see example above). This leads me to:
h3. drop {{physical_plan::PhysicalExpr::name()}}

Currently, the name of an expression is derived from its physical plan.
However, some operations' names are required to be known before their physical
representation. The example I found in our current code is the grouped
aggregation described above. If we were to build the name of our aggregation
based on its physical plan, the name of a "COUNT(a)" operation would be
{{SUM(COUNT(a))}} because, in the physical plan, we first count on each
partition, then merge, and then sum the counts over all partitions.

Fundamentally, IMO the issue here is that we are mixing responsibilities: the 
physical plan should not care about naming, because the physical plan 
corresponds to an execution plan, not a logical description of the column (its 
name). This leads me to:
h3. add {{logicalplan::Expr::name()}}

This will contain the name of this expression, which will naturally depend on 
the variant. Its implementation will be based on our current code for 
{{physical_plan::PhysicalExpr::name()}}.
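
A minimal, self-contained sketch of the idea (the {{Expr}} variants below are illustrative, assuming the name-based column proposed above, not the current enum):

{code:java}
// Illustrative sketch: the variant set is hypothetical, not DataFusion's Expr.
pub enum Expr {
    Column(String),
    Literal(String),
    Not(Box<Expr>),
}

impl Expr {
    /// Returns the output name of this expression.
    pub fn name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Literal(value) => value.clone(),
            Expr::Not(inner) => format!("NOT {}", inner.name()),
        }
    }
}

fn main() {
    let e = Expr::Not(Box::new(Expr::Column("a".to_string())));
    assert_eq!(e.name(), "NOT a");
}
{code}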

I can take this work, but before committing, I would like to know your thoughts 
about this. My initial prototyping indicates that all of this is possible and 
greatly simplifies the code, but I may be missing a design aspect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9502) [Python][C++] Date64 converted to Date32 on parquet

2020-07-15 Thread Jorge (Jira)
Jorge created ARROW-9502:


 Summary: [Python][C++] Date64 converted to Date32 on parquet
 Key: ARROW-9502
 URL: https://issues.apache.org/jira/browse/ARROW-9502
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Jorge


Executing the example below, 

{code:python}
import datetime
import pyarrow as pa
import pyarrow.parquet

data = [
datetime.datetime(2000, 1, 1, 12, 34, 56, 123456), 
datetime.datetime(2000, 1, 1)
]

data32 = pa.array(data, type='date32')
data64 = pa.array(data, type='date64')
table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b'])

pyarrow.parquet.write_table(table, 'a.parquet')

print(table)
print()
print(pyarrow.parquet.read_table('a.parquet'))
{code}

yields


{code:java}
pyarrow.Table
a: date32[day]
b: date64[ms]

pyarrow.Table
a: date32[day]
b: date32[day]   <--- IMO it should be date64[ms]
{code}

indicating that pyarrow converted its date64[ms] schema to date32[day]. I used 
the Rust crate to print parquet's metadata, and the value is indeed stored as 
i32, which suggests that this likely happens on the writer, not the reader.
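
For reference, a sketch of how that metadata can be printed with the Rust {{parquet}} crate (the printer API may differ between versions):

{code:java}
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::schema::printer::print_schema;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open("a.parquet")?)?;
    // print the physical parquet schema of the file, where column `b`
    // shows up as INT32 rather than INT64
    print_schema(&mut std::io::stdout(), reader.metadata().file_metadata().schema());
    Ok(())
}
{code}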

IMO this does not have much practical implication, because both types are dates 
and any date of practical interest fits in 32 bits of days, but it still 
constitutes an error, as the round-trip serialization does not yield the same 
schema.

A broader question I have is why date64 exists in the first place: I can't see 
any reason to store a *date* in milliseconds since the epoch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9461) [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch

2020-07-14 Thread Jorge (Jira)
Jorge created ARROW-9461:


 Summary: [Rust] Reading Date32 and Date64 errors - they are 
incorrectly converted to RecordBatch
 Key: ARROW-9461
 URL: https://issues.apache.org/jira/browse/ARROW-9461
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jorge
Assignee: Jorge


Steps to reproduce:

1. Create a file `a.parquet` using the following code:


{code:python}
import pyarrow.parquet
import numpy


def _data_datetime(f):
data = numpy.array([
numpy.datetime64('2018-08-18 23:25'),
numpy.datetime64('2019-08-18 23:25'),
numpy.datetime64("NaT")
])
data = numpy.array(data, dtype=f'datetime64[{f}]')
return data

def _write_parquet(path, data):
table = pyarrow.Table.from_arrays([pyarrow.array(data)], names=['a'])
pyarrow.parquet.write_table(table, path)
return path


_write_parquet('a.parquet', _data_datetime('D'))
{code}

2. Write a small example to read it to RecordBatches (a sketch is given after 
this list)

3. observe the error {{ArrowError(ParquetError("InvalidArgumentError(\"column 
types must match schema types, expected Date32(Day) but found UInt32 at column 
index 0\")"))}}







--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9447) [Rust][DataFusion] Allow closures as ScalarUDFs

2020-07-13 Thread Jorge (Jira)
Jorge created ARROW-9447:


 Summary: [Rust][DataFusion] Allow closures as ScalarUDFs
 Key: ARROW-9447
 URL: https://issues.apache.org/jira/browse/ARROW-9447
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Jorge
Assignee: Jorge


Currently, a ScalarUDF is required to be a plain function. However, most 
applications would prefer to declare a UDF as a closure.

Let's add support for this.
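
The core requirement in plain Rust, as a self-contained sketch ({{ScalarUdf}} and {{register}} below are hypothetical stand-ins, not DataFusion's API): store the function as an {{Fn}} trait object, so that plain functions and capturing closures are both accepted.

{code:java}
use std::sync::Arc;

// Store UDFs as `Fn` trait objects instead of plain function pointers.
type ScalarFn = Arc<dyn Fn(f64) -> f64 + Send + Sync>;

struct ScalarUdf {
    name: String,
    fun: ScalarFn,
}

fn register(udfs: &mut Vec<ScalarUdf>, name: &str, fun: ScalarFn) {
    udfs.push(ScalarUdf { name: name.to_string(), fun });
}

fn main() {
    let mut udfs = Vec::new();

    // a plain function still works...
    fn pow2(x: f64) -> f64 { x * x }
    register(&mut udfs, "pow2", Arc::new(pow2));

    // ...and so does a closure that captures its environment
    let factor = 3.0;
    register(&mut udfs, "times3", Arc::new(move |x| x * factor));

    for udf in &udfs {
        println!("{}(2) = {}", udf.name, (udf.fun)(2.0));
    }
}
{code}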



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9427) [Rust][DataFusion] Add pub fn ExecutionContext.tables()

2020-07-13 Thread Jorge (Jira)
Jorge created ARROW-9427:


 Summary: [Rust][DataFusion] Add pub fn ExecutionContext.tables()
 Key: ARROW-9427
 URL: https://issues.apache.org/jira/browse/ARROW-9427
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Jorge


This allows users to know what names can be passed to {{table()}}.
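
A self-contained sketch of a possible shape ({{ExecutionContext}} below is a minimal stand-in; the {{datasources}} field name is illustrative):

{code:java}
use std::collections::HashMap;

// Minimal stand-in for DataFusion's context, holding registered tables by name.
struct ExecutionContext {
    datasources: HashMap<String, ()>,
}

impl ExecutionContext {
    /// Returns the names of all registered tables, i.e. the valid
    /// arguments to `table()`.
    pub fn tables(&self) -> Vec<String> {
        self.datasources.keys().cloned().collect()
    }
}

fn main() {
    let mut ctx = ExecutionContext { datasources: HashMap::new() };
    ctx.datasources.insert("test".to_string(), ());
    println!("{:?}", ctx.tables());
}
{code}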



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9425) [Rust][DataFusion] Make ExecutionContext sharable between threads

2020-07-13 Thread Jorge (Jira)
Jorge created ARROW-9425:


 Summary: [Rust][DataFusion] Make ExecutionContext sharable between 
threads
 Key: ARROW-9425
 URL: https://issues.apache.org/jira/browse/ARROW-9425
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - DataFusion
Reporter: Jorge


I have been playing with pyo3 to ship and call this library directly from 
Python, and the current ExecutionContext can't be shared across threads in 
Python due to the usage of `Box` for `datasources`.

This issue likely affects integration of DataFusion with other libraries.

I propose that we allow ExecutionContext to be sharable between threads.
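
A self-contained sketch of the proposed change ({{TableProvider}} here is a stand-in for DataFusion's trait): hold the datasources as {{Arc}} with {{Send + Sync}} bounds, instead of {{Box}}, so the context can cross thread boundaries.

{code:java}
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

// Stand-in for DataFusion's trait; the Send + Sync bounds are the point.
trait TableProvider: Send + Sync {}

struct ExecutionContext {
    // Arc<dyn TableProvider> instead of Box<dyn TableProvider>
    datasources: HashMap<String, Arc<dyn TableProvider>>,
}

fn main() {
    let ctx = Arc::new(ExecutionContext { datasources: HashMap::new() });
    // the context can now be moved into and used from another thread
    let handle = {
        let ctx = ctx.clone();
        thread::spawn(move || ctx.datasources.len())
    };
    println!("tables seen from another thread: {}", handle.join().unwrap());
}
{code}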




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9423) Add join

2020-07-12 Thread Jorge (Jira)
Jorge created ARROW-9423:


 Summary: Add join
 Key: ARROW-9423
 URL: https://issues.apache.org/jira/browse/ARROW-9423
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust - DataFusion
Reporter: Jorge


A major operation in analytics is the join. This issue concerns adding the join 
operation.

Given the complexity of this task, I propose starting with a sub-set of all 
joins, an inner join whose "ON" can only be a set of column names (i.e. no 
expressions).

Suggestion for DOD:

* physical plan to execute the join
* logical plan with the join
* SQL planner with the join
* tests on each of the above

One idea to perform this join in parallel is to, for each RecordBatch on the 
left, perform the join against the records on the right. Another way is to 
first hash-partition by key and sort on both sides, and then perform a 
"SortMergeJoin" on each of the partitions. There may be better ways to achieve 
this, though.
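
As a self-contained sketch of the hash-join idea (plain vectors stand in for Arrow arrays; {{inner_join_indices}} is hypothetical): build a hash table on the left key and probe it with the right key.

{code:java}
use std::collections::HashMap;

/// Returns the (left, right) row-index pairs of an inner equi-join.
fn inner_join_indices(left: &[i32], right: &[i32]) -> Vec<(usize, usize)> {
    // build phase: hash table from key to the left rows holding it
    let mut table: HashMap<i32, Vec<usize>> = HashMap::new();
    for (i, key) in left.iter().enumerate() {
        table.entry(*key).or_default().push(i);
    }
    // probe phase: look each right key up in the table
    let mut pairs = Vec::new();
    for (j, key) in right.iter().enumerate() {
        if let Some(left_rows) = table.get(key) {
            for &i in left_rows {
                pairs.push((i, j));
            }
        }
    }
    pairs
}

fn main() {
    // key 2 appears twice on the left, so the right row 0 matches twice
    println!("{:?}", inner_join_indices(&[1, 2, 2], &[2, 3]));
}
{code}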



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9420) Add repartition/shuffle plan

2020-07-12 Thread Jorge (Jira)
Jorge created ARROW-9420:


 Summary: Add repartition/shuffle plan
 Key: ARROW-9420
 URL: https://issues.apache.org/jira/browse/ARROW-9420
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Some operations (group by, join, over(window.partition_by)) greatly benefit 
from hash partitioning.

This is a proposal to add hash partitioning (based on an expression) to this 
library, so that optimizers can be written to optimize the plan based on the 
required hashing.
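
As a self-contained sketch of the routing rule (plain vectors instead of Arrow arrays; {{partition_indices}} is hypothetical): each row goes to partition {{hash(key) % num_partitions}}.

{code:java}
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Routes each row to a partition based on the hash of its key.
fn partition_indices(keys: &[i64], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut partitions = vec![Vec::new(); num_partitions];
    for (row, key) in keys.iter().enumerate() {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        // equal keys always land in the same partition
        partitions[(hasher.finish() as usize) % num_partitions].push(row);
    }
    partitions
}

fn main() {
    // the two rows with key 10 are guaranteed to share a partition
    println!("{:?}", partition_indices(&[10, 11, 12, 10], 2));
}
{code}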




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9382) Add boolean to valid keys of groupBy

2020-07-08 Thread Jorge (Jira)
Jorge created ARROW-9382:


 Summary: Add boolean to valid keys of groupBy
 Key: ARROW-9382
 URL: https://issues.apache.org/jira/browse/ARROW-9382
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Currently we do not support boolean columns as groupBy keys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2020-02-09 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033194#comment-17033194
 ] 

Jorge commented on ARROW-6947:
--

I finished a first draft of this that I am minimally happy with, see 
[https://github.com/jorgecarleitao/arrow/commit/0c77ff531cee191712d08633080de9caed786b62]
 . Not all function signatures are implemented, on purpose, as I think we need 
to make a design decision with respect to them.

The API is used as shown in [this 
test|https://github.com/jorgecarleitao/arrow/commit/0c77ff531cee191712d08633080de9caed786b62#diff-8273e76b6910baa123f3a25a967af3b5R572]:
{code:java}
// declare a function
let f = FunctionSignature::UInt64UInt64(Arc::new(Box::new(|a: u64| Ok(a * a))));

// register function
ctx.register_function("pow2", f);

let results = collect(&mut ctx, "SELECT c2, pow2(c2) FROM test")?;
{code}
The important aspect here is that `FunctionSignature::UInt64UInt64` is an enum 
variant of the enum FunctionSignature (this is the design choice that we need 
to discuss).

I took this design decision based on the following hypotheses:
 # there are limited use-cases for UDFs with many arguments
 # there are a limited number of different types in arrow (25ish)

Under these hypotheses, we can enumerate all variations in the relevant 
`match`. I am not entirely convinced that this is the correct approach, but it 
is the one most aligned with how the code base currently enumerates all types 
in `match` statements, and it does not resort to dynamic typing.

An alternative design is to use std::any::Any. I can also play around with this 
and see how it feels. Assuming that we can do it, I suspect that the trade-off 
is between runtime cost and compilation time plus binary size.

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2020-02-07 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032835#comment-17032835
 ] 

Jorge commented on ARROW-6947:
--

[~andygrove], can I take a shot at this?

My strategy to tackle this:
 # add support for unary udfs
 # support binary udfs
 # add a macro to support UDFs with an arbitrary number of arguments (I do not 
see an easy way to implement variadic generics)

From what I gathered, we need to
 # Implement a new kernel (ala {{arrow::compute::kernels::arithmetic}}) to 
execute a closure
 # Implement a generic expression that implements {{PhysicalExpr}}
 # Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like 
the {{datasources}} attribute)
 # bits and pieces for integration

The design I was planning for the expression:

 
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<
        Box<dyn Fn(&PrimitiveArray<T>) -> arrow::error::Result<PrimitiveArray<R>> + Sync + Send + 'static>,
    >,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}{code}
 

for the kernel:
{code:java}
pub fn unary_op<T: ArrowPrimitiveType, R: ArrowPrimitiveType>(
    op: &dyn Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>{code}
 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs

2020-02-07 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032835#comment-17032835
 ] 

Jorge edited comment on ARROW-6947 at 2/8/20 6:58 AM:
--

[~andygrove], can I take a shot at this?

My strategy to tackle this:
 # add support for unary udfs
 # support binary udfs
 # add a macro to support UDFs with an arbitrary number of arguments (I do not 
see an easy way to implement variadic generics)

From what I gathered, we need to
 # Implement a new kernel (ala {{arrow::compute::kernels::arithmetic}}) to 
execute a closure
 # Implement a generic expression that implements {{PhysicalExpr}}
 # Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like 
the {{datasources}} attribute)
 # bits and pieces for integration

The design I was planning for the expression:

 
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<
        Box<dyn Fn(&PrimitiveArray<T>) -> arrow::error::Result<PrimitiveArray<R>> + Sync + Send + 'static>,
    >,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}{code}
 

for the kernel:
{code:java}
pub fn unary_op<T: ArrowPrimitiveType, R: ArrowPrimitiveType>(
    op: &dyn Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>{code}
 


was (Author: jorgecarleitao):
[~andygrove], can I take a shot at this?

My strategy to tackle this:
 # add support for unary udfs
 # support binary udfs
 # add a macro to support UDFs with an arbitrary number of arguments (I do not 
see an easy way to implement variadic generics)

From what I gathered, we need to
 # Implement a new kernel (ala {{arrow::compute::kernels::arithmetic}}) to 
execute a closure
 # Implement a generic expression that implements {{PhysicalExpr}}
 # Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like 
the {{datasources}} attribute)
 # bits and pieces for integration (how the )

The design I was planning for the expression:

 
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<
        Box<dyn Fn(&PrimitiveArray<T>) -> arrow::error::Result<PrimitiveArray<R>> + Sync + Send + 'static>,
    >,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}{code}
 

for the kernel:
{code:java}
pub fn unary_op<T: ArrowPrimitiveType, R: ArrowPrimitiveType>(
    op: &dyn Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>{code}
 

> [Rust] [DataFusion] Add support for scalar UDFs
> ---
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>
> As a user, I would like to be able to define my own functions and then use 
> them in SQL statements.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7795) [Rust - DataFusion] Support boolean negation (NOT)

2020-02-07 Thread Jorge (Jira)
Jorge created ARROW-7795:


 Summary: [Rust - DataFusion] Support boolean negation (NOT)
 Key: ARROW-7795
 URL: https://issues.apache.org/jira/browse/ARROW-7795
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


This is a proposal to support the negation operation NOT.

The user should be able to write something like

{code:sql}
SELECT a, b FROM t WHERE NOT a
{code}
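
As a self-contained sketch of the kernel's semantics ({{Option}} values stand in for Arrow's nullable {{BooleanArray}}; nulls stay null):

{code:java}
/// Negates a nullable boolean column; null stays null.
fn not(values: &[Option<bool>]) -> Vec<Option<bool>> {
    values.iter().map(|v| v.map(|b| !b)).collect()
}

fn main() {
    assert_eq!(
        not(&[Some(true), None, Some(false)]),
        vec![Some(false), None, Some(true)]
    );
}
{code}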

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-7787) Add collect to Table API

2020-02-06 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-7787:
-
Comment: was deleted

(was: PR: https://github.com/apache/arrow/pull/6375)

> Add collect to Table API
> 
>
> Key: ARROW-7787
> URL: https://issues.apache.org/jira/browse/ARROW-7787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Time Spent: 10m
>  Remaining Estimate: 1h 50m
>
> Currently, executing using the table API requires some effort: given a table 
> `t`:
> {code:java}
> plan = t.to_logical_plan()
> plan = ctx.optimize(plan)
> plan = ctx.create_physical_plan(plan, batch_size)
> result = ctx.collect(plan)
> {code}
> This issue proposes 2 new public methods, one for Table,
> {code:java}
> fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
> {code}
> and one for ExecutionContext,
> {code:java}
> pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
> {code}
> that optimize, execute and collect the results of the Table/LogicalPlan 
> respectively, in the same spirit of `ExecutionContext.sql`.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7787) Add collect to Table API

2020-02-06 Thread Jorge (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031914#comment-17031914
 ] 

Jorge commented on ARROW-7787:
--

PR: https://github.com/apache/arrow/pull/6375

> Add collect to Table API
> 
>
> Key: ARROW-7787
> URL: https://issues.apache.org/jira/browse/ARROW-7787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Currently, executing using the table API requires some effort: given a table 
> `t`:
> {code:java}
> plan = t.to_logical_plan()
> plan = ctx.optimize(plan)
> plan = ctx.create_physical_plan(plan, batch_size)
> result = ctx.collect(plan)
> {code}
> This issue proposes 2 new public methods, one for Table,
> {code:java}
> fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
> {code}
> and one for ExecutionContext,
> {code:java}
> pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
> {code}
> that optimize, execute and collect the results of the Table/LogicalPlan 
> respectively, in the same spirit of `ExecutionContext.sql`.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7787) Add collect to Table API

2020-02-06 Thread Jorge (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-7787:
-
Description: 
Currently, executing using the table API requires some effort: given a table 
`t`:
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes 2 new public methods, one for Table,
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan 
respectively, in the same spirit of `ExecutionContext.sql`.

 

 

 

  was:
Currently, executing using the table API requires some effort: given a table 
`t`:

 
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes a 2 new public methods, one for Table,

 
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan 
respectively, in the same spirit of `ExecutionContext.sql`.

 

 

 


> Add collect to Table API
> 
>
> Key: ARROW-7787
> URL: https://issues.apache.org/jira/browse/ARROW-7787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Currently, executing using the table API requires some effort: given a table 
> `t`:
> {code:java}
> plan = t.to_logical_plan()
> plan = ctx.optimize(plan)
> plan = ctx.create_physical_plan(plan, batch_size)
> result = ctx.collect(plan)
> {code}
> This issue proposes 2 new public methods, one for Table,
> {code:java}
> fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
> {code}
> and one for ExecutionContext,
> {code:java}
> pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
> {code}
> that optimize, execute and collect the results of the Table/LogicalPlan 
> respectively, in the same spirit of `ExecutionContext.sql`.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7787) Add collect to Table API

2020-02-06 Thread Jorge (Jira)
Jorge created ARROW-7787:


 Summary: Add collect to Table API
 Key: ARROW-7787
 URL: https://issues.apache.org/jira/browse/ARROW-7787
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Jorge


Currently, executing using the table API requires some effort: given a table 
`t`:

 
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes 2 new public methods, one for Table,

 
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan 
respectively, in the same spirit of `ExecutionContext.sql`.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge closed ARROW-5302.

Resolution: Not A Bug

> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 't': [list(range(0, 180 * 60, 5))] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> table = pq.read_table(path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df['t'].to_json()
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 481676
> 618584
> 755396
> 892156
> 1028892
> 1165660
> 1302428
> 1439184
> 1620376
> 1801340
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  
> Note: it is not entirely obvious that this is caused by pyarrow instead of 
> pandas or numpy. I was only able to reproduce it through write/read from 
> pyarrow.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837905#comment-16837905
 ] 

Jorge commented on ARROW-5302:
--

It seems that this is not related to arrow. The following arrow-independent 
code leaks:
{code:java}
import resource

import pandas as pd

# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
't': [pd.np.array(range(0, 180 * 60, 5))] * batches,
})


# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
print(df['t'].iloc[0].shape, df['t'].iloc[0].dtype)
df['t'] = df['t'].apply(lambda x: pd.np.array(list(x)))
df['t'].to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}

> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 't': [list(range(0, 180 * 60, 5))] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> table = pq.read_table(path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df['t'].to_json()
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 481676
> 618584
> 755396
> 892156
> 1028892
> 1165660
> 1302428
> 1439184
> 1620376
> 1801340
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  
> Note: it is not entirely obvious that this is caused by pyarrow instead of 
> pandas or numpy. I was only able to reproduce it through write/read from 
> pyarrow.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-5302:
-
Description: 
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
't': [list(range(0, 180 * 60, 5))] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

table = pq.read_table(path)


# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df['t'].to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
481676
618584
755396
892156
1028892
1165660
1302428
1439184
1620376
1801340
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 

Note: it is not entirely obvious that this is caused by pyarrow instead of 
pandas or numpy. I was only able to reproduce it through write/read from 
pyarrow.

 

  was:
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
't': [list(range(0, 180 * 60, 5))] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

table = pq.read_table(path)


# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df['t'].to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
481676
618584
755396
892156
1028892
1165660
1302428
1439184
1620376
1801340
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 


> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 't': [list(range(0, 180 * 60, 5))] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> table = pq.read_table(path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df['t'].to_json()
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 481676
> 618584
> 755396
> 892156
> 1028892
> 1165660
> 1302428
> 1439184
> 1620376
> 1801340
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  
> Note: it is not entirely obvious that this is caused by pyarrow instead of 
> pandas or numpy. I was only able to reproduce it through write/read from 
> pyarrow.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-5302:
-
Description: 
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
't': [list(range(0, 180 * 60, 5))] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

table = pq.read_table(path)


# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df['t'].to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
481676
618584
755396
892156
1028892
1165660
1302428
1439184
1620376
1801340
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 

  was:
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
'a': ['AA%d' % i for i in range(batches)],
't': [list(range(0, 180 * 60, 5))] * batches,
'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
5))),
'u': ['t'] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df.to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
785560
1065460
1383532
1607676
1924820
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 


> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 't': [list(range(0, 180 * 60, 5))] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> table = pq.read_table(path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df['t'].to_json()
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 481676
> 618584
> 755396
> 892156
> 1028892
> 1165660
> 1302428
> 1439184
> 1620376
> 1801340
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-5302:
-
Summary: Memory leak when read_table().to_pandas().to_json()  (was: Memory 
leak when read_table().to_pandas().to_json(orient='records'))

> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 'a': ['AA%d' % i for i in range(batches)],
> 't': [list(range(0, 180 * 60, 5))] * batches,
> 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
> 5))),
> 'u': ['t'] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df.to_json(orient='records')
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 785560
> 1065460
> 1383532
> 1607676
> 1924820
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()

2019-05-11 Thread Jorge (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge updated ARROW-5302:
-
Description: 
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
'a': ['AA%d' % i for i in range(batches)],
't': [list(range(0, 180 * 60, 5))] * batches,
'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
5))),
'u': ['t'] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df.to_json()
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
785560
1065460
1383532
1607676
1924820
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 

  was:
The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
'a': ['AA%d' % i for i in range(batches)],
't': [list(range(0, 180 * 60, 5))] * batches,
'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
5))),
'u': ['t'] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df.to_json(orient='records')
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
785560
1065460
1383532
1607676
1924820
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 


> Memory leak when read_table().to_pandas().to_json()
> ---
>
> Key: ARROW-5302
> URL: https://issues.apache.org/jira/browse/ARROW-5302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0
> Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>Reporter: Jorge
>Priority: Major
>  Labels: memory-leak
>
> The following piece of code (running on a Linux, Python 3.6 from anaconda) 
> demonstrates a memory leak when reading data from disk.
> {code:java}
> import resource
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> # some random data, some of them as array columns
> path = 'data.parquet'
> batches = 5000
> df = pd.DataFrame({
> 'a': ['AA%d' % i for i in range(batches)],
> 't': [list(range(0, 180 * 60, 5))] * batches,
> 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
> 5))),
> 'u': ['t'] * batches,
> })
> pq.write_table(pa.Table.from_pandas(df), path)
> # read the data above and convert it to json (e.g. the backend of a restful 
> API)
> for i in range(100):
> # comment any of the 2 lines for the leak to vanish.
> df = pq.read_table(path).to_pandas()
> df.to_json()
> print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
> {code}
> Result :
> {code:java}
> 785560
> 1065460
> 1383532
> 1607676
> 1924820
> ...{code}
> Relevant pip freeze:
> pyarrow (0.13.0)
> pandas (0.24.2)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5302) Memory leak when read_table().to_pandas().to_json(orient='records')

2019-05-11 Thread Jorge (JIRA)
Jorge created ARROW-5302:


 Summary: Memory leak when 
read_table().to_pandas().to_json(orient='records')
 Key: ARROW-5302
 URL: https://issues.apache.org/jira/browse/ARROW-5302
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
Reporter: Jorge


The following piece of code (running on a Linux, Python 3.6 from anaconda) 
demonstrates a memory leak when reading data from disk.
{code:java}
import resource

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


# some random data, some of them as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
'a': ['AA%d' % i for i in range(batches)],
't': [list(range(0, 180 * 60, 5))] * batches,
'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 
5))),
'u': ['t'] * batches,
})

pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
# comment any of the 2 lines for the leak to vanish.
df = pq.read_table(path).to_pandas()
df.to_json(orient='records')
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

{code}
Result :
{code:java}
785560
1065460
1383532
1607676
1924820
...{code}
Relevant pip freeze:

pyarrow (0.13.0)

pandas (0.24.2)

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)