[jira] [Created] (ARROW-10129) [Rust] Cargo build is rebuilding dependencies on arrow changes
Jorge created ARROW-10129: - Summary: [Rust] Cargo build is rebuilding dependencies on arrow changes Key: ARROW-10129 URL: https://issues.apache.org/jira/browse/ARROW-10129 Project: Apache Arrow Issue Type: Task Components: Rust Affects Versions: 1.0.1 Reporter: Jorge Fix For: 2.0.0 There is a potential issue in the dependencies causing Rust to rebuild them on changes to the arrow crate. I was unable to fully grasp what is going on, but this seems to be a resurfacing of ARROW-9600 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10128) [Rust] Dictionary-encoding is out of spec
Jorge created ARROW-10128: - Summary: [Rust] Dictionary-encoding is out of spec Key: ARROW-10128 URL: https://issues.apache.org/jira/browse/ARROW-10128 Project: Apache Arrow Issue Type: Task Components: Rust Reporter: Jorge According to [the spec|https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout], every array can be dictionary-encoded, whereby its values are encoded as indices into a unique set of values. However, none of our arrays support this encoding, and the physical memory layout of this encoding is not being fulfilled. We have a DictionaryArray, but, AFAIK, it does not respect the physical memory layout set out by the spec. -- This message was sent by Atlassian Jira (v8.3.4#803005)
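The layout the spec describes can be sketched in plain Rust: a keys array of indices into a dictionary of unique values. This is a minimal illustration of the encoding, not the arrow crate's API; the function name and signature are assumptions.

```rust
use std::collections::HashMap;

/// Dictionary-encode `values` into (keys, dictionary): each key is an index
/// into a unique set of values, matching the spec's description.
/// Illustrative only; the real kernel would produce Arrow buffers.
fn dictionary_encode(values: &[&str]) -> (Vec<i32>, Vec<String>) {
    let mut dictionary: Vec<String> = Vec::new();
    let mut index: HashMap<String, i32> = HashMap::new();
    let mut keys = Vec::with_capacity(values.len());
    for v in values {
        let key = *index.entry(v.to_string()).or_insert_with(|| {
            dictionary.push(v.to_string());
            (dictionary.len() - 1) as i32
        });
        keys.push(key);
    }
    (keys, dictionary)
}
```

A spec-conforming DictionaryArray would store the keys as its own physical array and the dictionary as a child, rather than materializing the values inline.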
[jira] [Created] (ARROW-10113) [Rust] Implement conversion of array::Array to ArrowArray
Jorge created ARROW-10113: - Summary: [Rust] Implement conversion of array::Array to ArrowArray Key: ARROW-10113 URL: https://issues.apache.org/jira/browse/ARROW-10113 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10112) [Rust] Implement conversion of ArrowArray to array::Array
Jorge created ARROW-10112: - Summary: [Rust] Implement conversion of ArrowArray to array::Array Key: ARROW-10112 URL: https://issues.apache.org/jira/browse/ARROW-10112 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10111) [Rust] Add basics to test round-trip of consumption to a Rust struct
Jorge created ARROW-10111: - Summary: [Rust] Add basics to test round-trip of consumption to a Rust struct Key: ARROW-10111 URL: https://issues.apache.org/jira/browse/ARROW-10111 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10110) [Rust] Add support to consume C Data Interface
Jorge created ARROW-10110: - Summary: [Rust] Add support to consume C Data Interface Key: ARROW-10110 URL: https://issues.apache.org/jira/browse/ARROW-10110 Project: Apache Arrow Issue Type: Sub-task Components: Rust Reporter: Jorge Assignee: Jorge Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10109) [Rust] Add support to produce and consume a C Data interface
Jorge created ARROW-10109: - Summary: [Rust] Add support to produce and consume a C Data interface Key: ARROW-10109 URL: https://issues.apache.org/jira/browse/ARROW-10109 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Jorge Assignee: Jorge Fix For: 3.0.0 The goal of this issue is to support importing C Data arrays into Rust via FFI. The use-case that motivated this issue was the possibility of running DataFusion from Python and supporting moving arrays from DataFusion to Python/Pyarrow and vice-versa. In particular, so that users can write Python UDFs that expect arrow arrays and return arrow arrays, in the same spirit as pandas UDFs work in Spark. The brute-force way of writing these arrays is by converting element by element from and to native types. The efficient way is to pass the memory addresses from and to each implementation, which is zero-copy. To support the latter, we need an FFI implementation in Rust that produces and consumes [the C Data Interface|https://arrow.apache.org/docs/format/CDataInterface.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
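The C Data Interface document defines a C ABI struct that both sides must agree on. A Rust mirror of that struct might look as follows; the field names follow the spec's `struct ArrowArray`, while the Rust-side name `FFI_ArrowArray` is an assumption for illustration.

```rust
use std::os::raw::c_void;

/// Mirror of `struct ArrowArray` from the Arrow C Data Interface.
/// `#[repr(C)]` guarantees the C field layout, so pointers to this struct
/// can cross the FFI boundary without copying the underlying buffers.
#[repr(C)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    /// Called by the consumer so the producer can free its memory.
    pub release: Option<unsafe extern "C" fn(*mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}
```

Producing an array means filling this struct with pointers into Rust-owned buffers and a `release` callback that drops them; consuming means reading the pointers and calling `release` when done.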
[jira] [Created] (ARROW-10096) [Rust] [DataFusion] Remove unused code
Jorge created ARROW-10096: - Summary: [Rust] [DataFusion] Remove unused code Key: ARROW-10096 URL: https://issues.apache.org/jira/browse/ARROW-10096 Project: Apache Arrow Issue Type: Task Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge The CsvBatchIterator on datafusion::datasources::csv is unused, and seems to be duplicated code from physical_plan::csv. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10086) [Rust] Migrate min_large_string -> min_string kernels
Jorge created ARROW-10086: - Summary: [Rust] Migrate min_large_string -> min_string kernels Key: ARROW-10086 URL: https://issues.apache.org/jira/browse/ARROW-10086 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge Currently, the kernels are named `min_string`, `max_string`, `max_large_string`, `min_large_string`, but this is no longer needed, as the string arrays are now the same (generic) struct -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10084) [Rust] [DataFusion] Add length of large string array
Jorge created ARROW-10084: - Summary: [Rust] [DataFusion] Add length of large string array Key: ARROW-10084 URL: https://issues.apache.org/jira/browse/ARROW-10084 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10065) [Rust] DRY downcasted Arrays
Jorge created ARROW-10065: - Summary: [Rust] DRY downcasted Arrays Key: ARROW-10065 URL: https://issues.apache.org/jira/browse/ARROW-10065 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Assignee: Jorge Currently, we have a significant amount of repeated code to cover slight variations of downcasted arrays: * BinaryArray * LargeBinaryArray * ListArray * StringArray * LargeStringArray The goal of this issue is to clean up this repeated code. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10060) [Rust] MergeExec currently discards partitions with errors
Jorge created ARROW-10060: - Summary: [Rust] MergeExec currently discards partitions with errors Key: ARROW-10060 URL: https://issues.apache.org/jira/browse/ARROW-10060 Project: Apache Arrow Issue Type: Bug Reporter: Jorge Assignee: Jorge IMO they should be accounted for. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10048) [Rust] Error in aggregate of min/max for strings
Jorge created ARROW-10048: - Summary: [Rust] Error in aggregate of min/max for strings Key: ARROW-10048 URL: https://issues.apache.org/jira/browse/ARROW-10048 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge There is an error in computing the min/max of strings with nulls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10046) [Rust] [DataFusion] Make `*Iterator` implement Iterator
Jorge created ARROW-10046: - Summary: [Rust] [DataFusion] Make `*Iterator` implement Iterator Key: ARROW-10046 URL: https://issues.apache.org/jira/browse/ARROW-10046 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge Currently, we use {{next_batch() -> Result<Option<RecordBatch>>}} to iterate over batches. However, this is very similar to the iterator pattern that Rust offers. This issue concerns migrating our code to {{next() -> Option<Result<RecordBatch>>}} on the trait Iterator, so that we can leverage the rich Rust iterator API. -- This message was sent by Atlassian Jira (v8.3.4#803005)
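The migration can be sketched in a few lines; `Batch` and `BatchReader` are stand-ins for `RecordBatch` and the concrete `*Iterator` types, not DataFusion's actual names.

```rust
/// Stand-in for RecordBatch.
struct Batch(Vec<i32>);

/// Stand-in for a `*Iterator` type: instead of a bespoke
/// `next_batch() -> Result<Option<Batch>>`, it implements `Iterator`
/// with `Item = Result<Batch, String>`.
struct BatchReader {
    batches: std::vec::IntoIter<Batch>,
}

impl Iterator for BatchReader {
    type Item = Result<Batch, String>;
    fn next(&mut self) -> Option<Self::Item> {
        // A real reader would surface I/O or decoding errors here.
        self.batches.next().map(Ok)
    }
}
```

With `Iterator` implemented, the whole combinator API (`map`, `filter`, `collect`, `sum`, ...) becomes available to callers for free.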
[jira] [Created] (ARROW-10044) [Rust] Improve README
Jorge created ARROW-10044: - Summary: [Rust] Improve README Key: ARROW-10044 URL: https://issues.apache.org/jira/browse/ARROW-10044 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10042) [Rust] Buffer equalities are incorrect
Jorge created ARROW-10042: - Summary: [Rust] Buffer equalities are incorrect Key: ARROW-10042 URL: https://issues.apache.org/jira/browse/ARROW-10042 Project: Apache Arrow Issue Type: Bug Affects Versions: 1.0.1 Reporter: Jorge Two (byte) buffers are equal if their contents are equal. However, currently, {{BufferData}} comparison ({{PartialEq}}) uses its {{capacity}} as part of the comparison. Therefore, two slices are considered different even when their content is the same but the corresponding {{BufferData}} has a different capacity. Since this equality is used by {{Buffer}}'s equality, which is used by all our arrays, currently arrays are different when their contents are the same but the underlying capacity is different. I am not sure that this is what we want. -- This message was sent by Atlassian Jira (v8.3.4#803005)
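The fix the report implies can be modeled in plain Rust: equality compares the bytes in use and ignores the allocation's capacity. `BufferData` here is a stand-in for the arrow struct of the same name, not its real definition.

```rust
/// Stand-in for arrow's BufferData: contents plus an allocation capacity.
struct BufferData {
    data: Vec<u8>,
    /// Allocation detail; deliberately excluded from equality below.
    capacity: usize,
}

impl PartialEq for BufferData {
    fn eq(&self, other: &Self) -> bool {
        // Two byte buffers are equal iff their contents are equal;
        // `capacity` plays no role.
        self.data == other.data
    }
}
```

Since array equality is built on buffer equality, this change makes arrays with identical contents compare equal regardless of how their buffers were allocated.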
[jira] [Created] (ARROW-10039) [Rust] Do not require memory alignment of buffers
Jorge created ARROW-10039: - Summary: [Rust] Do not require memory alignment of buffers Key: ARROW-10039 URL: https://issues.apache.org/jira/browse/ARROW-10039 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge Currently, all buffers in Rust are created aligned. We also panic when a buffer is not aligned. However, [the spec|https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding] only recommends, not requires, memory alignment. This issue aims at following the spec more closely, by removing the requirement (i.e. not panicking) when memory passed to a buffer is not aligned. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10036) [Rust] [DataFusion] Test that the final schema is expected in integration tests
Jorge created ARROW-10036: - Summary: [Rust] [DataFusion] Test that the final schema is expected in integration tests Key: ARROW-10036 URL: https://issues.apache.org/jira/browse/ARROW-10036 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Currently, our integration tests convert a RecordBatch to a string, which we use for testing, but they do not test that the final schema matches our expectations. We should add a test for this, which includes: # field name # field type # field nullability for every field in the schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10030) [Rust] Support fromIter and toIter
Jorge created ARROW-10030: - Summary: [Rust] Support fromIter and toIter Key: ARROW-10030 URL: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing (dump of the proposal:) Rust Arrow supports two main computational models: # Batch operations, which leverage some form of vectorization # Element-by-element operations, which emerge in more complex operations This document concerns element-by-element operations, which are the most common operations outside of the library. h2. Element-by-element operations These operations are programmatically written as: # Downcast the array to its specific type # Initialize buffers # Iterate over indices and perform the operation, appending to the buffers accordingly # Create ArrayData with the required null bitmap, buffers, children, etc. # Return an ArrayRef from the ArrayData We can split this process into 3 parts: # Initialization (1 and 2) # Iteration (3) # Finalization (4 and 5) Currently, the API that we offer to our users is: # as_any() to downcast the array based on its DataType # Builders for all types, which users can initialize, matching the downcasted array # Iterate # Use for i in (0..array.len()) # Use Array::value(i) and Array::is_valid(i)/is_null(i) # Use builder.append_value(new_value) or builder.append_null() # Finish the builder and wrap the result in an Arc This API has some issues: # value(i) +is unsafe+, even though it is not marked as such # Builders are usually slow due to the checks that they need to perform # The API is not intuitive
h2. Proposal This proposal aims at improving this API in 2 specific ways: * Implement IntoIterator, so that an array can be iterated as Option values of its native type * Implement FromIterator over Option values of the native type, so that users can write: {code:java}
let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);

// to and from iter, with a +1
let result: Int32Array = array
    .iter()
    .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
    .collect();

let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code} This results in an API that is: # Efficient, as it is our responsibility to create `FromIterator` implementations that are efficient in populating the buffers/children etc. from an iterator # Safe, as it does not allow segfaults # Simple, as users do not need to worry about Builders, buffers, etc., only native Rust. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10028) [Rust] Simplify macro def_numeric_from_vec
Jorge created ARROW-10028: - Summary: [Rust] Simplify macro def_numeric_from_vec Key: ARROW-10028 URL: https://issues.apache.org/jira/browse/ARROW-10028 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge Currently we need to pass too many parameters to it, when they can be inferred. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10020) [Rust] Implement math kernels
Jorge created ARROW-10020: - Summary: [Rust] Implement math kernels Key: ARROW-10020 URL: https://issues.apache.org/jira/browse/ARROW-10020 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Some of these kernels are in DataFusion, and could be migrated to the arrow crate. Note that packed_simd has SIMD implementations for some of these operations, which we could leverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10019) [Rust] Add substring kernel
Jorge created ARROW-10019: - Summary: [Rust] Add substring kernel Key: ARROW-10019 URL: https://issues.apache.org/jira/browse/ARROW-10019 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Jorge Assignee: Jorge substring returns a substring of a StringArray starting at a given index, and with a given optional length. {{fn substring(array: &StringArray, start: i32, length: Option<u32>) -> Result<StringArray>}} This operation is common on strings, and it is useful for string-based transformations -- This message was sent by Atlassian Jira (v8.3.4#803005)
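The kernel's behavior can be sketched over a plain slice of nullable strings; the real kernel would operate on StringArray offset/value buffers, and this signature is an assumption for illustration only.

```rust
/// Illustrative substring kernel: applies (start, optional length) to every
/// non-null value, passing nulls through. Out-of-range slices clamp to the
/// string's end rather than erroring, a design choice for this sketch.
fn substring(values: &[Option<&str>], start: usize, length: Option<usize>) -> Vec<Option<String>> {
    values
        .iter()
        .map(|v| {
            v.map(|s| {
                let begin = start.min(s.len());
                let end = length.map_or(s.len(), |l| (start + l).min(s.len()));
                s.get(begin..end).unwrap_or("").to_string()
            })
        })
        .collect()
}
```

A buffer-based implementation would instead rewrite the offsets array and share the values buffer where possible.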
[jira] [Created] (ARROW-10016) [Rust] [DataFusion] Implement IsNull and IsNotNull
Jorge created ARROW-10016: - Summary: [Rust] [DataFusion] Implement IsNull and IsNotNull Key: ARROW-10016 URL: https://issues.apache.org/jira/browse/ARROW-10016 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Jorge Currently, DataFusion has the logical operators `IsNull` and `IsNotNull`, but those operators have no physical implementation. Consequently, they cannot be used. The goal of this improvement is to add support for these operators in the physical plan. Note that these operators only care about the null bitmap, and thus should be implementable for all types supported by Arrow. Both operators should return a non-null `BooleanArray`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
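Because only the validity bitmap matters, the kernels reduce to reading one bit per value. A minimal sketch, using the spec's LSB bit order and a `Vec<bool>` in place of a `BooleanArray`:

```rust
/// IsNull over a validity bitmap (LSB order, per the Arrow spec):
/// bit i set means value i is valid, so a cleared bit means null.
fn is_null(validity: &[u8], len: usize) -> Vec<bool> {
    (0..len)
        .map(|i| (validity[i / 8] >> (i % 8)) & 1 == 0)
        .collect()
}

/// IsNotNull is the bitwise negation of IsNull.
fn is_not_null(validity: &[u8], len: usize) -> Vec<bool> {
    is_null(validity, len).into_iter().map(|b| !b).collect()
}
```

An array with no validity bitmap at all would simply yield all-false for IsNull and all-true for IsNotNull.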
[jira] [Created] (ARROW-10015) [Rust] Implement SIMD for aggregate kernel sum
Jorge created ARROW-10015: - Summary: [Rust] Implement SIMD for aggregate kernel sum Key: ARROW-10015 URL: https://issues.apache.org/jira/browse/ARROW-10015 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Currently, our aggregations are made in a simple loop. However, as described [here|https://rust-lang.github.io/packed_simd/perf-guide/vert-hor-ops.html], horizontal operations can also be SIMDed, with reported speedups of 2.7x. The goal of this improvement is to support SIMD for the "sum", for primitive types. The code to modify is in [here|https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/aggregate.rs]. A good indication that this issue is completed is when the script {{cargo bench --bench aggregate_kernels && cargo bench --bench aggregate_kernels --features simd}} yields a speed-up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
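The strategy from the linked perf guide can be modeled without packed_simd: accumulate into several independent lanes (a vertical operation the compiler can vectorize) and reduce horizontally only once at the end. This is a scalar sketch of the idea, not the kernel's actual code.

```rust
/// SIMD-style sum: LANES independent partial sums, then one horizontal
/// reduction. With packed_simd the inner loop would be a single vector add.
fn lane_sum(values: &[f64]) -> f64 {
    const LANES: usize = 4;
    let mut acc = [0.0f64; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Vertical step: each lane accumulates independently.
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += v;
        }
    }
    // Horizontal reduction, plus the tail that did not fill a chunk.
    acc.iter().sum::<f64>() + chunks.remainder().iter().sum::<f64>()
}
```

Note that re-associating the additions this way can change floating-point results slightly, which is the usual trade-off of SIMD sums.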
[jira] [Created] (ARROW-10010) [Rust] Speedup arithmetic
Jorge created ARROW-10010: - Summary: [Rust] Speedup arithmetic Key: ARROW-10010 URL: https://issues.apache.org/jira/browse/ARROW-10010 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge There are some optimizations possible in the arithmetic kernels. PR to follow -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10001) [Rust] [DataFusion] Add developer guide to README
Jorge created ARROW-10001: - Summary: [Rust] [DataFusion] Add developer guide to README Key: ARROW-10001 URL: https://issues.apache.org/jira/browse/ARROW-10001 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9990) [Rust] [DataFusion] NOT is not plannable
Jorge created ARROW-9990: Summary: [Rust] [DataFusion] NOT is not plannable Key: ARROW-9990 URL: https://issues.apache.org/jira/browse/ARROW-9990 Project: Apache Arrow Issue Type: Bug Reporter: Jorge Assignee: Jorge We have the physical operator, but it is not usable in the logical planning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9988) [Rust] [DataFusion] Add std::ops to logical expressions
Jorge created ARROW-9988: Summary: [Rust] [DataFusion] Add std::ops to logical expressions Key: ARROW-9988 URL: https://issues.apache.org/jira/browse/ARROW-9988 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge So that we can write {{col("a") + col("b")}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9987) [Rust] [DataFusion] Improve docs of `Expr`.
Jorge created ARROW-9987: Summary: [Rust] [DataFusion] Improve docs of `Expr`. Key: ARROW-9987 URL: https://issues.apache.org/jira/browse/ARROW-9987 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9984) [Rust] [DataFusion] DRY of function to string
Jorge created ARROW-9984: Summary: [Rust] [DataFusion] DRY of function to string Key: ARROW-9984 URL: https://issues.apache.org/jira/browse/ARROW-9984 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9977) [Rust] Add min/max for [Large]String
Jorge created ARROW-9977: Summary: [Rust] Add min/max for [Large]String Key: ARROW-9977 URL: https://issues.apache.org/jira/browse/ARROW-9977 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge Strings are ordered, and thus we can apply min/max to them as to other types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
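Since strings are totally ordered, the kernel has the same shape as the numeric aggregates: fold over the array, skipping nulls. A plain-Rust sketch over nullable values (the real kernel would iterate a StringArray):

```rust
/// min over nullable strings: nulls are skipped; an all-null (or empty)
/// input yields None. max is identical with the comparison flipped.
fn min_string<'a>(values: &[Option<&'a str>]) -> Option<&'a str> {
    values
        .iter()
        .filter_map(|v| *v) // drop nulls
        .fold(None, |acc, v| match acc {
            Some(m) if m <= v => Some(m),
            _ => Some(v),
        })
}
```

Skipping nulls in the fold (instead of comparing against them) is also what avoids the min/max-with-nulls bug reported in ARROW-10048.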
[jira] [Created] (ARROW-9971) [Rust] Speedup take
Jorge created ARROW-9971: Summary: [Rust] Speedup take Key: ARROW-9971 URL: https://issues.apache.org/jira/browse/ARROW-9971 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9966) [Rust] Speedup aggregate kernels
Jorge created ARROW-9966: Summary: [Rust] Speedup aggregate kernels Key: ARROW-9966 URL: https://issues.apache.org/jira/browse/ARROW-9966 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9954) [Rust] [DataFusion] Simplify code of aggregate planning
Jorge created ARROW-9954: Summary: [Rust] [DataFusion] Simplify code of aggregate planning Key: ARROW-9954 URL: https://issues.apache.org/jira/browse/ARROW-9954 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9950) [Rust] [DataFusion] Allow UDF usage without registry
Jorge created ARROW-9950: Summary: [Rust] [DataFusion] Allow UDF usage without registry Key: ARROW-9950 URL: https://issues.apache.org/jira/browse/ARROW-9950 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge This functionality is relevant only for the DataFrame API. Sometimes a UDF declaration happens during planning, and the API is much more expressive when the user does not have to access the registry at all to plan the UDF. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9937) [Rust] [DataFusion] Average is not correct
Jorge created ARROW-9937: Summary: [Rust] [DataFusion] Average is not correct Key: ARROW-9937 URL: https://issues.apache.org/jira/browse/ARROW-9937 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge The current design of aggregates makes the calculation of the average incorrect. It also makes it impossible to compute the [geometric mean|https://en.wikipedia.org/wiki/Geometric_mean], distinct sum, and other operations. The central issue is that Accumulator returns a `ScalarValue` during partial aggregations via {{get_value}}, but very often a `ScalarValue` is not sufficient information to perform the full aggregation. A simple example is the average of 5 numbers, x1, x2, x3, x4, x5, that are distributed in batches of 2, {[x1, x2], [x3, x4], [x5]}. Our current calculation performs partial means, {(x1+x2)/2, (x3+x4)/2, x5}, and then reduces them using another average, i.e. {{((x1+x2)/2 + (x3+x4)/2 + x5)/3}}, which is not equal to {{(x1 + x2 + x3 + x4 + x5)/5}}. I believe that our Accumulators need to pass more information from the partial aggregations to the final aggregation. We could consider adopting an API equivalent to [Spark's](https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), i.e. have an `update`, a `merge` and an `evaluate`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
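The update/merge/evaluate shape suggested above can be sketched for the average: the partial state carries (sum, count) instead of a single partial mean, so merging partials stays exact regardless of how the rows are batched. Names here are illustrative, not DataFusion's trait.

```rust
/// Partial state for AVG: (sum, count) is sufficient information,
/// unlike a single partial mean.
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl AvgAccumulator {
    fn new() -> Self {
        AvgAccumulator { sum: 0.0, count: 0 }
    }
    /// Fold one input value into the partial state.
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }
    /// Combine the partial state of another partition.
    fn merge(&mut self, other: &AvgAccumulator) {
        self.sum += other.sum;
        self.count += other.count;
    }
    /// Produce the final value once all partials are merged.
    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

With the batches {[1, 2], [3, 4], [5]} from the example, merging the three partial (sum, count) states gives (15, 5) and hence the correct mean 3, where averaging the partial means would give (1.5 + 3.5 + 5)/3 ≈ 3.33.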
[jira] [Created] (ARROW-9922) [Rust] Add `try_from(Vec>)` to StructArray
Jorge created ARROW-9922: Summary: [Rust] Add `try_from(Vec>)` to StructArray Key: ARROW-9922 URL: https://issues.apache.org/jira/browse/ARROW-9922 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9921) [Rust] Add `from(Vec>)` to [Large]StringArray
Jorge created ARROW-9921: Summary: [Rust] Add `from(Vec>)` to [Large]StringArray Key: ARROW-9921 URL: https://issues.apache.org/jira/browse/ARROW-9921 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge and deprecate TryFrom, which currently uses a builder. This should yield some performance improvement, and it simplifies the interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9919) [Rust] [DataFusion] Math functions
Jorge created ARROW-9919: Summary: [Rust] [DataFusion] Math functions Key: ARROW-9919 URL: https://issues.apache.org/jira/browse/ARROW-9919 Project: Apache Arrow Issue Type: Sub-task Reporter: Jorge Assignee: Jorge See main issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9918) [Rust] [DataFusion] Move away from builders to create arrays
Jorge created ARROW-9918: Summary: [Rust] [DataFusion] Move away from builders to create arrays Key: ARROW-9918 URL: https://issues.apache.org/jira/browse/ARROW-9918 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge [~yordan-pavlov] hinted in https://issues.apache.org/jira/browse/ARROW-8908 that there is a performance hit when using array builders. Indeed, I get about 15% speedup by using {{From}} instead of a builder in DataFusion's math operations, and this number _increases with batch size_, which is detrimental since larger batch sizes leverage SIMD better. In {{sqrt_X_Y}} below, X is the log2 of the array's size and Y is the log2 of the batch size: {code:java}
sqrt_20_12  time:   [34.422 ms 34.503 ms 34.584 ms]
            change: [-16.333% -16.055% -15.806%] (p = 0.00 < 0.05)
            Performance has improved.
sqrt_22_12  time:   [150.13 ms 150.79 ms 151.42 ms]
            change: [-16.281% -15.488% -14.779%] (p = 0.00 < 0.05)
            Performance has improved.
sqrt_22_14  time:   [151.45 ms 151.68 ms 151.90 ms]
            change: [-18.233% -16.919% -15.888%] (p = 0.00 < 0.05)
            Performance has improved.
{code} See how there is no difference between {{sqrt_20_12}} and {{sqrt_22_12}} (same batch size), but a measurable difference between {{sqrt_22_12}} and {{sqrt_22_14}} (different batch size). This issue proposes that we migrate our DataFusion and Rust operations from builder-based array construction to non-builder-based construction, whenever possible. 
Most code can be easily migrated to use the {{From}} trait; a migration is along the lines of {code:java}
let mut builder = Float64Builder::new(array.len());
for i in 0..array.len() {
    if array.is_null(i) {
        builder.append_null()?;
    } else {
        builder.append_value(array.value(i).$FUNC())?;
    }
}
Ok(Arc::new(builder.finish()))
{code} to {code:java}
let result: Float64Array = (0..array.len())
    .map(|i| {
        if array.is_null(i) {
            None
        } else {
            Some(array.value(i).$FUNC())
        }
    })
    .collect::<Vec<Option<f64>>>()
    .into();
Ok(Arc::new(result))
{code} and this is even simpler, as we do not need to use an extra API (Builder). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9910) [Rust] [DataFusion] Type coercion of Variadic is wrong
Jorge created ARROW-9910: Summary: [Rust] [DataFusion] Type coercion of Variadic is wrong Key: ARROW-9910 URL: https://issues.apache.org/jira/browse/ARROW-9910 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge Specifically, the case {code:java}
// common type is u64
case(
    vec![DataType::UInt32, DataType::UInt64],
    Signature::Variadic(vec![DataType::UInt32, DataType::UInt64]),
    vec![DataType::UInt64, DataType::UInt64],
)?,
{code} fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9902) [Rust] [DataFusion] Add support for array()
Jorge created ARROW-9902: Summary: [Rust] [DataFusion] Add support for array() Key: ARROW-9902 URL: https://issues.apache.org/jira/browse/ARROW-9902 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge `array` is a function that takes an arbitrary number of columns and returns a fixed-size list with their values. This function is notoriously difficult to implement because it receives an arbitrary number of arguments of an arbitrary but common type, but it is also useful, e.g. for time-series data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9892) [Rust] [DataFusion] Add support for concat
Jorge created ARROW-9892: Summary: [Rust] [DataFusion] Add support for concat Key: ARROW-9892 URL: https://issues.apache.org/jira/browse/ARROW-9892 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge So that we can concatenate strings together. {{pub fn concat(args: Vec<Expr>) -> Expr}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9891) [Rust] [DataFusion] Make math functions support f32
Jorge created ARROW-9891: Summary: [Rust] [DataFusion] Make math functions support f32 Key: ARROW-9891 URL: https://issues.apache.org/jira/browse/ARROW-9891 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge Given a math function `g`, we compute g(f32) using g(cast(f32 AS f64)). The goal of this issue is to make the operation be cast(g(f32) AS f64) instead. Since computations on f32 are faster than on f64, this is a simple optimization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9887) [Rust] [DataFusion] Add support for complex return types of built-in functions
Jorge created ARROW-9887: Summary: [Rust] [DataFusion] Add support for complex return types of built-in functions Key: ARROW-9887 URL: https://issues.apache.org/jira/browse/ARROW-9887 Project: Apache Arrow Issue Type: Improvement Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9886) [Rust] [DataFusion] Simplify code to test cast
Jorge created ARROW-9886: Summary: [Rust] [DataFusion] Simplify code to test cast Key: ARROW-9886 URL: https://issues.apache.org/jira/browse/ARROW-9886 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge We have 3 tests with similar functionality that vary only in the types they test. Let's create a macro to apply to all of them, so that the tests are equivalent and DRY. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9885) Simplify type coercion for binary types
Jorge created ARROW-9885: Summary: Simplify type coercion for binary types Key: ARROW-9885 URL: https://issues.apache.org/jira/browse/ARROW-9885 Project: Apache Arrow Issue Type: Task Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge The function `numerical_coercion` uses the operator `op` only for its error formatting, but the function's intent can be generalized to "coerce two types to numerically equivalent types". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9849) [Rust] [DataFusion] Make UDFs not need a Field
Jorge created ARROW-9849: Summary: [Rust] [DataFusion] Make UDFs not need a Field Key: ARROW-9849 URL: https://issues.apache.org/jira/browse/ARROW-9849 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge [https://github.com/apache/arrow/pull/7967] shows that it is possible to not require users to pass a `Field` to UDF declarations and instead just pass a `DataType`. Let's deprecate `Field` in them and just use `DataType`. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9836) [Rust] [DataFusion] Improve API for usage of UDFs
Jorge created ARROW-9836: Summary: [Rust] [DataFusion] Improve API for usage of UDFs Key: ARROW-9836 URL: https://issues.apache.org/jira/browse/ARROW-9836 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge TL;DR: currently, users call UDFs through {{df.select(scalar_functions("sqrt", vec![col("a")], DataType::Float64))}} Proposal: {{let udf = df.registry()?; df.select(udf("sqrt", vec![col("a")])?)}} so that they do not have to remember the UDF's return type when using it. This API will in the future allow declaring the UDF as part of the planning, as in Spark, instead of having to register it in the registry before using it (we just need to check whether the UDF is registered before doing so). See complete proposal here: [https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9835) [Rust] [DataFusion] Remove FunctionMeta
Jorge created ARROW-9835: Summary: [Rust] [DataFusion] Remove FunctionMeta Key: ARROW-9835 URL: https://issues.apache.org/jira/browse/ARROW-9835 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Currently, our code works as follows: 1. Declare a UDF via {{udf::ScalarFunction}} 2. Register the UDF 3. call {{scalar_function("name", vec![...], type)}} to use it during planning. However, during planning, we: 1. Get the ScalarFunction by name 2. convert it to FunctionMetadata 3. get the ScalarFunction associated with FunctionMetadata's name I.e. {{FunctionMetadata}} does not seem to be needed. Goal: remove {{FunctionMetadata}} and just pass {{udf::ScalarFunction}} directly from the registry to the physical plan. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9831) [Rust] [DataFusion] Fix compilation error
Jorge created ARROW-9831: Summary: [Rust] [DataFusion] Fix compilation error Key: ARROW-9831 URL: https://issues.apache.org/jira/browse/ARROW-9831 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9830) [Rust] [DataFusion] Verify that projection push down does not remove aliased columns
Jorge created ARROW-9830: Summary: [Rust] [DataFusion] Verify that projection push down does not remove aliased columns Key: ARROW-9830 URL: https://issues.apache.org/jira/browse/ARROW-9830 Project: Apache Arrow Issue Type: Test Components: Rust, Rust - DataFusion Reporter: Jorge See comments: https://github.com/apache/arrow/pull/7919#issuecomment-678662753 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9815) [Rust] [DataFusion] Deadlock in creation of physical plan with two UDFs
Jorge created ARROW-9815: Summary: [Rust] [DataFusion] Deadlock in creation of physical plan with two UDFs Key: ARROW-9815 URL: https://issues.apache.org/jira/browse/ARROW-9815 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge This one took me some time to understand, but I finally have a reproducible example: when two UDFs are called, one after the other, we cause a deadlock when creating the physical plan. Example test:
{code}
#[test]
fn csv_query_sqrt_sqrt() -> Result<()> {
    let mut ctx = create_ctx()?;
    register_aggregate_csv(&mut ctx)?;
    let sql = "SELECT sqrt(sqrt(c12)) FROM aggregate_test_100 LIMIT 1";
    let actual = execute(&mut ctx, sql);
    // sqrt(sqrt(c12=0.9294097332465232)) = 0.9818650561397431
    let expected = "0.9818650561397431".to_string();
    assert_eq!(actual.join("\n"), expected);
    Ok(())
}
{code}
I believe that this is due to the recursive nature of the physical planner, which locks scalar_functions within a match, blocking the whole thing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
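The locking pattern described above can be reproduced with std alone. A minimal sketch (hypothetical names; this is not DataFusion's actual planner code) showing why re-locking a std::sync::Mutex from the same thread cannot succeed:

```rust
use std::sync::Mutex;

fn main() {
    // Hypothetical stand-in for the scalar_functions registry.
    let scalar_functions = Mutex::new(vec!["sqrt"]);

    // Outer call: the planner locks the registry while matching on the plan.
    let guard = scalar_functions.lock().unwrap();
    assert_eq!(guard[0], "sqrt");

    // Inner (recursive) call while the outer guard is still alive:
    // `lock()` here would block forever because std's Mutex is not
    // re-entrant; `try_lock()` makes the failure observable instead.
    assert!(scalar_functions.try_lock().is_err());

    drop(guard);
    // once the outer guard is dropped, locking succeeds again
    assert!(scalar_functions.try_lock().is_ok());
}
```

A nested `sqrt(sqrt(...))` triggers exactly this shape: the recursive planning call tries to take the lock the outer call still holds.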
[jira] [Created] (ARROW-9809) [Rust] [DataFusion] logical schema = physical schema is not true
Jorge created ARROW-9809: Summary: [Rust] [DataFusion] logical schema = physical schema is not true Key: ARROW-9809 URL: https://issues.apache.org/jira/browse/ARROW-9809 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge In tests/sql.rs, we test that the physical and the optimized schema must match. However, this is not necessarily true for all our queries. An example:
{code:java}
#[test]
fn csv_query_sum_cast() {
    let mut ctx = ExecutionContext::new();
    register_aggregate_csv_by_sql(&mut ctx);
    // c8 = i32; c9 = i64
    let sql = "SELECT c8 + c9 FROM aggregate_test_100";
    // check that the physical and logical schemas are equal
    execute(&mut ctx, sql);
}
{code}
The physical expression (and schema) of this operation, after optimization, is {{CAST(c8 as Int64) Plus c9}} (this test fails). AFAIK, the invariant of the optimizer is that the output types and nullability are the same. Also, note that the reason the optimized logical schema equals the logical schema is that our type coercer does not change the output names of the schema, even though it re-writes logical expressions. I.e. after the optimization, `.to_field()` of an expression may no longer match the field name nor type in the Plan's schema. IMO this is currently by (implicit?) design, as we do not want our logical schema's column names to change during optimizations, or all column references may point to non-existent columns. This is something that was brought up on the mailing list about polymorphism. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9793) [Rust] [DataFusion] Tests failing in master
Jorge created ARROW-9793: Summary: [Rust] [DataFusion] Tests failing in master Key: ARROW-9793 URL: https://issues.apache.org/jira/browse/ARROW-9793 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9788) Handle naming inconsistencies between SQL, DataFrame API and struct names
Jorge created ARROW-9788: Summary: Handle naming inconsistencies between SQL, DataFrame API and struct names Key: ARROW-9788 URL: https://issues.apache.org/jira/browse/ARROW-9788 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Assignee: Andy Grove Currently, we have naming inconsistencies between the different APIs that make them a bit confusing. The typical example atm is that `df.where().to_plan?.explain()` shows a "Selection" in the plan, when "Selection" in SQL and many other query languages is a projection, not a filter. Other examples:
```
name: Selection    SQL: WHERE     DF: filter
name: Aggregation  SQL: GROUP BY  DF: aggregate
name: Projection   SQL: SELECT    DF: select, select_columns
```
I suggest that we align them with a common notation, preferably one aligned with other, more common query languages. I am assigning this to you [~andygrove] as you are probably the only person that can take a decision on this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9779) [Rust] [DataFusion] Increase stability of average accumulator
Jorge created ARROW-9779: Summary: [Rust] [DataFusion] Increase stability of average accumulator Key: ARROW-9779 URL: https://issues.apache.org/jira/browse/ARROW-9779 Project: Apache Arrow Issue Type: Improvement Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge Currently, our method to compute the average is based on: 1. compute the sum of all terms 2. compute the count of all terms 3. compute sum / count However, the sum may overflow. There is a typical solution to this based on an online formula, described e.g. [here|http://www.heikohoffmann.de/htmlthesis/node134.html], that keeps the intermediate numbers small. -- This message was sent by Atlassian Jira (v8.3.4#803005)
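The online formula can be sketched with std only. This is a hypothetical accumulator (illustrative names, not the DataFusion API) implementing the recurrence avg_{n+1} = avg_n + (x - avg_n) / (n + 1), which never materializes the full sum:

```rust
// Hypothetical online-mean accumulator. Instead of computing
// sum / count at the end, it keeps a running average and updates it
// per value with avg_{n+1} = avg_n + (x - avg_n) / (n + 1), so no
// large intermediate sum is ever materialized.
struct OnlineMean {
    count: u64,
    mean: f64,
}

impl OnlineMean {
    fn new() -> Self {
        OnlineMean { count: 0, mean: 0.0 }
    }

    fn update(&mut self, value: f64) {
        self.count += 1;
        self.mean += (value - self.mean) / self.count as f64;
    }
}

fn main() {
    let mut acc = OnlineMean::new();
    for &v in [1.0, 2.0, 3.0, 4.0].iter() {
        acc.update(v);
    }
    // mean of 1, 2, 3, 4 is 2.5
    assert!((acc.mean - 2.5).abs() < 1e-12);
}
```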
[jira] [Created] (ARROW-9778) [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests
Jorge created ARROW-9778: Summary: [Rust] [DataFusion] Logical and physical schemas' nullability does not match in 8 out of 20 end-to-end tests Key: ARROW-9778 URL: https://issues.apache.org/jira/browse/ARROW-9778 Project: Apache Arrow Issue Type: Bug Components: Rust, Rust - DataFusion Reporter: Jorge In `tests/sql.rs`, if we re-write the `execute` function to also test the final schemas, as
```
/// Execute query and return result set as tab delimited string
fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<String> {
    let plan = ctx.create_logical_plan(sql).unwrap();
    let plan = ctx.optimize(&plan).unwrap();
    let physical_plan = ctx.create_physical_plan(&plan).unwrap();
    let results = ctx.collect(physical_plan.as_ref()).unwrap();
    if results.len() > 0 {
        // results must match the logical schema
        assert_eq!(plan.schema().as_ref(), results[0].schema().as_ref());
    }
    result_str(&results)
}
```
we end up with 8 tests failing, which indicates that our physical and logical plans are not aligned. In all cases, the issue is nullability: our logical plan assumes nullability = true, while our physical plan may change the nullability field. If we do not plan to track nullability on the logical level, we could consider replacing Schema by a type that does not track nullability. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9756) [Rust] [DataFusion] Add support for generic return types of scalar UDFs
Jorge created ARROW-9756: Summary: [Rust] [DataFusion] Add support for generic return types of scalar UDFs Key: ARROW-9756 URL: https://issues.apache.org/jira/browse/ARROW-9756 Project: Apache Arrow Issue Type: New Feature Reporter: Jorge Assignee: Jorge Currently, we have math functions declared as UDFs that, while they can be evaluated against float32 and float64, have a fixed return type (float64). This issue is about extending the UDF interface to support generic return types (a function), so that developers (us included) can register UDFs with variable return types. There is a small overhead in this, since evaluating types now requires a function call; however, IMO this is still very small, as we are talking about plans. We can always apply some caching if needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
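A minimal sketch of the idea, assuming a hypothetical `ScalarUdf` struct and a stand-in `DataType` enum (not the actual arrow/DataFusion types): the return type becomes a function of the argument types, evaluated at planning time.

```rust
// Sketch only: `ScalarUdf` and this `DataType` enum are illustrative
// stand-ins, not the actual arrow/DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Float32,
    Float64,
}

struct ScalarUdf {
    name: String,
    // the return type is a function of the argument types, evaluated
    // at planning time (the "small overhead" mentioned above)
    return_type: Box<dyn Fn(&[DataType]) -> DataType>,
}

fn main() {
    // a sqrt-like UDF that preserves the width of its float argument
    let sqrt = ScalarUdf {
        name: "sqrt".to_string(),
        return_type: Box::new(|args| args[0].clone()),
    };
    assert_eq!((sqrt.return_type)(&[DataType::Float32]), DataType::Float32);
    assert_eq!((sqrt.return_type)(&[DataType::Float64]), DataType::Float64);
    assert_eq!(sqrt.name, "sqrt");
}
```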
[jira] [Created] (ARROW-9752) [Rust] [DataFusion] Add support for Aggregate UDFs
Jorge created ARROW-9752: Summary: [Rust] [DataFusion] Add support for Aggregate UDFs Key: ARROW-9752 URL: https://issues.apache.org/jira/browse/ARROW-9752 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge This will allow us to more easily extend the existing offering of aggregate functions. The existing functions shall be migrated to this interface. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9751) [Rust] [DataFusion] Extend UDFs to accept more than one type per argument
Jorge created ARROW-9751: Summary: [Rust] [DataFusion] Extend UDFs to accept more than one type per argument Key: ARROW-9751 URL: https://issues.apache.org/jira/browse/ARROW-9751 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9678) [Rust] [DataFusion] Improve projection push down to remove unused columns
Jorge created ARROW-9678: Summary: [Rust] [DataFusion] Improve projection push down to remove unused columns Key: ARROW-9678 URL: https://issues.apache.org/jira/browse/ARROW-9678 Project: Apache Arrow Issue Type: New Feature Components: Rust, Rust - DataFusion Reporter: Jorge Assignee: Jorge Currently, the projection push down only removes columns that are never referenced in the plan. However, sometimes a projection declares columns that are themselves never used. This issue is about improving the projection push-down to remove any column that is not logically required by the plan. Failing unit-test with the idea:
{code:java}
#[test]
fn table_unused_column() -> Result<()> {
    let table_scan = test_table_scan()?;
    assert_eq!(3, table_scan.schema().fields().len());
    assert_fields_eq(&table_scan, vec!["a", "b", "c"]);

    // we never use "b" in the first projection => remove it
    let plan = LogicalPlanBuilder::from(&table_scan)
        .project(vec![col("c"), col("a"), col("b")])?
        .filter(col("c").gt(lit(1)))?
        .project(vec![col("c"), col("a")])?
        .build()?;

    assert_fields_eq(&plan, vec!["c", "a"]);

    let expected = "\
    Projection: #c, #a\
    \n  Selection: #c Gt Int32(1)\
    \n    Projection: #c, #a\
    \n      TableScan: test projection=Some([0, 2])";

    assert_optimized_plan_eq(&plan, expected);

    Ok(())
}
{code}
This issue was first identified by [~andygrove] [here|https://github.com/ballista-compute/ballista/issues/320]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
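The core of the improved push-down can be sketched with std only; `prune` and the column names below are illustrative stand-ins for the optimizer pass, not DataFusion code. Given the set of columns required by the plan nodes above a projection, the pass keeps only the declared columns that are actually required:

```rust
use std::collections::HashSet;

// Sketch only: `prune` is an illustrative stand-in for the optimizer
// pass. It keeps only the projected columns that some node above the
// projection actually requires.
fn prune(projection: &[&str], required: &HashSet<&str>) -> Vec<String> {
    projection
        .iter()
        .filter(|c| required.contains(*c))
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    // the final projection and the filter only need "c" and "a"
    let required: HashSet<&str> = ["c", "a"].iter().copied().collect();
    // the first projection also declares "b", which nothing above uses
    let pruned = prune(&["c", "a", "b"], &required);
    assert_eq!(pruned, vec!["c", "a"]);
}
```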
[jira] [Created] (ARROW-9619) [Rust] [DataFusion] Add predicate push-down
Jorge created ARROW-9619: Summary: [Rust] [DataFusion] Add predicate push-down Key: ARROW-9619 URL: https://issues.apache.org/jira/browse/ARROW-9619 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge Like the title says, add an optimizer that pushes filters as far down the plan as logically possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9618) [Rust] [DataFusion] Make it easier to write optimizers
Jorge created ARROW-9618: Summary: [Rust] [DataFusion] Make it easier to write optimizers Key: ARROW-9618 URL: https://issues.apache.org/jira/browse/ARROW-9618 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge Currently, it is a bit painful to write optimizers, as we need to cover all branches of a match over logical plans. There could be some utility functions to remove boilerplate code around these. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9617) [Rust] [DataFusion] Add length of string array
Jorge created ARROW-9617: Summary: [Rust] [DataFusion] Add length of string array Key: ARROW-9617 URL: https://issues.apache.org/jira/browse/ARROW-9617 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge With ARROW-9615, we can port the operator to DataFusion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9615) Add kernel to compute length of string array
Jorge created ARROW-9615: Summary: Add kernel to compute length of string array Key: ARROW-9615 URL: https://issues.apache.org/jira/browse/ARROW-9615 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9559) [Rust] [DataFusion] Revert privatization of exprlist_to_fields
Jorge created ARROW-9559: Summary: [Rust] [DataFusion] Revert privatization of exprlist_to_fields Key: ARROW-9559 URL: https://issues.apache.org/jira/browse/ARROW-9559 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge This function was privatized by mistake by cd503c3f583dab. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9555) Add inner (hash) join physical plan
Jorge created ARROW-9555: Summary: Add inner (hash) join physical plan Key: ARROW-9555 URL: https://issues.apache.org/jira/browse/ARROW-9555 Project: Apache Arrow Issue Type: Sub-task Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9549) [Rust] Parquet no longer builds
Jorge created ARROW-9549: Summary: [Rust] Parquet no longer builds Key: ARROW-9549 URL: https://issues.apache.org/jira/browse/ARROW-9549 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Andy Grove I believe that there is a typo in {{rust/parquet/Cargo.toml}}: it reads {{arrow = { path = "../arrow", version = "1.0.0-SNAPSHOT", optional = true }}}, but it should read {{arrow = { path = "../arrow", version = "1.1.0-SNAPSHOT", optional = true }}}, or the project does not compile. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9527) [Rust] Remove un-needed dev-dependencies
Jorge created ARROW-9527: Summary: [Rust] Remove un-needed dev-dependencies Key: ARROW-9527 URL: https://issues.apache.org/jira/browse/ARROW-9527 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge We currently have some dev-dependencies that are not needed. Let's remove them -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9520) [Rust] [DataFusion] Can't alias an aggregate expression
Jorge created ARROW-9520: Summary: [Rust] [DataFusion] Can't alias an aggregate expression Key: ARROW-9520 URL: https://issues.apache.org/jira/browse/ARROW-9520 Project: Apache Arrow Issue Type: Bug Components: Rust - DataFusion Reporter: Jorge The following test (on execute) fails:
{code}
#[test]
fn aggregate_with_alias() -> Result<()> {
    let results = execute("SELECT c1, COUNT(c2) AS count FROM test GROUP BY c1", 4)?;
    let batch = &results[0];
    assert_eq!(field_names(batch), vec!["c1", "count"]);
    let expected = vec!["0,10", "1,10", "2,10", "3,10"];
    let mut rows = test::format_batch(batch);
    rows.sort();
    assert_eq!(rows, expected);
    Ok(())
}
{code}
The root cause is that, in {{sql::planner}}, we interpret {{COUNT(c2) AS count}} as an {{Expr::Alias}}, which fails the {{is_aggregate_expr}} condition and is thus interpreted as a grouped expression instead of an aggregate expression. This raises the error {{General("Projection references non-aggregate values")}}. The planner could interpret the statement above as two steps: an aggregation followed by a projection. Alternatively, we could allow aliases to be valid aggregation expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9519) [Rust] Improve error message when getting a field by name from schema
Jorge created ARROW-9519: Summary: [Rust] Improve error message when getting a field by name from schema Key: ARROW-9519 URL: https://issues.apache.org/jira/browse/ARROW-9519 Project: Apache Arrow Issue Type: Improvement Components: Rust Reporter: Jorge Assignee: Jorge Currently, when a name that does not exist on the Schema is passed to {{Schema::index_of}}, the error message is just the name itself. This makes it a bit difficult to debug. I propose that we change that error message to something like {{Unable to get field named "nickname". Valid fields: ["first_name", "last_name", "address"]}} to make it easier to understand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
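A sketch of the proposed message, using a plain slice of field names in place of the actual `Schema` type:

```rust
// Sketch only: a stand-in for Schema::index_of that, on failure,
// lists the valid field names instead of echoing only the missing one.
fn index_of(fields: &[&str], name: &str) -> Result<usize, String> {
    fields.iter().position(|f| *f == name).ok_or_else(|| {
        format!(
            "Unable to get field named \"{}\". Valid fields: {:?}",
            name, fields
        )
    })
}

fn main() {
    let fields = ["first_name", "last_name", "address"];
    assert_eq!(index_of(&fields, "last_name"), Ok(1));
    // the error now lists the valid fields, not just the missing name
    let err = index_of(&fields, "nickname").unwrap_err();
    assert!(err.starts_with("Unable to get field named \"nickname\""));
    assert!(err.contains("first_name"));
}
```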
[jira] [Created] (ARROW-9516) [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes
Jorge created ARROW-9516: Summary: [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes Key: ARROW-9516 URL: https://issues.apache.org/jira/browse/ARROW-9516 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge This issue covers three main topics that IMO are best addressed as a whole in a refactor of the physical plans and expressions in DataFusion. The underlying issues that justify this particular ticket: h3. We currently assign poor names to the output schema. Specifically, most names are given based on the last expression's name. Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the field names "c, SUM, SUM". h3. We currently derive the column names from physical expressions, not logical expressions This implies that logical expressions that perform multiple operations (e.g. a grouped aggregation that performs partitioned aggregations + merge + final aggregation) have their name derived from their physical declaration, not their logical one. IMO a physical plan is an execution plan and is thus not concerned with naming. It is the logical plan that should be concerned with naming. Conceptually, a given logical plan can have more than one physical plan, e.g. depending on the execution environment (e.g. locally vs distributed). h3. We currently carry the index of a column read throughout the plans, making it cumbersome to write optimizers. More details here. In summary, it is possible to remove one of the optimizers and significantly simplify the other if columns do not carry indexing information. h2. Proposal I propose that we: h3. drop {{physical_plan::expressions::Column::index}} This is a major simplification of the code: it allows us to ignore the position of the statement in the schema and instead focus on its name, treating columns based solely on their names and not on their position in the schema. 
Since SQL does not care about the position of the column in the table anyway (we currently already take the first column with that name), this seems natural. I already prototyped this [here|https://github.com/jorgecarleitao/arrow/tree/column_names]. The main conclusion of this prototype is that this is feasible as long as all our expressions are assigned a unique name, which is against what we currently offer (see example above). This leads me to: h3. drop {{physical_plan::PhysicalExpr::name()}} Currently, the name of an expression is derived from its physical plan. However, some operations' names are required to be known before their physical representation. The example I found in our current code is the grouped aggregation described above. If we were to build the name of our aggregation based on its physical plan, the name of a "COUNT(a)" operation would be {{SUM(COUNT(a))}} because, in the physical plan, we first count on each partition, then merge, and then sum the counts over all partitions. Fundamentally, IMO the issue here is that we are mixing responsibilities: the physical plan should not care about naming, because the physical plan corresponds to an execution plan, not a logical description of the column (its name). This leads me to: h3. add {{logicalplan::Expr::name()}} This will contain the name of this expression, which will naturally depend on the variant. Its implementation will be based on our current code for {{physical_plan::PhysicalExpr::name()}}. I can take this work, but before committing, I would like to know your thoughts about this. My initial prototyping indicates that all of this is possible and greatly simplifies the code, but I may be missing a design aspect. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9502) [Python][C++] Date64 converted to Date32 on parquet
Jorge created ARROW-9502: Summary: [Python][C++] Date64 converted to Date32 on parquet Key: ARROW-9502 URL: https://issues.apache.org/jira/browse/ARROW-9502 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Jorge Executing the example below, {code:python} import datetime import pyarrow as pa import pyarrow.parquet data = [ datetime.datetime(2000, 1, 1, 12, 34, 56, 123456), datetime.datetime(2000, 1, 1) ] data32 = pa.array(data, type='date32') data64 = pa.array(data, type='date64') table = pyarrow.Table.from_arrays([data32, data64], names=['a', 'b']) pyarrow.parquet.write_table(table, 'a.parquet') print(table) print() print(pyarrow.parquet.read_table('a.parquet')) {code} yields {code:java} pyarrow.Table a: date32[day] b: date64[ms] pyarrow.Table a: date32[day] b: date32[day] <--- IMO it should be date64[ms] {code} indicating that pyarrow converted its date64[ms] schema to date32[day]. I used the rust crate to print parquet's metadata, and the value is indeed stored as i32, which suggests that this likely happens on the writer, not the reader. IMO this does not have any practical implication, since both types represent dates and either representation covers all practical dates, but it still constitutes an error, as the roundtrip serialization does not yield the same schema. A broader question I have is why date64 exists in the first place: I can't see any reason to store a *date* in milliseconds since EPOCH. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9461) [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch
Jorge created ARROW-9461: Summary: [Rust] Reading Date32 and Date64 errors - they are incorrectly converted to RecordBatch Key: ARROW-9461 URL: https://issues.apache.org/jira/browse/ARROW-9461 Project: Apache Arrow Issue Type: Bug Components: Rust Reporter: Jorge Assignee: Jorge Steps to reproduce: 1. Create a file `a.parquet` using the following code: {code:python} import pyarrow.parquet import numpy def _data_datetime(f): data = numpy.array([ numpy.datetime64('2018-08-18 23:25'), numpy.datetime64('2019-08-18 23:25'), numpy.datetime64("NaT") ]) data = numpy.array(data, dtype=f'datetime64[{f}]') return data def _write_parquet(path, data): table = pyarrow.Table.from_arrays([pyarrow.array(data)], names=['a']) pyarrow.parquet.write_table(table, path) return path _write_parquet('a.parquet', _data_datetime('D')) {code} 2. Write a small example to read it to RecordBatches 3. observe the error {{ArrowError(ParquetError("InvalidArgumentError(\"column types must match schema types, expected Date32(Day) but found UInt32 at column index 0\")"))}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9447) [Rust][DataFusion] Allow closures as ScalarUDFs
Jorge created ARROW-9447: Summary: [Rust][DataFusion] Allow closures as ScalarUDFs Key: ARROW-9447 URL: https://issues.apache.org/jira/browse/ARROW-9447 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Jorge Assignee: Jorge Currently, a ScalarUDF is required to be a function. However, most applications would prefer to declare a UDF as a closure. Let's add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9427) [Rust][DataFusion] Add pub fn ExecutionContext.tables()
Jorge created ARROW-9427: Summary: [Rust][DataFusion] Add pub fn ExecutionContext.tables() Key: ARROW-9427 URL: https://issues.apache.org/jira/browse/ARROW-9427 Project: Apache Arrow Issue Type: New Feature Components: Rust - DataFusion Reporter: Jorge This allows users to know what names can be passed to {{table()}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9425) [Rust][DataFusion] Make ExecutionContext sharable between threads
Jorge created ARROW-9425: Summary: [Rust][DataFusion] Make ExecutionContext sharable between threads Key: ARROW-9425 URL: https://issues.apache.org/jira/browse/ARROW-9425 Project: Apache Arrow Issue Type: Task Components: Rust - DataFusion Reporter: Jorge I have been playing with pyo3 to ship and call this library directly from Python, and the current ExecutionContext can't be used within Python with threads due to the usage of `Box` on `datasources`. This issue likely affects integration of DataFusion with other libraries. I propose that we allow ExecutionContext to be sharable between threads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9423) Add join
Jorge created ARROW-9423: Summary: Add join Key: ARROW-9423 URL: https://issues.apache.org/jira/browse/ARROW-9423 Project: Apache Arrow Issue Type: Task Components: Rust - DataFusion Reporter: Jorge A major operation in analytics is the join. This issue concerns adding the join operation. Given the complexity of this task, I propose starting with a sub-set of all joins: an inner join whose "ON" can only be a set of column names (i.e. no expressions). Suggestion for DOD: * physical plan to execute the join * logical plan with the join * SQL planner with the join * tests on each of the above One idea to perform this join in parallel is to, for each RecordBatch on the left, perform the join against the right side. Another way is to first perform a hash by key and sort on both sides, and then perform a "SortMergeJoin" on each of the partitions. There may be better ways to achieve this, though. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9420) Add repartition/shuffle plan
Jorge created ARROW-9420: Summary: Add repartition/shuffle plan Key: ARROW-9420 URL: https://issues.apache.org/jira/browse/ARROW-9420 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Some operations (group by, join, over(window.partition_by)) greatly benefit from hash partitioning. This is a proposal to add hash partitioning (based on a expression) to this library, so that optimizers can be written to optimize the plan based on the required hashing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
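The partitioning itself can be sketched with std's hasher; `partition_of` is a hypothetical helper, not the proposed API. Hash partitioning guarantees that equal keys land in the same partition, which is the property group-bys, joins and window partitioning rely on:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch only: hash the key expression's value and take it modulo the
// number of partitions, so equal keys always map to the same partition.
fn partition_of<K: Hash>(key: &K, num_partitions: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % num_partitions
}

fn main() {
    let n = 4;
    // equal keys always map to the same partition
    assert_eq!(partition_of(&"a", n), partition_of(&"a", n));
    // partition indexes are always in range
    for key in ["a", "b", "c", "d"].iter() {
        assert!(partition_of(key, n) < n);
    }
}
```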
[jira] [Created] (ARROW-9382) Add boolean to valid keys of groupBy
Jorge created ARROW-9382: Summary: Add boolean to valid keys of groupBy Key: ARROW-9382 URL: https://issues.apache.org/jira/browse/ARROW-9382 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge Currently we do not support boolean columns on groupBy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033194#comment-17033194 ] Jorge commented on ARROW-6947: -- I finished a first draft of this that I am minimally happy with, see [https://github.com/jorgecarleitao/arrow/commit/0c77ff531cee191712d08633080de9caed786b62] . Not all function signatures are implemented, on purpose, as I see we need to take a design decision with respect to them. The API is used as shown in [this test|https://github.com/jorgecarleitao/arrow/commit/0c77ff531cee191712d08633080de9caed786b62#diff-8273e76b6910baa123f3a25a967af3b5R572]:
{code:java}
// declare a function
let f = FunctionSignature::UInt64UInt64(Arc::new(Box::new(|a: u64| Ok(a * a))));
// register function
ctx.register_function("pow2", f);
let results = collect(&mut ctx, "SELECT c2, pow2(c2) FROM test")?;
{code}
The important aspect here is that `FunctionSignature::UInt64UInt64` is an enum variant of the enum FunctionSignature (this is the design choice that we need to discuss). I took this design decision based on the following hypotheses: # there are limited use-cases for UDFs with many arguments # there is a limited number of different types in arrow (25ish) Under these hypotheses, we can enumerate all variations in the relevant `match`. I am not entirely convinced that this is the correct approach, but it is the one most aligned with enumerating all types in matches currently in the code base, and it does not resort to dynamic typing. An alternative design is to use std::any::Any. I can also play around with this and see how it feels. Assuming that we can do it, I suspect that the trade-off is between run-time and compilation time + binary size. 
> [Rust] [DataFusion] Add support for scalar UDFs > --- > > Key: ARROW-6947 > URL: https://issues.apache.org/jira/browse/ARROW-6947 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > As a user, I would like to be able to define my own functions and then use > them in SQL statements. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
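The enum-variant-per-signature idea can be sketched with std only; variant and helper names here are illustrative (the linked draft uses `UInt64UInt64`), and dispatch is a plain match with no dynamic typing:

```rust
// Sketch only: one enum variant per concrete signature. Dispatching a
// call is a plain `match`; a signature mismatch is an ordinary error,
// with no std::any::Any or downcasting involved.
enum FunctionSignature {
    UInt64ToUInt64(Box<dyn Fn(u64) -> Result<u64, String>>),
    Float64ToFloat64(Box<dyn Fn(f64) -> Result<f64, String>>),
}

fn call_u64(sig: &FunctionSignature, arg: u64) -> Result<u64, String> {
    match sig {
        FunctionSignature::UInt64ToUInt64(f) => f(arg),
        _ => Err("type mismatch".to_string()),
    }
}

fn main() {
    // the "pow2" example from the comment above
    let pow2 = FunctionSignature::UInt64ToUInt64(Box::new(|a| Ok(a * a)));
    assert_eq!(call_u64(&pow2, 3), Ok(9));
    // calling with the wrong signature is a plain error, not a panic
    let cube = FunctionSignature::Float64ToFloat64(Box::new(|a| Ok(a * a * a)));
    assert!(call_u64(&cube, 3).is_err());
}
```

The trade-off mentioned above shows up directly here: every new type pair adds a variant and a match arm (compile time and binary size), but each call resolves statically at run time.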
[jira] [Commented] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032835#comment-17032835 ] Jorge commented on ARROW-6947: -- [~andygrove], can I take a shot at this? My strategy to tackle this: # add support for unary udfs # support binary udfs # add macro to support arbitrary-argument udfs (I do not see an easy way to implement variadic generics) From what I gathered, we need to # Implement a new kernel (ala {{arrow::compute::kernels::arithmetic}}) to execute a closure # Implement a generic expression that implements {{PhysicalExpr}} # Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like the attribute datasources) # bits and pieces for integration (how the ) The design I was planning for the expression:
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<dyn Fn(T::Native) -> arrow::error::Result<R::Native> + Sync + Send + 'static>,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}
{code}
for the kernel:
{code:java}
pub fn unary_op<T: ArrowPrimitiveType, R: ArrowPrimitiveType>(
    op: impl Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6947) [Rust] [DataFusion] Add support for scalar UDFs
[ https://issues.apache.org/jira/browse/ARROW-6947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032835#comment-17032835 ] Jorge edited comment on ARROW-6947 at 2/8/20 6:58 AM: -- [~andygrove], can I take a shot at this?

My strategy to tackle this:
# add support for unary udfs
# support binary udfs
# add macro to support arbitrary-argument udfs (I do not see an easy way to implement variadic generics)

From what I gathered, we need to:
# Implement a new kernel (a la {{arrow::compute::kernels::arithmetic}}) to execute a closure
# Implement a generic expression that implements {{PhysicalExpr}}
# Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like the {{datasources}} attribute)
# bits and pieces for integration

The design I was planning for the expression:
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<dyn Fn(T::Native) -> arrow::error::Result<R::Native> + Sync + Send + 'static>,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}{code}
for the kernel:
{code:java}
pub fn unary_op<T, R>(
    op: impl Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>{code}

was (Author: jorgecarleitao): [~andygrove], can I take a shot at this?

My strategy to tackle this:
# add support for unary udfs
# support binary udfs
# add macro to support arbitrary-argument udfs (I do not see an easy way to implement variadic generics)

From what I gathered, we need to:
# Implement a new kernel (a la {{arrow::compute::kernels::arithmetic}}) to execute a closure
# Implement a generic expression that implements {{PhysicalExpr}}
# Implement a {{FunctionProvider}} and add it to {{ExecutionContext}} (like the {{datasources}} attribute)
# bits and pieces for integration (how the )

The design I was planning for the expression:
{code:java}
pub struct UnaryFunctionExpr<T, R>
where
    T: ArrowPrimitiveType, // argument type
    R: ArrowPrimitiveType, // return type
{
    /// name of the function
    name: String,
    func: Arc<dyn Fn(T::Native) -> arrow::error::Result<R::Native> + Sync + Send + 'static>,
    arg: Arc<dyn PhysicalExpr>,
    arg_type: DataType,
    return_type: DataType,
}{code}
for the kernel:
{code:java}
pub fn unary_op<T, R>(
    op: impl Fn(T::Native) -> Result<R::Native>,
    arg: &PrimitiveArray<T>,
) -> Result<PrimitiveArray<R>>{code}

> [Rust] [DataFusion] Add support for scalar UDFs
> -----------------------------------------------
>
> Key: ARROW-6947
> URL: https://issues.apache.org/jira/browse/ARROW-6947
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
>
> As a user, I would like to be able to define my own functions and then use
> them in SQL statements.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
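The closure-driven kernel proposed in the comment above can be illustrated without depending on the arrow crate. This is a minimal sketch under assumptions: `NullableVec<T>` is a hypothetical stand-in for arrow's `PrimitiveArray<T>`, with `None` playing the role of a null slot; the `String` error type is illustrative.

```rust
// Minimal, dependency-free sketch of the closure-based unary kernel idea.
// `NullableVec<T>` is a hypothetical stand-in for arrow's PrimitiveArray<T>;
// `None` plays the role of a null slot in the validity bitmap.
type NullableVec<T> = Vec<Option<T>>;

// Apply a fallible scalar closure to every valid slot, propagating nulls
// and returning the first error, if any.
fn unary_op<T, R, F>(op: F, arg: &NullableVec<T>) -> Result<NullableVec<R>, String>
where
    T: Copy,
    F: Fn(T) -> Result<R, String>,
{
    arg.iter()
        .map(|&slot| slot.map(|v| op(v)).transpose())
        .collect()
}

fn main() {
    let input: NullableVec<i32> = vec![Some(1), None, Some(3)];
    // a UDF that doubles its argument
    let doubled = unary_op(|x| Ok(x * 2), &input).unwrap();
    assert_eq!(doubled, vec![Some(2), None, Some(6)]);
    println!("{:?}", doubled);
}
```

A real kernel would operate on contiguous value buffers plus a validity bitmap rather than `Vec<Option<T>>`, but the shape of the generic signature is the same.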
[jira] [Created] (ARROW-7795) [Rust - DataFusion] Support boolean negation (NOT)
Jorge created ARROW-7795: Summary: [Rust - DataFusion] Support boolean negation (NOT) Key: ARROW-7795 URL: https://issues.apache.org/jira/browse/ARROW-7795 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge This is a proposal to support the negation operation NOT. The user should be able to write something like ```SELECT a, b FROM t WHERE NOT a``` -- This message was sent by Atlassian Jira (v8.3.4#803005)
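SQL negation follows three-valued logic: `NOT NULL` evaluates to NULL. A minimal sketch of what such a kernel has to do over a nullable boolean column (the function name is illustrative, not DataFusion's API):

```rust
// Sketch of boolean negation over a nullable column, as SQL's NOT requires:
// null slots stay null (three-valued logic), valid slots are flipped.
fn negate(values: &[Option<bool>]) -> Vec<Option<bool>> {
    values.iter().map(|v| v.map(|b| !b)).collect()
}

fn main() {
    let col = vec![Some(true), Some(false), None];
    assert_eq!(negate(&col), vec![Some(false), Some(true), None]);
    println!("{:?}", negate(&col));
}
```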
[jira] [Issue Comment Deleted] (ARROW-7787) Add collect to Table API
[ https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-7787: - Comment: was deleted (was: PR: https://github.com/apache/arrow/pull/6375)

> Add collect to Table API
> ------------------------
>
> Key: ARROW-7787
> URL: https://issues.apache.org/jira/browse/ARROW-7787
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Jorge
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 2h
> Time Spent: 10m
> Remaining Estimate: 1h 50m
>
> Currently, executing using the table API requires some effort: given a table `t`:
> {code:java}
> plan = t.to_logical_plan()
> plan = ctx.optimize(plan)
> plan = ctx.create_physical_plan(plan, batch_size)
> result = ctx.collect(plan)
> {code}
> This issue proposes 2 new public methods, one for Table,
> {code:java}
> fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
> {code}
> and one for ExecutionContext,
> {code:java}
> pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
> {code}
> that optimize, execute and collect the results of the Table/LogicalPlan respectively, in the same spirit as `ExecutionContext.sql`.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7787) Add collect to Table API
[ https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031914#comment-17031914 ] Jorge commented on ARROW-7787: -- PR: https://github.com/apache/arrow/pull/6375

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7787) Add collect to Table API
[ https://issues.apache.org/jira/browse/ARROW-7787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-7787: - Description:

Currently, executing using the table API requires some effort: given a table `t`:
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes 2 new public methods, one for Table,
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan respectively, in the same spirit as `ExecutionContext.sql`.

was:

Currently, executing using the table API requires some effort: given a table `t`:
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes a 2 new public methods, one for Table,
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan respectively, in the same spirit as `ExecutionContext.sql`.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7787) Add collect to Table API
Jorge created ARROW-7787: Summary: Add collect to Table API Key: ARROW-7787 URL: https://issues.apache.org/jira/browse/ARROW-7787 Project: Apache Arrow Issue Type: Improvement Components: Rust - DataFusion Reporter: Jorge

Currently, executing using the table API requires some effort: given a table `t`:
{code:java}
plan = t.to_logical_plan()
plan = ctx.optimize(plan)
plan = ctx.create_physical_plan(plan, batch_size)
result = ctx.collect(plan)
{code}
This issue proposes 2 new public methods, one for Table,
{code:java}
fn collect(&self, ctx: &mut ExecutionContext, batch_size: usize) -> Result<Vec<RecordBatch>>;
{code}
and one for ExecutionContext,
{code:java}
pub fn collect_plan(&mut self, plan: &LogicalPlan, batch_size: usize) -> Result<Vec<RecordBatch>>
{code}
that optimize, execute and collect the results of the Table/LogicalPlan respectively, in the same spirit as `ExecutionContext.sql`.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
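The `collect` convenience proposed above is just a composition of the existing optimize / plan / execute steps. A schematic sketch of the pattern, with plain stand-in functions (none of these types or signatures are DataFusion's actual API):

```rust
// Sketch of the `collect_plan` convenience: chain optimize -> physical plan ->
// execute into one call. All stage functions below are illustrative stand-ins.
#[derive(Debug, PartialEq)]
struct Batch(Vec<i64>);

// identity stand-in for the optimizer
fn optimize(plan: Vec<i64>) -> Vec<i64> { plan }

// split the "plan" into batches of at most `batch_size` rows
fn create_physical_plan(plan: Vec<i64>, batch_size: usize) -> Vec<Vec<i64>> {
    plan.chunks(batch_size).map(|c| c.to_vec()).collect()
}

// materialize each chunk as a record batch
fn execute(physical: Vec<Vec<i64>>) -> Vec<Batch> {
    physical.into_iter().map(Batch).collect()
}

// the proposed one-call entry point
fn collect_plan(plan: Vec<i64>, batch_size: usize) -> Vec<Batch> {
    execute(create_physical_plan(optimize(plan), batch_size))
}

fn main() {
    let out = collect_plan(vec![1, 2, 3], 2);
    assert_eq!(out, vec![Batch(vec![1, 2]), Batch(vec![3])]);
    println!("{:?}", out);
}
```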
[jira] [Closed] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge closed ARROW-5302. Resolution: Not A Bug > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 't': [list(range(0, 180 * 60, 5))] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > table = pq.read_table(path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. > df = pq.read_table(path).to_pandas() > df['t'].to_json() > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 481676 > 618584 > 755396 > 892156 > 1028892 > 1165660 > 1302428 > 1439184 > 1620376 > 1801340 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > > Note: it is not entirely obvious that this is caused by pyarrow instead of > pandas or numpy. I was only able to reproduce it through write/read from > pyarrow. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837905#comment-16837905 ] Jorge commented on ARROW-5302: -- It seems that this is not related to arrow. The following arrow-independent code leaks: {code:java} import resource import pandas as pd # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 't': [pd.np.array(range(0, 180 * 60, 5))] * batches, }) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. print(df['t'].iloc[0].shape, df['t'].iloc[0].dtype) df['t'] = df['t'].apply(lambda x: pd.np.array(list(x))) df['t'].to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 't': [list(range(0, 180 * 60, 5))] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > table = pq.read_table(path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. 
> df = pq.read_table(path).to_pandas() > df['t'].to_json() > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 481676 > 618584 > 755396 > 892156 > 1028892 > 1165660 > 1302428 > 1439184 > 1620376 > 1801340 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > > Note: it is not entirely obvious that this is caused by pyarrow instead of > pandas or numpy. I was only able to reproduce it through write/read from > pyarrow. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-5302: - Description: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 't': [list(range(0, 180 * 60, 5))] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) table = pq.read_table(path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. df = pq.read_table(path).to_pandas() df['t'].to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 481676 618584 755396 892156 1028892 1165660 1302428 1439184 1620376 1801340 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) Note: it is not entirely obvious that this is caused by pyarrow instead of pandas or numpy. I was only able to reproduce it through write/read from pyarrow. was: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 't': [list(range(0, 180 * 60, 5))] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) table = pq.read_table(path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. 
df = pq.read_table(path).to_pandas() df['t'].to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 481676 618584 755396 892156 1028892 1165660 1302428 1439184 1620376 1801340 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 't': [list(range(0, 180 * 60, 5))] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > table = pq.read_table(path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. > df = pq.read_table(path).to_pandas() > df['t'].to_json() > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 481676 > 618584 > 755396 > 892156 > 1028892 > 1165660 > 1302428 > 1439184 > 1620376 > 1801340 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > > Note: it is not entirely obvious that this is caused by pyarrow instead of > pandas or numpy. I was only able to reproduce it through write/read from > pyarrow. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-5302: - Description: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 't': [list(range(0, 180 * 60, 5))] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) table = pq.read_table(path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. df = pq.read_table(path).to_pandas() df['t'].to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 481676 618584 755396 892156 1028892 1165660 1302428 1439184 1620376 1801340 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) was: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 'a': ['AA%d' % i for i in range(batches)], 't': [list(range(0, 180 * 60, 5))] * batches, 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))), 'u': ['t'] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. 
df = pq.read_table(path).to_pandas() df.to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 785560 1065460 1383532 1607676 1924820 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 't': [list(range(0, 180 * 60, 5))] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > table = pq.read_table(path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. > df = pq.read_table(path).to_pandas() > df['t'].to_json() > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 481676 > 618584 > 755396 > 892156 > 1028892 > 1165660 > 1302428 > 1439184 > 1620376 > 1801340 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-5302: - Summary: Memory leak when read_table().to_pandas().to_json() (was: Memory leak when read_table().to_pandas().to_json(orient='records')) > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 'a': ['AA%d' % i for i in range(batches)], > 't': [list(range(0, 180 * 60, 5))] * batches, > 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // > 5))), > 'u': ['t'] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. > df = pq.read_table(path).to_pandas() > df.to_json(orient='records') > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 785560 > 1065460 > 1383532 > 1607676 > 1924820 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5302) Memory leak when read_table().to_pandas().to_json()
[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge updated ARROW-5302: - Description: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 'a': ['AA%d' % i for i in range(batches)], 't': [list(range(0, 180 * 60, 5))] * batches, 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))), 'u': ['t'] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. df = pq.read_table(path).to_pandas() df.to_json() print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 785560 1065460 1383532 1607676 1924820 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) was: The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 'a': ['AA%d' % i for i in range(batches)], 't': [list(range(0, 180 * 60, 5))] * batches, 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))), 'u': ['t'] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. 
df = pq.read_table(path).to_pandas() df.to_json(orient='records') print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 785560 1065460 1383532 1607676 1924820 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) > Memory leak when read_table().to_pandas().to_json() > --- > > Key: ARROW-5302 > URL: https://issues.apache.org/jira/browse/ARROW-5302 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0 > Environment: Linux, Python 3.6.4 :: Anaconda, Inc. >Reporter: Jorge >Priority: Major > Labels: memory-leak > > The following piece of code (running on a Linux, Python 3.6 from anaconda) > demonstrates a memory leak when reading data from disk. > {code:java} > import resource > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > # some random data, some of them as array columns > path = 'data.parquet' > batches = 5000 > df = pd.DataFrame({ > 'a': ['AA%d' % i for i in range(batches)], > 't': [list(range(0, 180 * 60, 5))] * batches, > 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // > 5))), > 'u': ['t'] * batches, > }) > pq.write_table(pa.Table.from_pandas(df), path) > # read the data above and convert it to json (e.g. the backend of a restful > API) > for i in range(100): > # comment any of the 2 lines for the leak to vanish. > df = pq.read_table(path).to_pandas() > df.to_json() > print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) > {code} > Result : > {code:java} > 785560 > 1065460 > 1383532 > 1607676 > 1924820 > ...{code} > Relevant pip freeze: > pyarrow (0.13.0) > pandas (0.24.2) > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5302) Memory leak when read_table().to_pandas().to_json(orient='records')
Jorge created ARROW-5302: Summary: Memory leak when read_table().to_pandas().to_json(orient='records') Key: ARROW-5302 URL: https://issues.apache.org/jira/browse/ARROW-5302 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.13.0 Environment: Linux, Python 3.6.4 :: Anaconda, Inc. Reporter: Jorge The following piece of code (running on a Linux, Python 3.6 from anaconda) demonstrates a memory leak when reading data from disk. {code:java} import resource import pandas as pd import pyarrow as pa import pyarrow.parquet as pq # some random data, some of them as array columns path = 'data.parquet' batches = 5000 df = pd.DataFrame({ 'a': ['AA%d' % i for i in range(batches)], 't': [list(range(0, 180 * 60, 5))] * batches, 'v': list(pd.np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))), 'u': ['t'] * batches, }) pq.write_table(pa.Table.from_pandas(df), path) # read the data above and convert it to json (e.g. the backend of a restful API) for i in range(100): # comment any of the 2 lines for the leak to vanish. df = pq.read_table(path).to_pandas() df.to_json(orient='records') print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss) {code} Result : {code:java} 785560 1065460 1383532 1607676 1924820 ...{code} Relevant pip freeze: pyarrow (0.13.0) pandas (0.24.2) -- This message was sent by Atlassian JIRA (v7.6.3#76005)