Re: [VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Jorge Cardoso Leitão
+1 Thanks a lot, Andrew! On Wed, May 12, 2021 at 2:04 AM Sutou Kouhei wrote: > +1 > > In > "[VOTE] [RUST] New release process for arrow-rs" on Tue, 11 May 2021 > 18:16:14 -0400, > Andrew Lamb wrote: > > > Per previous discussions, I would like to propose a new release process > for > > ar

Re: [VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Sutou Kouhei
+1 In "[VOTE] [RUST] New release process for arrow-rs" on Tue, 11 May 2021 18:16:14 -0400, Andrew Lamb wrote: > Per previous discussions, I would like to propose a new release process for > arrow-rs, releasing officially to crates.io every 2 weeks in addition to > the quarterly release of

Re: [VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Andy Grove
+1 (binding) Thanks for driving this, Andrew. The proposal looks great. On Tue, May 11, 2021 at 4:18 PM Adam Lippai wrote: > +1 (non-binding) > > Best regards, > Adam Lippai > > On Wed, May 12, 2021 at 12:16 AM Andrew Lamb wrote: > > > Per previous discussions, I would like to propose a new re

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Andrew Lamb
I see no reason each system that uses Arrow can't add their own notion of sortedness (and potentially distribution, as mentioned by Julian), but given how common the notion was I felt having some sort of standard way to encode the information might make it more useful to the broader Arrow ecosystem

Re: [VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Adam Lippai
+1 (non-binding) Best regards, Adam Lippai On Wed, May 12, 2021 at 12:16 AM Andrew Lamb wrote: > Per previous discussions, I would like to propose a new release process for > arrow-rs, releasing officially to crates.io every 2 weeks in addition to > the quarterly release of the other releases.

[VOTE] [RUST] New release process for arrow-rs

2021-05-11 Thread Andrew Lamb
Per previous discussions, I would like to propose a new release process for arrow-rs, releasing officially to crates.io every 2 weeks in addition to the quarterly release of the other releases. The proposal is available as [1], based on previous discussions [2][3] on the mailing list and comments

Re: [C++] Deciding between "compute function" and "utility function"

2021-05-11 Thread Eduardo Ponce
This is a very good question. I agree with @Antoine and would like to add that the focus of compute functions is to have a public API while utility functions are for internal use. Operations similar to ARROW-12739 include structural transformations [1] such as "list_flatten" [2], which makes use of a

Re: [C++] Deciding between "compute function" and "utility function"

2021-05-11 Thread Antoine Pitrou
On 11/05/2021 at 22:10, Weston Pace wrote: How does one decide between "utility function" and "compute function"? For example, https://issues.apache.org/jira/browse/ARROW-12739 is very similar to StructArray::Make which is implemented as a static function. However, 12739 would require poo

[C++] Deciding between "compute function" and "utility function"

2021-05-11 Thread Weston Pace
How does one decide between "utility function" and "compute function"? For example, https://issues.apache.org/jira/browse/ARROW-12739 is very similar to StructArray::Make which is implemented as a static function. However, 12739 would require pool allocation (to concatenate the list items into o

Request for a patch release of arrow 4.x

2021-05-11 Thread Prem Sagar Gali
Hi Arrow Devs, I'm a maintainer from a project called cuDF (https://github.com/rapidsai/cudf.git), which is based on the Arrow columnar format and depends on the Arrow C++ and Python libraries. We are currently pinned to `1.0.1` and when we previously tried upgrading to `3.0.0` we ran into and f

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Adam Hooper
Beware with collations: Collation order is not fixed. As per TR10: > Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language t

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Micah Kornfield
I think in general statistics/sortedness would be useful to have in the Arrow spec (it has come up in the past I think most recently around Min/Max). A few thoughts: 1. We've previously hesitated to specify sort order for different types, we would need to account for that in any formalization. 2

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Julian Hyde
Note that Calcite’s Statistic interface is heavily simplified, designed to be really simple for people to implement when they write their first table adapter. There are more advanced forms of metadata, such as RelMdDistribution [1] and Collation [2]. Since Arrow data sets will typically consist

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Andy Grove
TableProvider has a statistics method already. The approach that Calcite takes is to include sort order as part of statistics [1], so that could be one approach to consider. We may also want to add a method to LogicalPlan for returning the sort order (or statistics) for a particular operator. [1]

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Jorge Cardoso Leitão
So, I think that both cases can be accomplished within DataFusion itself: * When the data is sorted at rest, we can add a method to the TableProvider to share this information with the query engine, like we do with partitioning. * When the data is sorted via some physical node / operation during t

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Andrew Lamb
I was imagining something known at query planning time (e.g., if the data you are reading in from a Parquet file is already sorted by `time` and the query calls for sorting by time, the sort can be omitted). In this case, I was thinking "how would we communicate this information to DataFusion from a
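
To make the planning-time idea concrete, here is a minimal sketch of the check a planner might run before adding a sort. It is an illustration only and not DataFusion's API: `SortKey`, `satisfies`, and `needs_sort` are hypothetical names, and the rule assumed here is that an input already sorted by some list of keys satisfies any required ordering that is a prefix of that list.

```rust
/// Hypothetical description of one sort key (not an Arrow or DataFusion type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

impl SortKey {
    /// Convenience constructor: ascending order, nulls last.
    pub fn asc(column: &str) -> Self {
        SortKey {
            column: column.to_string(),
            descending: false,
            nulls_first: false,
        }
    }
}

/// True when data known to be sorted by `known` is also sorted by `required`,
/// which holds when `required` is a prefix of `known`.
pub fn satisfies(known: &[SortKey], required: &[SortKey]) -> bool {
    required.len() <= known.len()
        && known.iter().zip(required).all(|(k, r)| k == r)
}

/// Decides whether a sort step is still needed.
/// `known` is `None` when the source reports no sort order at all.
pub fn needs_sort(known: Option<&[SortKey]>, required: &[SortKey]) -> bool {
    match known {
        Some(k) => !satisfies(k, required),
        None => !required.is_empty(),
    }
}
```

With a source reporting `[time, host]` as its order, `needs_sort` returns false for a query ordered by `time` and true for one ordered by `host` alone. Descending order and null placement are part of the key comparison, so a mismatch in either still forces a sort.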

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Andy Grove
I had been planning on adding a method to DataFusion's execution plan to indicate the sort-order of the results (if known), similar to how we currently have information about output partitioning. Would this cover your requirement or are you looking for something outside the context of execution pl

[Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Andrew Lamb
We are building a system that will likely make heavy use of sorted data, and we are trying to figure out how to encode the metadata of "how is this data sorted". We can certainly use our own custom metadata fields, but wanted to check for prior art and gauge community interest in adding something t

[NIGHTLY] Arrow Build Report for Job nightly-2021-05-11-0

2021-05-11 Thread Crossbow
Arrow Build Report for Job nightly-2021-05-11-0 All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-11-0 Failed Tasks: - conda-linux-gcc-py36-arm64: URL: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-11-0-azure-conda-linux-g

Re: [Rust] V2 Proposal for bi-weekly Rust Arrow Releases

2021-05-11 Thread Andrew Lamb
An update here: I think we have reached consensus with the proposal -- though if anyone else has comments, it is not too late! There is one concern about the feasibility of the proposed automation, so I am working on a proof-of-concept showing how it could work. Once that is done, I will formally