Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-21 Thread Felipe Oliveira Carvalho
I marked the C++ implementation PR ready for review today and will soon be working on the Go implementation. https://github.com/apache/arrow/pull/35345 Note that, unlike Velox's ArrayVector, the Arrow implementation (ListView) also features a 64-bit version (LargeListView) to be

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-21 Thread Andrew Lamb
I am convinced that the benefits of the ArrayViews as discussed on this thread, despite the inconvenience of two similar formats (ListArray and ArrayView) and equivalent formats, are enough to add it to the Arrow spec. It is my opinion we should add ArrayView to the Arrow format, and really push

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-17 Thread Pedro Eugenio Rocha Pedreira
Hi all, Getting back to this thread as I realize there were a few unanswered questions. Adding a bit more context on the rationale and usage of ArrayViews in Velox, and the importance of standardizing it: re: Why do we need it? We use ArrayViews for two main reasons. First, for efficient

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Jacob Wujciak-Jens
> Even if ListView is rarely used for interoperability (if it never gains wide adoption), some of the arrow implementations could use ListView to offer faster computation kernels, which I think has real value. This is an important point, thanks for the clear phrasing Andrew! On Thu, Jun 15, 2023

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Felipe Oliveira Carvalho
On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies wrote: > Even something relatively straightforward becomes a huge implementation effort when multiplied by a large number of codebases, users and datasets. Parquet is a great source of historical examples of the challenges of

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Andrew Lamb
> I think Sasha brings up a good point, that the advantages of this

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
> transferring this type without conversion would be ideal. One use case I can think of is if Velox

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
> compressed data directly without an expansion step whenever possible. This is why having it as part of the open Arrow format is so important:

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
---------- From: Felipe Oliveira Carvalho Sent: Friday, May 19, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format +pedroerp On Thu,

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
eraged just yet. [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf Best, -- Pedro Pedreira ____________ From: Felipe Oliveira Carvalho Sent: Friday, May 19, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Andrew Lamb
> w variants for all three? Best, Will Jones

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
> can be a provisory step for compatibility between systems that don’t understand the

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
> important: everyone can agree on a format that’s friendly to parallel and/or vectorized compute kernels without introducing multiple incompatible

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Antoine Pitrou
, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format +pedroerp On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies <taylordavies@googlemail.com.invalid> wro

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Raphael Taylor-Davies
Hi All, I might be missing something, but rather than opening the can of worms of alternative layouts, etc... perhaps we could support this use-case as a canonical extension type over dictionary encoded, variable-sized arrays. I'll try to explain my reasoning below, but the major advantage

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-13 Thread Will Jones
> compression schemes like run-end encoding — the goal is processing the compressed data

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> I don't feel like this representation is necessarily a detail of the query

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook
> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <krassovskysa...@gmail.com> wrote:

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Felipe Oliveira Carvalho
> like it would be very cheap (though I understand not necessarily the other way around, but you’d need

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
> points, and performing a conversion from the non-view to view format seems like it would be very cheap (though I understand not necessarily the

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook
> defined our only tensor extension type to be built on a fixed size list. If a use case of this might be manipulating tensors with zero

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-27 Thread Micah Kornfield
> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira wrote: > Hi all,

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Weston Pace
> out-of-order, regardless of their types or encodings. This is naturally doable for all primitive types (fixed-size), but not for types that don’t

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Will Jones
> generate a bitmap containing which rows take the THEN and which take the ELSE branch. Then you populate all rows that match the first branch by evaluating the THEN expr
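The IF/ELSE evaluation quoted above can be illustrated with a minimal sketch. It assumes a simplified in-memory model (plain Python lists for the offsets, sizes, and values buffers, no validity bitmap), and the function name `if_else_list_view` and its arguments are hypothetical, not an Arrow or Velox API. For simplicity both branches are precomputed for every row, whereas a real engine would evaluate each expression only on the rows that take that branch.

```python
from typing import List, Sequence, Tuple


def if_else_list_view(
    cond: Sequence[bool],
    then_rows: Sequence[Sequence[int]],
    else_rows: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Populate a list-view style result branch by branch.

    Child values are appended in branch order (all THEN rows first, then all
    ELSE rows), while offsets and sizes are written at each row's logical
    position, so the offsets need not be monotonic.
    """
    n = len(cond)
    if len(then_rows) != n or len(else_rows) != n:
        raise ValueError("cond, then_rows and else_rows must have equal length")
    offsets = [0] * n
    sizes = [0] * n
    values: List[int] = []
    for branch_flag, rows in ((True, then_rows), (False, else_rows)):
        for i in range(n):
            if cond[i] == branch_flag:
                offsets[i] = len(values)  # where this row's values start
                sizes[i] = len(rows[i])
                values.extend(rows[i])
    return offsets, sizes, values


cond = [True, False, True]
then_rows = [[1, 2], [9], [3]]
else_rows = [[7], [8, 8], [6]]
offsets, sizes, values = if_else_list_view(cond, then_rows, else_rows)
# Row 1 (ELSE) is written after row 2 (THEN), so offsets are [0, 3, 2].
```

With a plain List layout the same result would require either a per-row branch or a merge of two intermediate arrays, because offsets there must be nondecreasing.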

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Andrew Lamb
> out-of-order, you would either have a big branch per row dispatching to the right expression (slow), or populate two distinct vectors then merging them

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Antoine Pitrou
___ From: Felipe Oliveira Carvalho Sent: Friday, May 19, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format +pedroerp On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davi

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
> flexibility to implement cardinality increasing/reducing operations, but we don’t use it for that purpose. Operations like filtering, joining, unnesting and similar are done by wrapping th

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Will Jones
> data types with any encoding. There are more details on Section 4.2.1 in [1]. Beyond this, it also gives function/kernel developers more flexibility to implement operations that manipulate Arrays/Maps. For

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Felipe Oliveira Carvalho
> of substr(), trim(), and similar). One nice last property is that this layout allows for overlapping ranges. This is something discussed with our ML people to allow deduping feature values in a tensor (which is fairly common), but not some
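The overlapping-ranges property mentioned above can be sketched as follows. This is an illustrative assumption about how deduplication might be done (here only exact duplicates share a range), using plain Python lists in place of Arrow buffers; `build_deduped_list_view` is a hypothetical helper, not an existing API.

```python
from typing import Dict, List, Sequence, Tuple


def build_deduped_list_view(
    rows: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Build offsets/sizes/values where identical rows share one value range."""
    values: List[int] = []
    offsets: List[int] = []
    sizes: List[int] = []
    first_seen: Dict[Tuple[int, ...], int] = {}
    for row in rows:
        key = tuple(row)
        if key not in first_seen:
            # First occurrence: store the child values once.
            first_seen[key] = len(values)
            values.extend(row)
        # Every occurrence points at the same stored range.
        offsets.append(first_seen[key])
        sizes.append(len(row))
    return offsets, sizes, values


rows = [[1, 2], [3], [1, 2], [3]]
dedup_offsets, dedup_sizes, dedup_values = build_deduped_list_view(rows)
```

Four logical rows are represented with only three child values, which a List layout with strictly consecutive ranges cannot express without copying.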

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Aldrin
> tensor (which is fairly common), but not something we have leveraged just yet. [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf Best, -- Pedro Pedreira From: Felipe Oliveira Carvalho

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Sasha Krassovsky
> this layout allows for overlapping ranges. This is something discussed with our ML people to allow deduping feature values in a tensor (which is fairly common), but not something we have leveraged just yet. [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf Best,

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Will Jones
> reira.pdf Best, -- Pedro Pedreira ________ From: Felipe Oliveira Carvalho Sent: Friday, May 19, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting the draft impl

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Pedro Eugenio Rocha Pedreira
] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf Best, -- Pedro Pedreira From: Felipe Oliveira Carvalho Sent: Friday, May 19, 2023 10:01 AM To: dev@arrow.apache.org Cc: Pedro Eugenio Rocha Pedreira Subject: Re: [DISCUSS][Format] Starting the draft implementation

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Felipe Oliveira Carvalho
+pedroerp On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies wrote: > Hi All, > > if we added this, do we think many Arrow and query engine implementations (for example, DataFusion) will be eager to add full support for the type, including compute kernels? Or are they likely to

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Ian Cook
That's great, thanks Brent. If possible could you share a specific example of the operation you are referring to so that we can better reason about how the ListView layout would help in this case? Any additional input from the community providing specifics of real-world workloads that are

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-15 Thread Brent Gardner
For what it's worth, my company is building a database using arrow(rs) as an in memory storage format, and this feature would be very helpful because it would allow us to bitmask out mvcc rows that have been deleted / have not yet been committed / have been rolled back, etc. - Brent On Mon, May

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-15 Thread Ian Cook
I think it would be easier for us all to weigh the costs and benefits of adding this proposed ListView layout to the Arrow specification and implementing it in the various Arrow libraries if we could all see some benchmarks demonstrating the performance/efficiency benefits compared to Arrow’s

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-13 Thread Andrew Lamb
I agree that it is hard to see any compelling advantage of adopting ListView that would incentivize adding it to DataFusion. It also seems like the conversion requires changing only indexes (not the underlying data), so it would likely be relatively inexpensive, I would think. On Thu, May 11, 2023
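The point that conversion touches only the index buffers can be illustrated with a small sketch. It assumes the usual List convention of n+1 offsets for n rows and models buffers as plain Python lists; `list_to_list_view` is a hypothetical name, not a function from any Arrow library.

```python
from typing import List, Sequence, Tuple


def list_to_list_view(list_offsets: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Derive list-view offsets and sizes from List offsets (length n + 1).

    The child values buffer is untouched and can be shared as-is.
    """
    if len(list_offsets) == 0:
        raise ValueError("a List offsets buffer has at least one entry")
    view_offsets = list(list_offsets[:-1])
    sizes = [end - start for start, end in zip(list_offsets, list_offsets[1:])]
    if any(size < 0 for size in sizes):
        raise ValueError("List offsets must be nondecreasing")
    return view_offsets, sizes


# Three rows: [v0, v1], [], [v2, v3, v4]
view_offsets, view_sizes = list_to_list_view([0, 2, 2, 5])
```

The opposite direction is not always this cheap: if the view's ranges are out of order or overlapping, the child values have to be rewritten to obtain nondecreasing offsets.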

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Raphael Taylor-Davies
Hi All, if we added this, do we think many Arrow and query engine implementations (for example, DataFusion) will be eager to add full support for the type, including compute kernels? Or are they likely to just convert this type to ListArray at import boundaries? I can't speak for query engines

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Will Jones
Hi Felipe, Thanks for the additional details. > Velox kernels benefit from being able to append data to the array from different threads without care for strict ordering. Only the offsets array has to be written according to logical order but that is potentially a much smaller buffer than

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Felipe Oliveira Carvalho
The initial reason for ListView arrays in Arrow is zero-copy compatibility with Velox, which uses this format. Velox kernels benefit from being able to append data to the array from different threads without care for strict ordering. Only the offsets array has to be written according to logical order

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-27 Thread Andrew Lamb
My apologies, I did not see the thread [1] for some reason [1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb On Thu, Apr 27, 2023 at 10:32 AM Andrew Lamb wrote: > Felipe, thank you for bringing this up. Another approach that is sometimes used in database engines (like

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-27 Thread Andrew Lamb
Felipe, thank you for bringing this up. Another approach that is sometimes used in database engines (like DuckDB), often called selection vectors, is to store another bitmask that says which elements in the array should be "selected" and which are ignored; it functions like a view. For
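A rough sketch of the selection idea described above, under the assumption that the mask is modeled as a list of booleans rather than a packed bitmap; `apply_selected` is a hypothetical helper and does not reflect how DuckDB or any Arrow library implements it.

```python
from typing import Callable, List, Sequence


def apply_selected(
    values: Sequence[int],
    selection: Sequence[bool],
    fn: Callable[[int], int],
) -> List[int]:
    """Apply fn only to the elements whose selection flag is set.

    The underlying values are never copied or compacted; the mask alone
    decides which elements a kernel sees.
    """
    if len(values) != len(selection):
        raise ValueError("values and selection must have equal length")
    return [fn(v) for v, keep in zip(values, selection) if keep]


values = [5, 6, 7, 8]
selection = [True, False, True, True]
doubled = apply_selected(values, selection, lambda v: v * 2)
```

A filter then only needs to produce a new mask, leaving the data buffers shared between the input and the filtered view.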

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Micah Kornfield
Small bikeshed: But to keep naming consistent "ViewList"? On Wed, Apr 26, 2023 at 8:02 AM Weston Pace wrote: > > My understanding is that the primary benefit of this ListView layout over Arrow's existing List layouts [1] is that ListView allows for buffer alignment [2] without padding,

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Weston Pace
> My understanding is that the primary benefit of this ListView layout over Arrow's existing List layouts [1] is that ListView allows for buffer alignment [2] without padding, which makes vectorized processing much more efficient. Is this understanding correct? Yes. Though proponents of

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Felipe Oliveira Carvalho
After Weston's suggestion above, I've renamed files and classes in my WIP implementation: ArrayView -> ListView On Wed, Apr 26, 2023 at 11:08 AM Ian Cook wrote: > +1 to what Weston and Joris suggested regarding the name. "ListView" seems like the best name to use for this layout in Arrow.

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Ian Cook
+1 to what Weston and Joris suggested regarding the name. "ListView" seems like the best name to use for this layout in Arrow. My understanding is that the primary benefit of this ListView layout over Arrow's existing List layouts [1] is that ListView allows for buffer alignment [2] without

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Joris Van den Bossche
On Wed, 26 Apr 2023 at 02:37, Weston Pace wrote: > For context, there was some discussion on this back in [1]. At that time this was called "sequence view" but I do not like that name. However, array-view array is a little confusing. Given this is similar to list can we go with

RE: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread wish maple
I think the ArrayVector can have the following benefits: 1. Converting a Batch in Velox or another system to an Arrow array could be much more lightweight. 2. Modifying, filtering, and copying an array or string could be much more lightweight. Velox can make a Vector mutable; it seems that an Arrow array cannot. Seems it

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
I suppose one common use case is materializing list columns after some expanding operation like a join or unnest. That's a case where I could imagine a lot of repetition of values. Haven't yet thought of common cases where there is overlap but not full duplication, but am eager to hear any. The

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Raphael Taylor-Davies
Unless I am missing something, I think the selection use-case could be equally well served by a dictionary-encoded BinaryArray/ListArray, and would have the benefit of not requiring any modifications to the existing format or kernels. The major additional flexibility of the proposed encoding

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread David Li
Is there a need for a 64-bit offsets version the same way we have List and LargeList? And just to be clear, the difference with List is that the lists don't have to be stored in their logical order (or in other words, offsets do not have to be nondecreasing and so we also need sizes)? On Wed,
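To make the difference raised in this question concrete, here is a minimal sketch of reading a list-view style array. It assumes a simplified model with three plain Python lists (offsets, sizes, child values) and ignores the validity bitmap; `list_view_to_pylist` is a hypothetical helper, not part of any Arrow implementation.

```python
from typing import List, Sequence


def list_view_to_pylist(
    offsets: Sequence[int],
    sizes: Sequence[int],
    values: Sequence[int],
) -> List[List[int]]:
    """Materialize each row as values[offset : offset + size].

    Unlike a List layout, offsets may appear in any order and ranges may
    overlap, which is why a separate sizes buffer is required.
    """
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have equal length")
    rows: List[List[int]] = []
    for offset, size in zip(offsets, sizes):
        if offset < 0 or size < 0 or offset + size > len(values):
            raise ValueError("range falls outside the child values")
        rows.append(list(values[offset:offset + size]))
    return rows


child_values = [10, 11, 12, 13, 14]
lv_offsets = [3, 0, 1]   # not sorted: row 0 is stored after row 1
lv_sizes = [2, 3, 2]     # row 2 overlaps row 1
lv_rows = list_view_to_pylist(lv_offsets, lv_sizes, child_values)
```

In a List layout the end of row i is implied by the start of row i + 1, which is exactly the constraint this layout drops.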

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Weston Pace
For context, there was some discussion on this back in [1]. At that time this was called "sequence view" but I do not like that name. However, array-view array is a little confusing. Given this is similar to list can we go with list-view array? > Thanks for the introduction. I'd be interested

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Will Jones
Hi Felipe, Thanks for the introduction. I'd be interested to hear about the applications Velox has found for these vectors, and in what situations they are useful. This could be contrasted with the current ListArray implementations. IIUC it would be fairly cheap to transform a ListArray to an

[DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Felipe Oliveira Carvalho
Hi folks, I would like to start a public discussion on the inclusion of a new array format to Arrow — array-view array. The name is also up for debate. This format is inspired by Velox's ArrayVector format [1]. Logically, this array represents an array of arrays. Each element is an array-view