I marked the C++ implementation PR ready for review today and will soon be
working on the Go implementation.
https://github.com/apache/arrow/pull/35345
Note that, unlike Velox's ArrayVector, the Arrow implementation
(ListView) also features a 64-bit version (LargeListView) to be
I am convinced that the benefits of the ArrayViews as discussed on this
thread, despite the inconvenience of two similar formats (ListArray and
ArrayView) and equivalent formats, are enough to add it to the Arrow spec.
It is my opinion we should add ArrayView to the Arrow format, and really
push
Hi all,
Getting back to this thread as I realize there were a few unanswered questions.
Adding a bit more context on the rationale and usage of ArrayViews in Velox,
and the importance to standardize it:
re: Why do we need it?
We use ArrayViews for two main reasons. First, for efficient
> Even if ListView is rarely used for interoperability (if it never gains
wide adoption), some of the arrow implementations could use ListView to
offer faster computation kernels, which I think has real value
This is an important point, thanks for the clear phrasing Andrew!
On Thu, Jun 15, 2023
On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies
wrote:
> Even something relatively straightforward becomes a huge implementation
> effort when multiplied by a large number of codebases, users and
> datasets. Parquet is a great source of historical examples of the
> challenges of
>>>>>
> > > > > >>>>>>>>> I think Sasha brings up a good point, that the advantages
> > of
> > > > > >> this
> > > > > >>>
transferring this type
> > > > >> without
> > > > >>>>>>>>> conversion would be ideal. One use case I can think of is
> if
> > > > >> Velox
> > > > >>>>>>
compressed data
> >> directly without an expansion step whenever possible. This is why
> >> having it as part of the open Arrow format is so important:
----------
From: Felipe Oliveira Carvalho
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org
Cc: Pedro Eugenio Rocha Pedreira
Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
ArrayView array format
+pedroerp
On Thu,
eraged just yet.
[1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
Best,
--
Pedro Pedreira
____________
From: Felipe Oliveira Carvalho
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org
Cc: Pedro Eugenio Rocha Pedreira
Subject: Re: [DISCUSS][Format] Starting
w variants for all three?
> > > >>>>>>>>>
> > > >>>>>>>>> Best,
> > > >>>>>>>>>
> > > >>>>>>>>> Will Jones
> > > >>>>>>>>
> > >> can
> > >>>>> be a
> > >>>>>>>>>> provisory step for compatibility between systems that don’t
> > >>>>>> understand
> > >>>>>>> the
> > >>>>>>>>
>>>>> important:
> >>>>>>>>>> everyone can agree on a format that’s friendly to parallel
> >> and/or
> >>>>>>>>>> vectorized compute kernels without introducing multiple
> >>>>> incompatible
> >>>>>>>
, 2023 10:01 AM
To: dev@arrow.apache.org
Cc: Pedro Eugenio Rocha Pedreira
Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
ArrayView array format
+pedroerp
On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
taylordavies@googlemail.com.invalid> wro
Hi All,
I might be missing something, but rather than opening the can of worms
of alternative layouts, etc... perhaps we could support this use-case as
a canonical extension type over dictionary encoded, variable-sized
arrays. I'll try to explain my reasoning below, but the major advantage
> > compression
> > > > > > > schemes
> > > > > > > > > like
> > > > > > > > > >>> run-end encoding — the goal is processing the
> compressed
> > data
> > > > > > > >
> > > > > > > > >>>
> > > > > > > > >>>> I don't feel like this representation is necessarily a
> > > detail of
> > > > > > the
> > > > > > > > >>> query
> > >
>>>> On Sat, May 20, 2023 at 15:00, Sasha Krassovsky <
> > > > > > > >>> krassovskysa...@gmail.com
> > > > > > > >>>>
> > >
> > > > > > > wrote:
> > > > > > &
gt; > > > >>>> like it would be very cheap (though I understand not
> necessarily
> > > > the
> > > > > > >>> other
> > > > > > >>>> way around, but you’d need
> >>>> points, and performing a conversion from the non-view to
> view
> > > > format
> > > > > > >>> seems
> > > > > > >>>> like it would be very cheap (though I understand not
> necessarily
> > > > the
> > > > > > >>&
> defined
> > > > > >>>>> our only tensor extension type to be built on a fixed size
> > list.
> > > > If a
> > > > > >>> use
> > > > > >>>>> case of this might be manipulating tensors with zero
>
> > > > >>>>>> On Fri, May 19, 2023 at 1:59 PM Pedro Eugenio Rocha Pedreira
> > > > >>>>>> wrote:
> > > > >>>>>>
> > > > >>>>>> Hi all,
> > > > >>>>>>
> >
>>> out-of-order, regardless of their types or encodings. This is
> > > >>> naturally
> > > >>>>>> doable for all primitive types (fixed-size), but not for types
> > that
> > > >>>> don’t
> > > >
>> generate a bitmap containing which rows take the THEN and which
> take
> > >>> the
> > >>>>>> ELSE branch. Then you populate all rows that match the first
> branch
> > >>> by
> > >>>>>> evaluating the THEN expr
>> out-of-order, you would either have a big branch per row dispatching
> >>> to
> >>>> the
> >>>>>> right expression (slow), or populate two distinct vectors then
> >>> merging
> >>>> them
> >>
___
From: Felipe Oliveira Carvalho
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org
Cc: Pedro Eugenio Rocha Pedreira
Subject: Re: [DISCUSS][Format] Starting the draft implementation of the
ArrayView array format
+pedroerp
On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davi
>> flexibility
>> > to
>> > >> implement cardinality increasing/reducing operations, but we don’t
>> use
>> > it
>> > >> for that purpose. Operations like filtering, joining, unnesting and
>> > similar
>> > >> are done by wrapping th
data types with any encoding.
> > There
> > >> are more details on Section 4.2.1 in [1]
> > >>
> > >> Beyond this, it also gives function/kernel developers more flexibility
> > to
> > >> implement operations that manipulate Arrays/Maps. For
of substr(), trim(), and similar). One nice last property is that
> >> this layout allows for overlapping ranges. This is something discussed
> with
> >> our ML people to allow deduping feature values in a tensor (which is
> fairly
> >> common), but not some
tensor (which is fairly
>> common), but not something we have leveraged just yet.
>>
>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>>
>> Best,
>> --
>> Pedro Pedreira
>>
>> From: Felipe Oliveira Carvalho
>> this layout allows for overlapping ranges. This is something discussed with
>> our ML people to allow deduping feature values in a tensor (which is fairly
>> common), but not something we have leveraged just yet.
>>
>> [1] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>>
>> Best,
reira.pdf
>
> Best,
> --
> Pedro Pedreira
> ________
> From: Felipe Oliveira Carvalho
> Sent: Friday, May 19, 2023 10:01 AM
> To: dev@arrow.apache.org
> Cc: Pedro Eugenio Rocha Pedreira
> Subject: Re: [DISCUSS][Format] Starting the draft impl
] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
Best,
--
Pedro Pedreira
From: Felipe Oliveira Carvalho
Sent: Friday, May 19, 2023 10:01 AM
To: dev@arrow.apache.org
Cc: Pedro Eugenio Rocha Pedreira
Subject: Re: [DISCUSS][Format] Starting the draft implementation
+pedroerp
On Thu, 11 May 2023 at 17:51 Raphael Taylor-Davies
wrote:
> Hi All,
>
> > if we added this, do we think many Arrow and query
> > engine implementations (for example, DataFusion) will be eager to add
> full
> > support for the type, including compute kernels? Or are they likely to
>
That's great, thanks Brent. If possible could you share a specific
example of the operation you are referring to so that we can better
reason about how the ListView layout would help in this case?
Any additional input from the community providing specifics of
real-world workloads that are
For what it's worth, my company is building a database using arrow(rs) as
an in memory storage format, and this feature would be very helpful because
it would allow us to bitmask out mvcc rows that have been deleted / have
not yet been committed / have been rolled back, etc.
- Brent
On Mon, May
I think it would be easier for us all to weigh the costs and benefits
of adding this proposed ListView layout to the Arrow specification and
implementing it in the various Arrow libraries if we could all see
some benchmarks demonstrating the performance/efficiency benefits
compared to Arrow’s
I agree that it is hard to see any compelling advantage of adopting
ListView that would incentivize adding it to DataFusion.
It also seems like the conversion requires changing only indexes (not the
underlying data), so it would likely be relatively inexpensive, I would think
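To make the "only indexes change" point concrete, here is a minimal sketch in plain Python (not the Arrow API). The function name `list_to_list_view` is hypothetical, and the sketch assumes no null entries and a zero-based offsets buffer. It derives ListView-style offsets and sizes from List-style offsets; the values buffer is never read or copied.

```python
from typing import List, Sequence, Tuple


def list_to_list_view(list_offsets: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Derive (offsets, sizes) for a ListView-style layout.

    `list_offsets` has n + 1 entries for n lists, as in a List layout.
    Only index buffers are produced; the child values are left untouched.
    """
    # The start of list i is simply the i-th List offset.
    offsets = list(list_offsets[:-1])
    # The size of list i is the gap between consecutive List offsets.
    sizes = [end - start for start, end in zip(list_offsets, list_offsets[1:])]
    return offsets, sizes


# Logical lists [[10, 20], [30], [40, 50]] stored over one values buffer.
view_offsets, view_sizes = list_to_list_view([0, 2, 3, 5])
print(view_offsets, view_sizes)
```

The reverse direction is not covered here: when the view's ranges are out of order or overlap, going back to a List layout would presumably need the values to be rewritten.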
On Thu, May 11, 2023
Hi All,
if we added this, do we think many Arrow and query
engine implementations (for example, DataFusion) will be eager to add full
support for the type, including compute kernels? Or are they likely to just
convert this type to ListArray at import boundaries?
I can't speak for query engines
Hi Felipe,
Thanks for the additional details.
> Velox kernels benefit from being able to append data to the array from
> different threads without care for strict ordering. Only the offsets array
> has to be written according to logical order but that is potentially a much
> smaller buffer than
Initial reason for ListView arrays in Arrow is zero-copy compatibility with
Velox which uses this format.
Velox kernels benefit from being able to append data to the array from
different threads without care for strict ordering. Only the offsets array
has to be written according to logical order
My apologies, I did not see the thread [1] for some reason
[1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
On Thu, Apr 27, 2023 at 10:32 AM Andrew Lamb wrote:
> Felipe, thank you for bringing this up.
>
> Another approach that is sometimes used in database engines (like
Felipe, thank you for bringing this up.
Another approach that is sometimes used in database engines (like DuckDB)
and is often called selection vectors, is to store another bitmask that
says which elements in the array should be "selected" and which are ignored
and functions like a view.
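As a rough illustration of the selection-vector idea described above, the following plain-Python sketch keeps the values untouched and carries a separate boolean mask. The helper names `combine_selections` and `materialize` are made up for this example and do not come from DuckDB or Arrow.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def combine_selections(a: Sequence[bool], b: Sequence[bool]) -> List[bool]:
    """AND two selection masks; the underlying values are never touched."""
    return [x and y for x, y in zip(a, b)]


def materialize(values: Sequence[T], selected: Sequence[bool]) -> List[T]:
    """Copy out only the selected elements, e.g. at an output boundary."""
    return [v for v, keep in zip(values, selected) if keep]


values = ["a", "b", "c", "d"]
first_filter = [True, True, False, True]
second_filter = [True, False, True, True]

# Two filters are applied by combining masks; values are copied only once,
# when the result is finally materialized.
both = combine_selections(first_filter, second_filter)
print(materialize(values, both))
```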
For
Small bikeshed: But to keep naming consistent "ViewList"?
On Wed, Apr 26, 2023 at 8:02 AM Weston Pace wrote:
> > My understanding is that the primary benefit of this ListView layout
> > over Arrow's existing List layouts [1] is that ListView allows for
> > buffer alignment [2] without padding,
> My understanding is that the primary benefit of this ListView layout
> over Arrow's existing List layouts [1] is that ListView allows for
> buffer alignment [2] without padding, which makes vectorized
> processing much more efficient. Is this understanding correct?
Yes. Though proponents of
After Weston's suggestion above, I've renamed files and classes in my WIP
implementation:
ArrayView -> ListView
On Wed, Apr 26, 2023 at 11:08 AM Ian Cook wrote:
> +1 to what Weston and Joris suggested regarding the name. "ListView"
> seems like the best name to use for this layout in Arrow.
>
+1 to what Weston and Joris suggested regarding the name. "ListView"
seems like the best name to use for this layout in Arrow.
My understanding is that the primary benefit of this ListView layout
over Arrow's existing List layouts [1] is that ListView allows for
buffer alignment [2] without
On Wed, 26 Apr 2023 at 02:37, Weston Pace wrote:
>
> For context, there was some discussion on this back in [1]. At that time
> this was called "sequence view" but I do not like that name. However,
> array-view array is a little confusing. Given this is similar to list can
> we go with
I think the ArrayVector can have the following benefits:
1. Converting a Batch in Velox or another system to an Arrow array could be
much more lightweight.
2. Modifying, filtering, and copying arrays or strings could be much more
lightweight
Velox can make a Vector mutable; it seems that an Arrow array cannot. Seems it
I suppose one common use case is materializing list columns after some
expanding operation like a join or unnest. That's a case where I could
imagine a lot of repetition of values. Haven't yet thought of common cases
where there is overlap but not full duplication, but am eager to hear any.
The
Unless I am missing something, I think the selection use-case could be equally
well served by a dictionary-encoded BinaryArray/ListArray, and would have the
benefit of not requiring any modifications to the existing format or kernels.
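A small plain-Python sketch of that alternative, assuming a dictionary of distinct list values plus an indices array. The name `dictionary_decode` is hypothetical, and Python lists stand in for Arrow buffers. Selection, reordering, and repetition are all expressed by rewriting the indices alone.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def dictionary_decode(indices: Sequence[int], dictionary: Sequence[T]) -> List[T]:
    """Expand a dictionary-encoded column into its logical values."""
    return [dictionary[i] for i in indices]


# Each distinct list is stored once in the dictionary.
dictionary = [[1, 2], [3], [4, 5, 6]]

# The indices pick, reorder, and repeat entries without copying list contents.
indices = [2, 0, 0, 1]
print(dictionary_decode(indices, dictionary))
```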
The major additional flexibility of the proposed encoding
Is there a need for a 64-bit offsets version the same way we have List and
LargeList?
And just to be clear, the difference with List is that the lists don't have to
be stored in their logical order (or in other words, offsets do not have to be
nondecreasing and so we also need sizes)?
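The difference asked about above can be sketched in plain Python (not the Arrow API; the accessor names are hypothetical and nulls are ignored). A List layout reads element i from two consecutive offsets, while a ListView-style layout reads it from an offset and a size, so the offsets may appear in any order.

```python
from typing import List, Sequence


def list_get(offsets: Sequence[int], values: Sequence[int], i: int) -> List[int]:
    """List layout: element i spans values[offsets[i]:offsets[i + 1]]."""
    return list(values[offsets[i]:offsets[i + 1]])


def list_view_get(
    offsets: Sequence[int], sizes: Sequence[int], values: Sequence[int], i: int
) -> List[int]:
    """ListView-style layout: element i spans sizes[i] values from offsets[i]."""
    return list(values[offsets[i]:offsets[i] + sizes[i]])


# List layout: offsets must be nondecreasing, values sit in logical order.
list_values = [10, 20, 30, 40, 50]
list_offsets = [0, 2, 3, 5]

# ListView-style layout: the same logical lists, with the values stored in a
# different physical order, so the offsets are decreasing here.
view_values = [40, 50, 30, 10, 20]
view_offsets = [3, 2, 0]
view_sizes = [2, 1, 2]

as_list = [list_get(list_offsets, list_values, i) for i in range(3)]
as_view = [list_view_get(view_offsets, view_sizes, view_values, i) for i in range(3)]
print(as_list == as_view, as_view)
```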
On Wed,
For context, there was some discussion on this back in [1]. At that time
this was called "sequence view" but I do not like that name. However,
array-view array is a little confusing. Given this is similar to list can
we go with list-view array?
> Thanks for the introduction. I'd be interested
Hi Felipe,
Thanks for the introduction. I'd be interested to hear about the
applications Velox has found for these vectors, and in what situations they
are useful. This could be contrasted with the current ListArray
implementations.
IIUC it would be fairly cheap to transform a ListArray to an
Hi folks,
I would like to start a public discussion on the inclusion of a new array
format to Arrow — array-view array. The name is also up for debate.
This format is inspired by Velox's ArrayVector format [1]. Logically, this
array represents an array of arrays. Each element is an array-view