Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Brent Gardner Mon, 15 May 2023 14:12:46 -0700

For what it's worth, my company is building a database using arrow(rs) as
an in memory storage format, and this feature would be very helpful because
it would allow us to bitmask out mvcc rows that have been deleted / have
not yet been committed / have been rolled back, etc.


- Brent

On Mon, May 15, 2023, 06:55 Ian Cook <i...@ursacomputing.com> wrote:

> I think it would be easier for us all to weigh the costs and benefits
> of adding this proposed ListView layout to the Arrow specification and
> implementing it in the various Arrow libraries if we could all see
> some benchmarks demonstrating the performance/efficiency benefits
> compared to Arrow’s existing List layouts.
>
> Based on the Velox paper [1] and from conversations with the Velox
> developers, I would anticipate that these benchmarks will show that
> ListView confers substantial performance/efficiency benefits on some
> workloads. I suggest conferring with the Velox developers to identify
> benchmark workloads will best demonstrate the performance/efficiency
> benefit of the ListView layout while representing common real-world
> workloads.
>
> Ian
>
> [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
>
> On Sat, May 13, 2023 at 3:09 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > I agree that it is hard to see  any compelling advantage of adopting
> > ListView that would incentivize adding it to DataFusion.
> >
> > It also seems like the conversion requires changing only indexes (not the
> > underlying data) so it would likely be relatively inexpensive I would
> think
> >
> > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> > > Hi All,
> > >
> > > > if we added this, do we think many Arrow and query
> > > > engine implementations (for example, DataFusion) will be eager to add
> > > full
> > > > support for the type, including compute kernels? Or are they likely
> to
> > > just
> > > > convert this type to ListArray at import boundaries?
> > > I can't speak for query engines in general, but at least for arrow-rs
> > > and by extension DataFusion, and based on my current understanding of
> > > the use-cases I would be rather hesitant to add support to the kernels
> > > for this array type, definitely instead favouring conversion at the
> > > edges. We already have issues with the amount of code generation
> > > resulting in binary bloat and long compile times, and I worry this
> would
> > > worsen this situation whilst not really providing compelling advantages
> > > for the vast majority of workloads that don't interact with Velox.
> > > Whilst I can definitely see that the ListView representation is
> probably
> > > a better way to represent variable length lists than what arrow settled
> > > upon, I'm not yet convinced it is sufficiently better to incentivise
> > > broad ecosystem adoption.
> > >
> > > Kind Regards,
> > >
> > > Raphael Taylor-Davies
> > >
> > > On 11/05/2023 21:20, Will Jones wrote:
> > > > Hi Felipe,
> > > >
> > > > Thanks for the additional details.
> > > >
> > > >
> > > >> Velox kernels benefit from being able to append data to the array
> from
> > > >> different threads without care for strict ordering. Only the offsets
> > > array
> > > >> has to be written according to logical order but that is
> potentially a
> > > much
> > > >> smaller buffer than the values buffer.
> > > >>
> > > > It still seems to me like applications are still pretty niche, as I
> > > suspect
> > > > in most cases the benefits are outweighed by the costs. The benefit
> here
> > > > seems pretty limited: if you are trying to split work between
> threads,
> > > > usually you will have other levels such as array chunks to
> parallelize.
> > > And
> > > > if you have an incoming stream of row data, you'll want to append in
> > > > predictable order to match the order of the other arrays. Am I
> missing
> > > > something?
> > > >
> > > > And, IIUC, the cost of using ListView with out-of-order values over
> > > > ListArray is you lose memory locality; the values of element 2 are no
> > > > longer adjacent to the values of element 1. What do you think about
> that
> > > > tradeoff?
> > > >
> > > > I don't mean to be difficult about this. I'm excited for both the
> REE and
> > > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > > what I
> > > > am trying to ask is, if we added this, do we think many Arrow and
> query
> > > > engine implementations (for example, DataFusion) will be eager to add
> > > full
> > > > support for the type, including compute kernels? Or are they likely
> to
> > > just
> > > > convert this type to ListArray at import boundaries?
> > > >
> > > > Because if it turns out to be the latter, then we might as well ask
> Velox
> > > > to export this type as ListArray and save the rest of the ecosystem
> some
> > > > work.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > > > felipe...@gmail.com> wrote:
> > > >
> > > >> Initial reason for ListView arrays in Arrow is zero-copy
> compatibility
> > > with
> > > >> Velox which uses this format.
> > > >>
> > > >> Velox kernels benefit from being able to append data to the array
> from
> > > >> different threads without care for strict ordering. Only the offsets
> > > array
> > > >> has to be written according to logical order but that is
> potentially a
> > > much
> > > >> smaller buffer than the values buffer.
> > > >>
> > > >> Acero kernels could take advantage of that in the future.
> > > >>
> > > >> In implementing ListViewArray/Type I was able to reuse some C++
> > > templates
> > > >> used for ListArray which can reduce some of the burden on kernel
> > > >> implementations that aim to work with all the types.
> > > >>
> > > >> I’m can fix Acero kernels for working with ListView. This is
> similar to
> > > the
> > > >> work I’ve doing in kernels dealing with run-end encoded arrays.
> > > >>
> > > >> —
> > > >> Felipe
> > > >>
> > > >>
> > > >> On Wed, 26 Apr 2023 at 01:03 Will Jones <will.jones...@gmail.com>
> > > wrote:
> > > >>
> > > >>> I suppose one common use case is materializing list columns after
> some
> > > >>> expanding operation like a join or unnest. That's a case where I
> could
> > > >>> imagine a lot of repetition of values. Haven't yet thought of
> common
> > > >> cases
> > > >>> where there is overlap but not full duplication, but am eager to
> hear
> > > >> any.
> > > >>> The dictionary encoding point Raphael makes is interesting,
> especially
> > > >>> given the existence of LargeList and FixedSizeList. For many
> > > operations,
> > > >> it
> > > >>> might make more sense to just compose those existing types.
> > > >>>
> > > >>> IIUC the operations that would be unique to the ArrayView are ones
> > > >> altering
> > > >>> the shape. One could truncate each array to a certain length
> cheaply
> > > >> simply
> > > >>> by replacing the sizes buffer. Or perhaps there are interesting
> > > >> operations
> > > >>> on tensors that would benefit.
> > > >>>
> > > >>> On Tue, Apr 25, 2023 at 7:47 PM Raphael Taylor-Davies
> > > >>> <r.taylordav...@googlemail.com.invalid> wrote:
> > > >>>
> > > >>>> Unless I am missing something, I think the selection use-case
> could be
> > > >>>> equally well served by a dictionary-encoded BinarArray/ListArray,
> and
> > > >>> would
> > > >>>> have the benefit of not requiring any modifications to the
> existing
> > > >>> format
> > > >>>> or kernels.
> > > >>>>
> > > >>>> The major additional flexibility of the proposed encoding would be
> > > >>>> permitting disjoint or overlapping ranges, are these common
> enough in
> > > >>>> practice to represent a meaningful bottleneck?
> > > >>>>
> > > >>>>
> > > >>>> On 26 April 2023 01:40:14 BST, David Li <lidav...@apache.org>
> wrote:
> > > >>>>> Is there a need for a 64-bit offsets version the same way we have
> > > List
> > > >>>> and LargeList?
> > > >>>>> And just to be clear, the difference with List is that the lists
> > > don't
> > > >>>> have to be stored in their logical order (or in other words,
> offsets
> > > do
> > > >>> not
> > > >>>> have to be nondecreasing and so we also need sizes)?
> > > >>>>> On Wed, Apr 26, 2023, at 09:37, Weston Pace wrote:
> > > >>>>>> For context, there was some discussion on this back in [1].  At
> that
> > > >>>> time
> > > >>>>>> this was called "sequence view" but I do not like that name.
> > > >> However,
> > > >>>>>> array-view array is a little confusing.  Given this is similar
> to
> > > >> list
> > > >>>> can
> > > >>>>>> we go with list-view array?
> > > >>>>>>
> > > >>>>>>> Thanks for the introduction. I'd be interested to hear about
> the
> > > >>>>>>> applications Velox has found for these vectors, and in what
> > > >>> situations
> > > >>>>>> they
> > > >>>>>>> are useful. This could be contrasted with the current ListArray
> > > >>>>>>> implementations.
> > > >>>>>> I believe one significant benefit is that take (and by proxy,
> > > >> filter)
> > > >>>> and
> > > >>>>>> sort are O(# of items) with the proposed format and O(# of
> bytes)
> > > >> with
> > > >>>> the
> > > >>>>>> current format.  Jorge did some profiling to this effect in [1].
> > > >>>>>>
> > > >>>>>> [1]
> > > >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> > > >>>>>> On Tue, Apr 25, 2023 at 3:13 PM Will Jones <
> will.jones...@gmail.com
> > > >>>> wrote:
> > > >>>>>>> Hi Felipe,
> > > >>>>>>>
> > > >>>>>>> Thanks for the introduction. I'd be interested to hear about
> the
> > > >>>>>>> applications Velox has found for these vectors, and in what
> > > >>> situations
> > > >>>> they
> > > >>>>>>> are useful. This could be contrasted with the current ListArray
> > > >>>>>>> implementations.
> > > >>>>>>>
> > > >>>>>>> IIUC it would be fairly cheap to transform a ListArray to an
> > > >>>> ArrayView, but
> > > >>>>>>> expensive to go the other way.
> > > >>>>>>>
> > > >>>>>>> Best,
> > > >>>>>>>
> > > >>>>>>> Will Jones
> > > >>>>>>>
> > > >>>>>>> On Tue, Apr 25, 2023 at 3:00 PM Felipe Oliveira Carvalho <
> > > >>>>>>> felipe...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi folks,
> > > >>>>>>>>
> > > >>>>>>>> I would like to start a public discussion on the inclusion of
> a
> > > >> new
> > > >>>> array
> > > >>>>>>>> format to Arrow — array-view array. The name is also up for
> > > >> debate.
> > > >>>>>>>> This format is inspired by Velox's ArrayVector format [1].
> > > >>> Logically,
> > > >>>>>>> this
> > > >>>>>>>> array represents an array of arrays. Each element is an
> > > >> array-view
> > > >>>>>>> (offset
> > > >>>>>>>> and size pair) that points to a range within a nested "values"
> > > >>> array
> > > >>>>>>>> (called "elements" in Velox docs). The nested array can be of
> any
> > > >>>> type,
> > > >>>>>>>> which makes this format very flexible and powerful.
> > > >>>>>>>>
> > > >>>>>>>> [image: ../_images/array-vector.png]
> > > >>>>>>>> <
> > > >>> https://facebookincubator.github.io/velox/_images/array-vector.png
> >
> > > >>>>>>>> I'm currently working on a C++ implementation and plan to work
> > > >> on a
> > > >>>> Go
> > > >>>>>>>> implementation to fulfill the two-implementations requirement
> for
> > > >>>> format
> > > >>>>>>>> changes.
> > > >>>>>>>>
> > > >>>>>>>> The draft design:
> > > >>>>>>>>
> > > >>>>>>>> - 3 buffers: [validity_bitmap, int32 offsets buffer, int32
> sizes
> > > >>>> buffer]
> > > >>>>>>>> - 1 child array: "values" as an array of the type parameter
> > > >>>>>>>>
> > > >>>>>>>> validity_bitmap is used to differentiate between empty array
> > > >> views
> > > >>>>>>>> (sizes[i] == 0) and NULL array views (validity_bitmap[i] ==
> 0).
> > > >>>>>>>>
> > > >>>>>>>> When the validity_bitmap[i] is 0, both sizes and offsets are
> > > >>>> undefined
> > > >>>>>>> (as
> > > >>>>>>>> usual), and when sizes[i] == 0, offsets[i] is undefined. 0 is
> > > >>>> recommended
> > > >>>>>>>> if setting a value is not an issue to the system producing the
> > > >>>> arrays.
> > > >>>>>>>> offsets buffer is not required to be ordered and views don't
> have
> > > >>> to
> > > >>>> be
> > > >>>>>>>> disjoint.
> > > >>>>>>>>
> > > >>>>>>>> [1]
> > > >>>>>>>>
> > > >>
> > >
> https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Felipe O. Carvalho
> > > >>>>>>>>
> > >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to