Re: [DISCUSS][Format] Draft implementation of string view array format

Weston Pace Tue, 11 Jul 2023 08:43:18 -0700

> I definitely hope that with time Arrow will penetrate deeper into these
> engines, perhaps in a similar manner to DataFusion, as opposed to
> primarily existing at the surface-level.


I'm not sure the problem here is a lack of understanding or maturity.  In
fact, it would be much easier if this was just a problem of education but
it is not.  Velox is already using Arrow formats (internally, not just at
the boundaries) throughout their system.  Same with DuckDb.  Both teams are
very familiar with Arrow.  They have just made the decision that the string
format is not the correct format to use in a query engine for string data.

I don't think there is a single best layout for all situations.  I don't
think we want to be in the game of declaring which layout is the correct
layout for query engines.  I suspect there will be cases where the
appropriate layout is highly dependent on the workload.

> I just wonder if inclusion in the primary
> standard is really the right place for them. Perhaps some extension
> mechanism might be the way to go here, potentially with some negotiation
> mechanism, I'm not really sure

> I agree 100% that this sort of interoperability is what makes Arrow so
> compelling and something we should work very hard to preserve. This is
> the crux of my concern with standardising alternative layouts

I did have a proposal for alternative layouts when we were discussing the
array view layout.  I'll repeat it here for completeness:

 * There are one or more primary layouts
   * Existing layouts are automatically considered primary layouts, even if
they wouldn't
     have been primary layouts initially (e.g. large list)
 * A new layout, if it is semantically equivalent to another, is considered
an alternative layout
 * An alternative layout still has the same requirements for adoption (two
implementations and a vote)
   * An implementation should not feel pressured to rush and implement the
new layout.
     It would be good if they contribute in the discussion and consider the
layout and vote
     if they feel it would be an acceptable design.
 * We can define and vote and approve as many canonical alternative layouts
as we want:
   * A canonical alternative layout should, at a minimum, have some
     reasonable justification, such as improved performance for algorithm X
 * Arrow implementations MUST support the primary layouts
 * An Arrow implementation MAY support a canonical alternative, however:
   * An Arrow implementation MUST first support the primary layout
   * An Arrow implementation MUST support conversion to/from the primary
and canonical layout
   * An Arrow implementation's APIs MUST only provide data in the
alternative
     layout if it is explicitly asked for (e.g. schema inference should
prefer the primary layout).
 * We can still vote for new primary layouts (e.g. promoting a canonical
alternative) but, in these
    votes we don't only consider the value (e.g. performance) of the layout
but also the interoperability.
    In other words, a layout can only become a primary layout if there is
significant evidence that most
    implementations plan to adopt it.

On Mon, Jul 10, 2023 at 9:49 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> > For example, if someone (datafusion, velox, etc.) were to come up with a
> > framework for UDFs then would batches be passed in and out of those UDFs
> in
> > the Arrow format?
> Yes, I think the arrow format is a perfect fit for this
> > Is Arrow meant to only be used in between systems (in this case query
> > engines) or is it also meant to be used in between components of a query
> > engine?
>
> I can't speak for all query engines, but at least in the case of
> DataFusion we exclusively use the Arrow format as the interchange format
> between operators, including for UDFs. We have found that for most
> operators operating directly on the Arrow format is sufficiently
> performant to not represent a query bottleneck. For others, such as
> joins, sorts and aggregates, we do make use of bespoke data structures
> and formats internally, e.g. hash tables, row formats, etc..., but the
> operator's public APIs are still in terms of arrow RecordBatch. We have
> found this approach to perform very well, whilst also providing very
> good modularity and composability.
>
> In fact we are actually currently in the process of migrating the
> aggregation logic away from a bespoke mutable row representation to the
> Arrow model, and are already seeing significant performance
> improvements, not to mention a significant reduction in code complexity
> and improved composability [1].
>
> > If every engine has its own bespoke formats internally
> > then it seems we are placing a limit on how far things can be decomposed.
> Agreed, if engines choose to implement operations on bespoke formats,
> these operations will likely not be as interoperable as those
> implemented using Arrow. To what extent an engine favours their own
> format(s) over Arrow will be an engineering trade-off they will have to
> make, but DataFusion has found exclusively using Arrow as the
> interchange format between operators to work well.
>
> > There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> I agree 100% that this sort of interoperability is what makes Arrow so
> compelling and something we should work very hard to preserve. This is
> the crux of my concern with standardising alternative layouts. I
> definitely hope that with time Arrow will penetrate deeper into these
> engines, perhaps in a similar manner to DataFusion, as opposed to
> primarily existing at the surface-level.
>
> [1]: https://github.com/apache/arrow-datafusion/pull/6800
>
> On 10/07/2023 11:38, Weston Pace wrote:
> >> The point I was trying to make, albeit very badly, was that these
> >> operations are typically implemented using some sort of row format [1]
> >> [2], and therefore their performance is not impacted by the array
> >> representations. I think it is both inevitable, and in fact something to
> >> be encouraged, that query engines will implement their own in-memory
> >> layouts and data structures outside of the arrow specification for
> >> specific operators, workloads, hardware, etc... This allows them to make
> >> trade-offs based on their specific application domain, whilst also
> >> ensuring that new ideas and approaches can continue to be incorporated
> >> and adopted in the broader ecosystem. However, to then seek to
> >> standardise these layouts seems to be both potentially unbounded scope
> >> creep, and also somewhat counter productive if the goal of
> >> standardisation is improved interoperability?
> > FWIW, I believe this formats are very friendly for row representation as
> > well, especially when stored as a payload (e.g. in a join).
> >
> > For your more general point though I will ask the same question I asked
> on
> > the ArrayView discussion:
> >
> > Is Arrow meant to only be used in between systems (in this case query
> > engines) or is it also meant to be used in between components of a query
> > engine?
> >
> > For example, if someone (datafusion, velox, etc.) were to come up with a
> > framework for UDFs then would batches be passed in and out of those UDFs
> in
> > the Arrow format?  If every engine has its own bespoke formats internally
> > then it seems we are placing a limit on how far things can be decomposed.
> >  From the C++ perspective, I would personally like to see Arrow be usable
> > within components.  There are now multiple implementations of a query
> > engine and I think we are seeing just the edges of this query engine
> > decomposition (e.g. using arrow-c++'s datasets to feed DuckDb or
> consuming
> > a velox task as a record batch stream into a different system) and these
> > sorts of challenges are in the forefront.
> >
> > On Fri, Jul 7, 2023 at 7:53 AM Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.invalid> wrote:
> >
> >>> Thus the approach you
> >>> describe for validating an entire character buffer as UTF-8 then
> checking
> >>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> >> The difference here is that it is perhaps expected for Utf8View to have
> >> gaps in the underlying data that are not referenced as part of any
> >> value, as I had understood this to be one of its benefits over the
> >> current encoding. I think it would therefore be problematic to enforce
> >> these gaps be UTF-8.
> >>
> >>> Furthermore unlike an explicit
> >>> selection vector a kernel may decide to copy and densify dynamically if
> >> it
> >>> detects that output is getting sparse or fragmented
> >> I don't see why you couldn't do something similar to materialize a
> >> sparse selection vector, if anything being able to centralise this logic
> >> outside specific kernels would be advantageous.
> >>
> >>> Specifically sorting and equality comparison
> >>> benefit significantly from the prefix comparison fast path,
> >>> so I'd anticipate that multi column sorting and aggregations would as
> >> well
> >>
> >> The point I was trying to make, albeit very badly, was that these
> >> operations are typically implemented using some sort of row format [1]
> >> [2], and therefore their performance is not impacted by the array
> >> representations. I think it is both inevitable, and in fact something to
> >> be encouraged, that query engines will implement their own in-memory
> >> layouts and data structures outside of the arrow specification for
> >> specific operators, workloads, hardware, etc... This allows them to make
> >> trade-offs based on their specific application domain, whilst also
> >> ensuring that new ideas and approaches can continue to be incorporated
> >> and adopted in the broader ecosystem. However, to then seek to
> >> standardise these layouts seems to be both potentially unbounded scope
> >> creep, and also somewhat counter productive if the goal of
> >> standardisation is improved interoperability? I fully expect in the next
> >> 5 years someone will come up with an even better way to encode strings
> >> for some particular workload or hardware, do we then incorporate that as
> >> well?
> >>
> >> I guess it boils down to what matters to people more, interoperability
> >> or best-in-class performance? Currently I think it is fair to say both
> >> arrow and parquet favour interoperability over performance, aiming to
> >> provide good enough performance broadly on the same order of magnitude
> >> as a custom solution. I personally think this is the right engineering
> >> trade-off, but appreciate opinions may differ. Ultimately I just really
> >> want arrow to avoid the situation parquet has found itself in, where the
> >> specification has both far outstripped the ability for the
> >> implementations to keep pace, whilst simultaneously having standardised
> >> approaches for things like delta encoding that are now considered
> >> extremely sub-optimal for modern hardware.
> >>
> >> That all being said I'm not against adding support for these arrays if
> >> others are already onboard, I just wonder if inclusion in the primary
> >> standard is really the right place for them. Perhaps some extension
> >> mechanism might be the way to go here, potentially with some negotiation
> >> mechanism, I'm not really sure... I will continue to think on this
> >>
> >> Kind Regards,
> >>
> >> Raphael
> >>
> >> [1]:
> >>
> >>
> https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison
> >> [2]: https://docs.rs/arrow-row/latest/arrow_row/
> >>
> >> On 06/07/2023 17:47, Benjamin Kietzman wrote:
> >>> @Andrew:
> >>>
> >>> Restricting these arrays to a single buffer will severely decrease
> their
> >>> utility. Since the character data is stored in multiple character
> buffers
> >>> writing Utf8View array can proceed without resizing allocations,
> >>> which is a major overhead when writing Utf8 arrays. Furthermore since
> the
> >>> character buffers have no restrictions on their size, it's
> >> straightforward
> >>> to
> >>> reuse an existing buffer as a character buffer rather than always
> >> allocating
> >>> a new one. In the case of creating an array which shares a lot of data
> >> with
> >>> another (for example, appending some strings) we can reuse most of the
> >>> character buffers from the original. Finally Utf8View is well adapted
> for
> >>> efficiently wrapping non-arrow string data for ingestion by a kernel,
> >> even
> >>> if the string data's full extent is not known ahead of time and is
> spread
> >>> across multiple non-contiguous buffers.
> >>>
> >>> @Raphael:
> >>>
> >>>> branch on access
> >>> The branch-on-access is unavoidable since a primary feature of the
> >> Utf8View
> >>> format is keeping short strings inline in the fixed width portion of
> >> data.
> >>> It's worth noting that the inline prefix allows skipping the branch
> >> entirely
> >>> for common cases of comparison, for example when the strings to be
> >> compared
> >>> differ within the first 4 bytes.
> >>>
> >>> In benchmarking (for example while building a hash table) I have not
> >>> observed
> >>> that this branch overly pessimizes access. Although I can't guarantee
> >> every
> >>> Utf8View array will be more efficient than any Utf8 array, it is
> >> certainly
> >>> faster for many relevant cases. Specifically sorting and equality
> >> comparison
> >>> benefit significantly from the prefix comparison fast path,
> >>> so I'd anticipate that multi column sorting and aggregations would as
> >> well.
> >>> If there are any other benchmarks which would help to justify Utf8View
> in
> >>> your
> >>> mind, I'd be happy to try writing them.
> >>>
> >>>> UTF-8 validation for StringArray can be done very efficiently by first
> >>> verifying the entire buffer, and then verifying the offsets correspond
> to
> >>> the start of a UTF-8 codepoint
> >>>
> >>> For non-inlined strings, the character buffers do always contain the
> >> entire
> >>> string's data and not just the last `len - 4` bytes. Thus the approach
> >> you
> >>> describe for validating an entire character buffer as UTF-8 then
> checking
> >>> offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
> >>>
> >>>> it does seem inconsistent to use unsigned types
> >>> It is indeed more typical for the arrow format to use signed integers
> for
> >>> offsets and other quantities. In this case there is prior art in other
> >>> engines with which we can remain compatible by using unsigned integers
> >>> instead. Since this is only a break with convention within the format
> and
> >>> shouldn't be difficult for any implementation to accommodate, I would
> >> argue
> >>> that it's worthwhile to avoid pushing change onto existing
> implementers.
> >>>
> >>>> I presume that StringView will behave similarly to dictionaries in
> that
> >>> the selection kernels will not recompute the underlying value buffers.
> >>>
> >>> The Utf8View format itself is not prescriptive of selection operations
> on
> >>> the
> >>> array; kernels are free to reuse character buffers (which produces an
> >>> implicit
> >>> selection vector) or to recompute them. Furthermore unlike an explicit
> >>> selection vector a kernel may decide to copy and densify dynamically if
> >> it
> >>> detects that output is getting sparse or fragmented. It's also worth
> >> noting
> >>> that unlike an explicit selection vector a Utf8View array (however
> >> sparse or
> >>> fragmented) will still benefit from the prefix comparison fast path.
> >>>
> >>> Sincerely,
> >>> Ben Kietzman
> >>>
> >>> On Sun, Jul 2, 2023 at 8:01 AM Raphael Taylor-Davies
> >>> <r.taylordav...@googlemail.com.invalid>  wrote:
> >>>
> >>>>> I would be interested in hearing some input from the Rust community.
> >>>>    A couple of thoughts:
> >>>>
> >>>> The variable number of buffers would definitely pose some challenges
> for
> >>>> the Rust implementation, the closest thing we currently have is
> possibly
> >>>> UnionArray, but even then the number of buffers is still determined
> >>>> statically by the DataType. I therefore also wonder about the
> >> possibility
> >>>> of always having a single backing buffer that stores the character
> data,
> >>>> including potentially a copy of the prefix. This would also avoid
> >> forcing a
> >>>> branch on access, which I would have expected to hurt performance for
> >> some
> >>>> kernels quite significantly.
> >>>>
> >>>> Whilst not really a concern for Rust, which supports unsigned types,
> it
> >>>> does seem inconsistent to use unsigned types where the rest of the
> >> format
> >>>> encourages the use of signed offsets, etc...
> >>>>
> >>>> It isn't clearly specified whether a null should have a valid set of
> >>>> offsets, etc... I think it is an important property of the current
> array
> >>>> layouts that, with exception to dictionaries, the data in null slots
> is
> >>>> arbitrary, i.e. can take any value, but not undefined. This allows for
> >>>> separate handling of the null mask and values, which can be important
> >> for
> >>>> some kernels and APIs.
> >>>>
> >>>> More an observation than an issue, but UTF-8 validation for
> StringArray
> >>>> can be done very efficiently by first verifying the entire buffer, and
> >> then
> >>>> verifying the offsets correspond to the start of a UTF-8 codepoint.
> This
> >>>> same approach would not be possible for StringView, which would need
> to
> >>>> verify individual values and would therefore be significantly more
> >>>> expensive. As it is UB for a Rust string to contain non-UTF-8 data,
> this
> >>>> validation is perhaps more important for Rust than for other
> languages.
> >>>>
> >>>> I presume that StringView will behave similarly to dictionaries in
> that
> >>>> the selection kernels will not recompute the underlying value
> buffers. I
> >>>> think this is fine, but it is perhaps worth noting this has caused
> >>>> confusion in the past, as people somewhat reasonably expect an array
> >>>> post-selection to have memory usage reflecting the smaller selection.
> >> This
> >>>> is then especially noticeable if the data is written out to IPC, and
> >> still
> >>>> contains data that was supposedly filtered out. My 2 cents is that
> >> explicit
> >>>> selection vectors are a less surprising way to defer selection than
> >> baking
> >>>> it into the array, but I also don't have any workloads where this is
> the
> >>>> major bottleneck so can't speak authoritatively here.
> >>>>
> >>>> Which leads on to my major concern with this proposal, that it adds
> >>>> complexity and cognitive load to the specification and
> implementations,
> >>>> whilst not meaningfully improving the performance of the operators
> that
> >> I
> >>>> commonly encounter as performance bottlenecks, which are multi-column
> >> sorts
> >>>> and aggregations, or the expensive string operations such as matching
> or
> >>>> parsing. If we didn't already have a string representation I would be
> >> more
> >>>> onboard, but as it stands I'm definitely on the fence, especially
> given
> >>>> selection performance can be improved in less intrusive ways using
> >>>> dictionaries or selection vectors.
> >>>>
> >>>> Kind Regards,
> >>>>
> >>>> Raphael Taylor-Davies
> >>>>
> >>>> On 02/07/2023 11:46, Andrew Lamb wrote:
> >>>>
> >>>>    * This is the first layout where the number of buffers depends on
> the
> >>>>
> >>>> data
> >>>>
> >>>> and not the schema. I think this is the most architecturally
> significant
> >>>> fact. I
> >>>>
> >>>>    I have spent some time reading the initial proposal -- thank you
> for
> >>>> that. I now understand what Weston was saying about the "variable
> >> numbers
> >>>> of buffers". I wonder if you considered restricting such arrays to a
> >> single
> >>>> buffer (so as to make them more similar to other arrow array types
> that
> >>>> have a fixed number of buffers)? On Tue, Jun 20, 2023 at 11:33 AM
> Weston
> >>>> Pace<weston.p...@gmail.com>  <mailto:weston.p...@gmail.com>  wrote:
> >>>>
> >>>> Before I say anything else I'll say that I am in favor of this new
> >> layout.
> >>>> There is some existing literature on the idea (e.g. umbra) and your
> >>>> benchmarks show some nice improvements. Compared to some of the other
> >>>> layouts we've discussed recently (REE, list veiw) I do think this
> >> layout is
> >>>> more unique and fundamentally different. Perhaps most fundamentally
> >>>> different: * This is the first layout where the number of buffers
> >> depends
> >>>> on the data and not the schema. I think this is the most
> architecturally
> >>>> significant fact. It does require a (backwards compatible) change to
> the
> >>>> IPC format itself, beyond just adding new type codes. It also poses
> >>>> challenges in places where we've assumed there will be at most 3
> buffers
> >>>> (e.g. in ArraySpan, though, as you have shown, we can work around this
> >>>> using a raw pointers representation internally in those spots). I
> think
> >>>> you've done some great work to integrate this well with Arrow-C++ and
> >> I'm
> >>>> convinced it can work. I would be interested in hearing some input
> from
> >> the
> >>>> Rust community. Ben, at one point there was some discussion that this
> >> might
> >>>> be a c-data only type. However, I believe that was based on the raw
> >>>> pointers representation. What you've proposed here, if I understand
> >>>> correctly, is an index + offsets representation and it is suitable for
> >> IPC
> >>>> correct? (e.g. I see that you have changes and examples in the IPC
> >>>> reader/writer) On Mon, Jun 19, 2023 at 7:17 AM Benjamin Kietzman <
> >>>> bengil...@gmail.com> <mailto:bengil...@gmail.com>  wrote:
> >>>>
> >>>> Hi Gang, I'm not sure what you mean, sorry if my answers are off base:
> >>>> Parquet's ByteArray will be unaffected by the addition of the string
> >> view
> >>>> type; all arrow strings
> (arrow::Type::STRING,arrow::Type::LARGE_STRING,
> >>>> and with this patcharrow::Type::STRING_VIEW) are converted to
> ByteArrays
> >>>> during serialization to parquet [1]. If you mean that encoding of
> >>>> arrow::Type::STRING_VIEW  will not be as fast as encoding of
> equivalent
> >>>> arrow::Type::STRING, that's something I haven't benchmarked so I can't
> >>>> answer definitively. I would expect it to be
> >>>>
> >>>> faster
> >>>>
> >>>> than first converting STRING_VIEW->STRING then encoding to parquet;
> >> direct
> >>>> encoding avoids allocating and populating temporary buffers. Of course
> >>>>
> >>>> this
> >>>>
> >>>> only applies to cases where you need to encode an array of STRING_VIEW
> >> to
> >>>> parquet- encoding of STRING to parquet will be unaffected. Sincerely,
> >> Ben
> >>>> [1]
> >>>>
> >>>>
> >>>>
> >>
> https://github.com/bkietz/arrow/blob/46cf7e67766f0646760acefa4d2d01cdfead2d5d/cpp/src/parquet/encoding.cc#L166-L179
> >>>>    On Thu, Jun 15, 2023 at 10:34 PM Gang Wu<ust...@gmail.com>
> <mailto:
> >> ust...@gmail.com>  wrote:
> >>>> Hi Ben, The posted benchmark [1] looks pretty good to me. However, I
> >> want
> >>>> to raise a possible issue from the perspective of parquet-cpp.
> >> Parquet-cpp
> >>>> uses a customizedparquet::ByteArray  type [2] for string/binary, I
> >>>>
> >>>> would
> >>>>
> >>>> expect some regression of conversions between parquet reader/writer
> and
> >>>> the proposed string view array, especially when some strings use short
> >> form
> >>>> and others use long form. [1]
> >>>>
> >>>>
> >>>>
> >>
> https://github.com/apache/arrow/blob/41309de8dd91a9821873fc5f94339f0542ca0108/cpp/src/parquet/types.h#L575
> >>>> [2]https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> >>>> Best, Gang On Fri, Jun 16, 2023 at 3:58 AM Will Jones <
> >>>> will.jones...@gmail.com> <mailto:will.jones...@gmail.com>  wrote:
> >>>>
> >>>> Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin
> >>>> Kietzman <bengil...@gmail.com  <mailto:bengil...@gmail.com>
> >>>>
> >>>> wrote:
> >>>>
> >>>> I've addedhttps://github.com/apache/arrow/issues/36112  to track
> >>>> deduplication of buffers on write. I don't think it would require
> >>>> modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM
> Matt
> >>>> Topol <zotthewiz...@gmail.com  <mailto:zotthewiz...@gmail.com>
> >>>>
> >>>> wrote:
> >>>>
> >>>> Based on my understanding, in theory a buffer *could* be shared
> >>>>
> >>>> within
> >>>>
> >>>> a
> >>>>
> >>>> batch since the flatbuffers message just uses an offset and
> >>>>
> >>>> length
> >>>>
> >>>> to
> >>>>
> >>>> identify the buffers. That said, I don't believe any current
> >>>> implementation actually
> >>>>
> >>>> does
> >>>>
> >>>> this
> >>>>
> >>>> or
> >>>>
> >>>> takes advantage of this in any meaningful way. --Matt On Thu, Jun 15,
> >> 2023
> >>>> at 1:00 PM Will Jones <
> >>>>
> >>>> will.jones...@gmail.com  <mailto:will.jones...@gmail.com>>
> >>>>
> >>>> wrote:
> >>>>
> >>>> Hi Ben, It's exciting to see this move along. The buffers will be
> >>>> duplicated. If buffer duplication is
> >>>>
> >>>> becomes
> >>>>
> >>>> a
> >>>>
> >>>> concern,
> >>>>
> >>>> I'd prefer to handle that in the ipc writer. Then buffers which are
> >>>> duplicated
> >>>>
> >>>> could
> >>>>
> >>>> be
> >>>>
> >>>> detected
> >>>>
> >>>> by checking pointer identity and written only once.
> >>>>
> >>>>    Question: to be able to write buffer only once and reference in
> >>>>
> >>>> multiple
> >>>>
> >>>> arrays, does that require a change to the IPC format? Or is
> >>>>
> >>>> sharing
> >>>>
> >>>> buffers
> >>>>
> >>>> within the same batch already allowed in the IPC format? Best, Will
> >> Jones
> >>>> On Thu, Jun 15, 2023 at 9:03 AM Benjamin Kietzman <
> >>>>
> >>>> bengil...@gmail.com  <mailto:bengil...@gmail.com>
> >>>>
> >>>> wrote:
> >>>>
> >>>> Hello again all, The PR [1] to add string view to the format and the
> C++
> >>>>
> >>>> implementation
> >>>>
> >>>> is
> >>>>
> >>>> hovering around passing CI and has been undrafted.
> >>>>
> >>>> Furthermore,
> >>>>
> >>>> there
> >>>>
> >>>> is
> >>>>
> >>>> now also a PR [2] to add string view to the Go
> >>>>
> >>>> implementation.
> >>>>
> >>>> Code
> >>>>
> >>>> review
> >>>>
> >>>> is underway for each PR and I'd like to move toward a vote
> >>>>
> >>>> for
> >>>>
> >>>> acceptance-
> >>>>
> >>>> are there any other preliminaries which I've neglected? To reiterate
> the
> >>>> answers to some past questions: - Benchmarks are added in the C++ PR
> >> [1] to
> >>>> demonstrate the
> >>>>
> >>>> performance
> >>>>
> >>>> of
> >>>>
> >>>>    conversion between the various string formats. In addition,
> >>>>
> >>>> there
> >>>>
> >>>> are
> >>>>
> >>>>    some benchmarks which demonstrate the performance gains
> >>>>
> >>>> available
> >>>>
> >>>> with
> >>>>
> >>>>    the new format [3]. - Adding string view to the C ABI is a natural
> >> follow
> >>>> up, but
> >>>>
> >>>> should
> >>>>
> >>>> be
> >>>>
> >>>>    handled independently. An issue has been added to track
> >>>>
> >>>> that
> >>>>
> >>>>    enhancement [4]. Sincerely, Ben Kietzman [1]
> >>>> https://github.com/apache/arrow/pull/35628  [2]
> >>>> https://github.com/apache/arrow/pull/35769  [3]
> >>>>
> >>>> https://github.com/apache/arrow/pull/35628#issuecomment-1583218617
> >>>>
> >>>> [4]https://github.com/apache/arrow/issues/36099  On Wed, May 17, 2023
> >> at
> >>>> 12:53 PM Benjamin Kietzman <
> >>>>
> >>>> bengil...@gmail.com  <mailto:bengil...@gmail.com>>
> >>>>
> >>>> wrote:
> >>>>
> >>>> @Jacob
> >>>>
> >>>> You mention benchmarks multiple times, are these results
> >>>>
> >>>> published
> >>>>
> >>>> somewhere? I benchmarked the performance of raw pointer vs index
> >>>>
> >>>> offset
> >>>>
> >>>> views
> >>>>
> >>>> in
> >>>>
> >>>> my
> >>>>
> >>>> PR to velox, I do intend to port them to my arrow PR but I haven't
> >>>>
> >>>> gotten
> >>>>
> >>>> there
> >>>>
> >>>> yet.
> >>>>
> >>>> Furthermore, it seemed less urgent to me since coexistence of the two
> >>>> types
> >>>>
> >>>> in
> >>>>
> >>>> the
> >>>>
> >>>> c++
> >>>>
> >>>> implementation defers the question of how aggressively one should be
> >>>>
> >>>> preferred
> >>>>
> >>>> over
> >>>>
> >>>> the
> >>>>
> >>>> other. @Dewey
> >>>>
> >>>> I don't see the C Data interface in the PR
> >>>>
> >>>>    I have not addressed the C ABI in this PR. As you mention,
> >>>>
> >>>> it
> >>>>
> >>>> may
> >>>>
> >>>> be
> >>>>
> >>>> useful to transmit arrays with raw pointer views between
> implementations
> >>>> which
> >>>>
> >>>> allow
> >>>>
> >>>> them. I
> >>>>
> >>>> can address this in a follow up PR. @Will
> >>>>
> >>>> If I understand correctly, multiple arrays can reference
> >>>>
> >>>> the
> >>>>
> >>>> same
> >>>>
> >>>> buffers
> >>>>
> >>>> in memory, but once they are written to IPC their data
>

Re: [DISCUSS][Format] Draft implementation of string view array format

Reply via email to