Re: [DISCUSS][Format] Draft implementation of string view array format

2023-08-03 Thread Benjamin Kietzman
Hello all, I think that guarantees on masked values are worthwhile to define for more than a single type in isolation. In particular, requiring this exclusively for Utf8View will leave Utf8 and LargeUtf8 as arrays which *may* legally have non-utf8 masked values but cannot be consumed by arrow-rs.

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-31 Thread Matt Topol
> From: Raphael Taylor-Davies > Sent: Monday, July 31, 2023 12:50 AM > To: dev@arrow.apache.org > Subject: Re: [DISCUSS][Format] Draft implementation of string view array format

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-31 Thread Pedro Eugenio Rocha Pedreira
Hi All, Having p

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-31 Thread Raphael Taylor-Davies
-- Pedro Pedreira From: Weston Pace Sent: Tuesday, July 11, 2023 8:42 AM To: dev@arrow.apache.org Subject: Re: [DISCUSS][Format] Draft implementation of st

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-12 Thread Neal Richardson
> Gluten: Plugin to Double SparkSQL's Performance. > oap-project/gluten > github.com > -- > Pedro Pedreira > > From: Weston Pace > Se

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-12 Thread Pedro Eugenio Rocha Pedreira
Double SparkSQL's Performance. oap-project/gluten, github.com -- Pedro Pedreira From: Weston Pace Sent: Tuesday, July 11, 2023 8:42 AM To: dev@arrow.apache.org Subject: Re: [D

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-11 Thread Weston Pace
> I definitely hope that with time Arrow will penetrate deeper into these > engines, perhaps in a similar manner to DataFusion, as opposed to > primarily existing at the surface-level. I'm not sure the problem here is a lack of understanding or maturity. In fact, it would be much easier if this

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Raphael Taylor-Davies
> For example, if someone (datafusion, velox, etc.) were to come up with a framework for UDFs then would batches be passed in and out of those UDFs in the Arrow format? Yes, I think the Arrow format is a perfect fit for this. > Is Arrow meant to only be used in between systems (in this case query

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Weston Pace
> The point I was trying to make, albeit very badly, was that these > operations are typically implemented using some sort of row format [1] > [2], and therefore their performance is not impacted by the array > representations. I think it is both inevitable, and in fact something to > be

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-07 Thread Raphael Taylor-Davies
Thus the approach you describe for validating an entire character buffer as UTF-8 then checking offsets will be just as valid for Utf8View arrays as for Utf8 arrays. The difference here is that it is perhaps expected for Utf8View to have gaps in the underlying data that are not referenced as
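
The validation approach mentioned above (validate the whole character buffer once, then check that each offset lands on a character boundary) can be sketched roughly as follows. This is a minimal illustration in plain Python, not code from any Arrow implementation; the function names `is_char_boundary` and `validate_utf8_array` are hypothetical, and the sketch does not check that offsets are monotonically increasing.

```python
def is_char_boundary(data: bytes, pos: int) -> bool:
    # The end of the buffer is always a boundary; otherwise the byte at
    # `pos` must not be a UTF-8 continuation byte (bit pattern 10xxxxxx).
    return pos == len(data) or (data[pos] & 0xC0) != 0x80


def validate_utf8_array(data: bytes, offsets: list[int]) -> bool:
    # Step 1: validate the entire character buffer in one pass.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Step 2: every offset must be in range and fall between characters,
    # so that no individual value starts or ends mid-codepoint.
    return all(
        0 <= o <= len(data) and is_char_boundary(data, o) for o in offsets
    )
```

For a Utf8View array the same two steps would presumably run per character buffer, with the caveat raised in the message that unreferenced gaps in a buffer are also covered by step 1.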

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-06 Thread Benjamin Kietzman
@Andrew: Restricting these arrays to a single buffer will severely decrease their utility. Since the character data is stored in multiple character buffers, writing a Utf8View array can proceed without resizing allocations, which is a major overhead when writing Utf8 arrays. Furthermore, since the
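
To illustrate the point about writing without resizing, here is a rough sketch of a builder that opens a fresh character buffer whenever the current one is full, instead of growing a single buffer. The class name `ViewBuilder`, the `block_size` parameter, and the `(buffer_index, offset, length)` tuples are assumptions made for this example; inlining of short strings is ignored, and a Python `bytearray` only approximates a fixed-capacity allocation.

```python
class ViewBuilder:
    """Toy builder: character data is spread over several blocks."""

    def __init__(self, block_size: int = 32):
        self.block_size = block_size
        self.buffers: list[bytearray] = []
        self.views: list[tuple[int, int, int]] = []

    def append(self, value: bytes) -> None:
        # Start a new block rather than resizing when the value does not
        # fit. An oversized value simply gets a block of its own.
        if not self.buffers or len(self.buffers[-1]) + len(value) > self.block_size:
            self.buffers.append(bytearray())
        block = self.buffers[-1]
        self.views.append((len(self.buffers) - 1, len(block), len(value)))
        block.extend(value)

    def get(self, i: int) -> bytes:
        buffer_index, offset, length = self.views[i]
        return bytes(self.buffers[buffer_index][offset:offset + length])
```

Previously written blocks are never touched again, which is the property that avoids the copy-on-grow cost described for Utf8 arrays.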

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-02 Thread Raphael Taylor-Davies
> I would be interested in hearing some input from the Rust community. A couple of thoughts: The variable number of buffers would definitely pose some challenges for the Rust implementation, the closest thing we currently have is possibly UnionArray, but even then the number of buffers is

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-02 Thread Andrew Lamb
> * This is the first layout where the number of buffers depends on the data > and not the schema. I think this is the most architecturally significant > fact. I have spent some time reading the initial proposal -- thank you for that. I now understand what Weston was saying about the

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-21 Thread Antoine Pitrou
I hope implementations don't start exposing non-standard datatypes over the C Data Interface (apart from extension types, of course). I would also be wary of exposing non-standard datatypes in the official Arrow C++ implementation. Regards Antoine. Le 21/06/2023 à 14:27, Benjamin

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-21 Thread Benjamin Kietzman
> Ben, at one point there was some discussion that this might be a c-data > only type. However, I believe that was based on the raw pointers > representation. What you've proposed here, if I understand correctly, is > an index + offsets representation and it is suitable for IPC correct? > (e.g.
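
For readers unfamiliar with the "index + offsets" representation referred to here, the following is a hedged sketch of how a 16-byte view might be packed. It assumes the Umbra-style layout (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix followed by a buffer index and an offset); the snippet above does not spell the layout out, so treat the field order and the helper names `encode_view` and `decode_view` as assumptions for illustration.

```python
import struct

INLINE_MAX = 12  # assumed maximum number of bytes stored inside the view


def encode_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    n = len(value)
    if n <= INLINE_MAX:
        # Short string: length plus the bytes themselves, zero padded.
        return struct.pack("<i12s", n, value)
    # Long string: length, 4-byte prefix, buffer index, offset in buffer.
    return struct.pack("<i4sii", n, value[:4], buffer_index, offset)


def decode_view(view: bytes, buffers) -> bytes:
    (n,) = struct.unpack_from("<i", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return bytes(buffers[buffer_index][offset:offset + n])
```

Because the long form stores an index and an offset rather than a raw pointer, the view stays meaningful after the buffers are copied or serialized, which is what makes such a representation a candidate for IPC.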

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-20 Thread Weston Pace
Before I say anything else I'll say that I am in favor of this new layout. There is some existing literature on the idea (e.g. Umbra) and your benchmarks show some nice improvements. Compared to some of the other layouts we've discussed recently (REE, list view) I do think this layout is more

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-19 Thread Benjamin Kietzman
Hi Gang, I'm not sure what you mean, sorry if my answers are off base: Parquet's ByteArray will be unaffected by the addition of the string view type; all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Gang Wu
Hi Ben, The posted benchmark [1] looks pretty good to me. However, I want to raise a possible issue from the perspective of parquet-cpp. Parquet-cpp uses a customized parquet::ByteArray type [2] for string/binary, I would expect some regression of conversions between parquet reader/writer and the

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman wrote: > I've added https://github.com/apache/arrow/issues/36112 to track > deduplication of buffers on write. > I don't think it would require modification of the IPC format. > > Ben > > On Thu, Jun 15, 2023 at 1:30 

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
I've added https://github.com/apache/arrow/issues/36112 to track deduplication of buffers on write. I don't think it would require modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM Matt Topol wrote: > Based on my understanding, in theory a buffer *could* be shared within a >

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Matt Topol
Based on my understanding, in theory a buffer *could* be shared within a batch since the flatbuffers message just uses an offset and length to identify the buffers. That said, I don't believe any current implementation actually does this or takes advantage of this in any meaningful way. --Matt

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Hi Ben, It's exciting to see this move along. > The buffers will be duplicated. If buffer duplication becomes a concern, > I'd prefer to handle > that in the ipc writer. Then buffers which are duplicated could be detected > by checking > pointer identity and written only once. Question: to be
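
The deduplication idea quoted above (detect repeated buffers by pointer identity and write each only once) could look roughly like this. It is a sketch in plain Python, not the Arrow IPC writer: `plan_buffer_writes` is a hypothetical name, and Python's `id()` stands in for comparing buffer addresses.

```python
def plan_buffer_writes(arrays):
    """Return (unique_buffers, refs) for a list of per-array buffer lists.

    unique_buffers holds each distinct buffer object once, in first-seen
    order; refs mirrors `arrays` but holds indices into unique_buffers.
    """
    unique_buffers = []
    index_by_identity = {}
    refs = []
    for array_buffers in arrays:
        row = []
        for buf in array_buffers:
            key = id(buf)  # identity, not content equality
            if key not in index_by_identity:
                index_by_identity[key] = len(unique_buffers)
                unique_buffers.append(buf)
            row.append(index_by_identity[key])
        refs.append(row)
    return unique_buffers, refs
```

Note that two buffers with equal contents but different identities are still written twice; catching those would require hashing the data, which is a different trade-off from the identity check described in the message.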

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
Hello again all, The PR [1] to add string view to the format and the C++ implementation is hovering around passing CI and has been undrafted. Furthermore, there is now also a PR [2] to add string view to the Go implementation. Code review is underway for each PR and I'd like to move toward a vote

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-17 Thread Benjamin Kietzman
@Jacob > You mention benchmarks multiple times, are these results published somewhere? I benchmarked the performance of raw pointer vs index offset views in my PR to velox, I do intend to port them to my arrow PR but I haven't gotten there yet. Furthermore, it seemed less urgent to me since

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Will Jones
Hello Ben, Thanks for your work on this. I think this will be an excellent addition to the format. If I understand correctly, multiple arrays can reference the same buffers in memory, but once they are written to IPC their data buffers will be duplicated. Is that right? Dictionary types have a

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Dewey Dunnington
Very cool! In addition to performance mentioned above, I could see this being useful for the R bindings - we already have a global string pool and a mechanism for keeping a vector of them alive. I don't see the C Data interface in the PR although I may have missed it - is that a part of the

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Jacob Wujciak
Hello Everyone, I think keeping interoperability with the large ecosystem is a very important goal for arrow so I am overall in favor of this proposal! You mention benchmarks multiple times, are these results published somewhere? Thanks On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman wrote:

[DISCUSS][Format] Draft implementation of string view array format

2023-05-16 Thread Benjamin Kietzman
Hello all, As previously discussed on this list [1], an UmbraDB/DuckDB/Velox compatible "string view" type could bring several performance benefits to access and authoring of string data in the arrow format [2]. Additionally better interoperability with engines already using this format could be