Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Weston Pace
I think "inheritance" and "composition" are more concerns for implementations than they are for spec (I could be wrong here). So it seems that it would be sufficient to write the HLLSKETCH's canonical definition as "this is an extension of the JSON logical type and supports all the same storage

Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding) On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc wrote: > Thanks for all the reviews and comments! I've included the big-endian > requirement so the proposed language is now as below. > I'll leave the vote open until after the May holiday. > > Rok > > UUID > > > * Extension name:
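As a rough illustration of the big-endian requirement mentioned in this vote, here is a small sketch using only Python's standard `uuid` module. It assumes the storage is a 16-byte fixed-size binary value; the helper names `uuid_to_storage` and `storage_to_uuid` are made up for this example and are not taken from the proposal text.

```python
import uuid


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 bytes of a UUID in big-endian order."""
    # uuid.UUID.bytes is big-endian; uuid.UUID.bytes_le is the
    # mixed-endian variant and would not satisfy the requirement.
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 big-endian bytes."""
    if len(raw) != 16:
        raise ValueError("expected exactly 16 bytes")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
stored = uuid_to_storage(example)
```

With big-endian storage the bytes read in the same order as the hex digits of the textual form, which is what makes the layout easy to compare across implementations.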

Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding) I agree we should be explicit about RFC-8259 On Mon, Apr 29, 2024 at 4:46 PM David Li wrote: > +1 (binding) > > assuming we explicitly state RFC-8259 > > On Tue, Apr 30, 2024, at 08:02, Matt Topol wrote: > > +1 (binding) > > > > On Mon, Apr 29, 2024 at 5:36 PM Ian Cook wrote: > >
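To make the RFC-8259 point a bit more concrete, here is a hedged sketch of an approximate conformance check using Python's standard `json` module. The function name `is_rfc8259_json` is hypothetical. The one thing it illustrates is that a lenient parser may accept tokens such as `NaN` that RFC 8259 does not allow, so being explicit about the RFC matters.

```python
import json


def _reject_constant(name: str):
    # json.loads calls this hook for "NaN", "Infinity" and "-Infinity",
    # which Python accepts by default but RFC 8259 does not define.
    raise ValueError(f"{name} is not valid under RFC 8259")


def is_rfc8259_json(text: str) -> bool:
    """Approximate check that `text` is a JSON text per RFC 8259."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return True
```

This is only an approximation of the RFC, since it relies on the behaviour of one particular parser.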

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-24 Thread Weston Pace
ld > > > it error, or should it create a table with a single column? > > > > Presumably it should just error? I can see this being ambiguous if there > > were an API that dynamically returned either a table or a column based on > > the input shape (where befo

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Weston Pace
> *As per Apache Parquet Community Parquet V2 is not final yet so it is not > official . They are advising not to use Parquet V2 for writing (though code > is available ) .* This would be news to me. Parquet releases are listed (by the parquet community) at [1] The vote to release parquet 2.10

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-04-22 Thread Weston Pace
I tend to agree with Dewey. Using run-end-encoding to represent a scalar is clever and would keep the c data interface more compact. Also, a struct array is a superset of a record batch (assuming the metadata is kept in the schema). Consumers should always be able to deserialize into a struct
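The idea of using run-end encoding to represent a scalar can be sketched in plain Python as a single run covering the whole logical length. This is only an illustration: the dictionary layout and the names `ree_scalar` and `ree_get` are assumptions of the sketch, not part of the C data interface.

```python
from bisect import bisect_right


def ree_scalar(value, length):
    """Encode `value` repeated `length` times as one run."""
    return {"run_ends": [length], "values": [value]}


def ree_get(encoded, index):
    """Look up the value at a logical index."""
    run_ends = encoded["run_ends"]
    if not 0 <= index < run_ends[-1]:
        raise IndexError(index)
    # The run holding `index` is the first run whose end is greater than it.
    return encoded["values"][bisect_right(run_ends, index)]


column = ree_scalar(42, 1_000_000)
```

However long the logical column is, only one run end and one value are stored, which is why this keeps the representation compact.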

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> people generally find use in Arrow schemas independently of concrete data. This makes sense. I think we do want to encourage use of Arrow as a "type system" even if there is no data involved. And, given that we cannot easily change a field's data type property to "optional" it makes sense to

Re: Unsupported/Other Type

2024-04-17 Thread Weston Pace
> may want an Other type to signal that it would fail if asked to provide particular columns. I interpret "would fail" to mean we are still speaking in some kind of "planning stage" and not yet actually creating arrays. So I don't know that this needs to be a data type. In other words, I see

Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread Weston Pace
Congratulations! On Thu, Apr 11, 2024 at 9:12 AM wish maple wrote: > Congrats! > > Best, > Xuwei Fu > > Kevin Gurney wrote on Thu, Apr 11, 2024 at 23:22: > > > Congratulations, Sarah!! Well deserved! > > > > From: Jacob Wujciak > > Sent: Thursday, April 11, 2024 11:14 AM >

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-08 Thread Weston Pace
> Probably major versions should match between C++ and PyArrow, but I guess > we could have diverging minor and patch versions. Or at least patch > versions given that > a new minor version is usually cut for bug fixes too. I believe even this would be difficult. Stable ABIs are very finicky in

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
Forgot link: [1] https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/Memory On Tue, Apr 2, 2024 at 11:38 AM Weston Pace wrote: > Thanks for taking the time to address my concerns. > > > I've split the S3/HTTP URI flight pieces out into a separate document and

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-04-02 Thread Weston Pace
than a markdown PR for the Arrow documentation as I could > > > more visually express things without a preview of the rendered > markdown. > > If > > > it would get people to be more likely to vote on this, I can write up > the > > > documentation markd

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Weston Pace
Wouldn't support for ADT require expressing more than 1 type id per record? In other words, if `put` has type id 1, `delete` has type id 2, and `erase` has type id 3 then there is no way to express something is (for example) both type id 1 and type id 3 because you can only have one type id per
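A minimal sketch of the constraint described above, using plain Python lists to mimic a dense-union-like layout. The `put`, `delete` and `erase` names and their type ids come from the message; the list layout and the `union_get` helper are assumptions of this sketch.

```python
# One child sequence per type id (1 = put, 2 = delete, 3 = erase).
children = {
    1: ["put:a", "put:b"],
    2: ["delete:x"],
    3: ["erase:y"],
}

# Exactly one type id and one child offset per record.
type_ids = [1, 2, 1, 3]
offsets = [0, 0, 1, 0]


def union_get(i):
    """Resolve record `i` through its single type id."""
    return children[type_ids[i]][offsets[i]]


records = [union_get(i) for i in range(len(type_ids))]
```

Because `type_ids` holds a single integer per record, a record cannot be both type id 1 and type id 3 at once without changing the layout.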

Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread Weston Pace
Congratulations Joel! On Mon, Apr 1, 2024 at 1:16 PM Bryce Mecum wrote: > Congrats, Joel! > > On Mon, Apr 1, 2024 at 6:59 AM Matt Topol wrote: > > > > On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky > has > > accepted an invitation to become a committer on Apache Arrow.

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-03-29 Thread Weston Pace
Thank you for bringing this up. I'm in favor of this. I think there are several motivations but the main ones are: 1. Decoupling the versions will allow components to have no release, or only a minor release, when there are no breaking changes 2. We do have some vote fatigue I think and we

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-03-28 Thread Weston Pace
I'm sorry for the very late reply. Until yesterday I had no real concept of what this was talking about and so I had stayed out. I'm +0 only because it isn't clear what we are voting on. There is a word doc with no implementation or PR. I think there could be an implementation / PR. For

Re: Apache Arrow Flight - From Rust to Javascript (FlightData)

2024-03-21 Thread Weston Pace
> I don't think there is currently a direct equivalent to > `FlightRecordBatchStream` in the arrow javascript library, but you should > be able to combine the data header + body and then read it using the > `fromIPC` functions since it's just the Arrow IPC format The RecordBatchReader[1] _should_

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread Weston Pace
Congratulations! On Sun, Mar 17, 2024, 8:01 PM Jacob Wujciak wrote: > Congrats, well deserved! > > Nic Crane wrote on Mon, Mar 18, 2024, 03:24: > > > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has > > accepted an invitation to become a committer on Apache Arrow.

Re: [DISCUSS] Looking for feedback on my Rust library

2024-03-14 Thread Weston Pace
Felipe's points are good. I don't know that you need to adapt the entire ADBC, it sort of depends what you're after. I see what you've got right now as more of an SQL abstraction layer. For example, similar to things like [1][2][3] (though 3 is more of an ORM). If you like the SQL interface

Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Weston Pace
+1 (binding) On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb wrote: > Hello, > > As we have discussed[1][2] I would like to vote on the proposal to > create a new Apache Top Level Project for DataFusion. The text of the > proposed resolution and background document is copy/pasted below > > If the

Re: [ANNOUNCE] New Arrow committer: Jay Zhan

2024-02-16 Thread Weston Pace
Congrats! On Fri, Feb 16, 2024 at 3:07 AM Raúl Cumplido wrote: > Congratulations!! > > On Fri, Feb 16, 2024 at 12:02, Daniël Heres > () wrote: > > > > Congratulations! > > > > On Fri, Feb 16, 2024, 11:33 Metehan Yıldırım < > metehan.yildi...@synnada.ai> > > wrote: > > > > > Congrats! > > >

Re: [DISC] Improve Arrow Release verification process

2024-01-21 Thread Weston Pace
+1. There have been a few times I've attempted to run the verification scripts. They have failed, but I was pretty confident it was a problem with my environment mixing with the verification script and not a problem in the software itself and I didn't take the time to debug the verification

Re: [DISCUSS] Semantics of extension types

2023-12-14 Thread Weston Pace
I agree engines can use their own strategy. Requiring explicit casts is probably ok as long as it is well documented but I think I lean slightly towards implicitly falling back to the storage type. I do think people still shy away from extension types. Adding the extension type to an

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Weston Pace
t least 72 hours. > > > > > > [ ] +1 > > > [ ] +0 > > > [ ] -1 Keep Flight SQL experimental because... > > > > > > On Fri, Dec 8, 2023, at 13:37, Weston Pace wrote: > > >> +1 > > >> > > >> On Fri,

Re: [DISCUSS] Flight SQL as experimental

2023-12-08 Thread Weston Pace
+1 On Fri, Dec 8, 2023 at 10:33 AM Micah Kornfield wrote: > +1 > > On Fri, Dec 8, 2023 at 10:29 AM Andrew Lamb wrote: > > > I agree it is time to "promote" ArrowFlightSQL to the same level as other > > standards in Arrow > > > > Now that it is used widely (we use and count on it too at

Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread Weston Pace
Congratulations Felipe! On Thu, Dec 7, 2023 at 8:38 AM wish maple wrote: > Congrats Felipe!!! > > Best, > Xuwei Fu > > Benjamin Kietzman wrote on Thu, Dec 7, 2023 at 23:42: > > > On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira > > Carvalho > > has accepted an invitation to become a

Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread Weston Pace
Congrats Andy! On Mon, Nov 27, 2023, 7:31 PM wish maple wrote: > Congrats Andy! > > Best, > Xuwei Fu > > Andrew Lamb wrote on Mon, Nov 27, 2023 at 20:47: > > > I am pleased to announce that the Arrow Project has a new PMC chair and > VP > > as per our tradition of rotating the chair once a year. I have

Re: [ANNOUNCE] New Arrow committer: James Duong

2023-11-17 Thread Weston Pace
Congratulations James On Fri, Nov 17, 2023 at 6:07 AM Metehan Yıldırım < metehan.yildi...@synnada.ai> wrote: > Congratulations! > > On Thu, Nov 16, 2023 at 10:45 AM Sutou Kouhei wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that James Duong > > has accepted an invitation to

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Weston Pace
Congratulations Raúl! On Mon, Nov 13, 2023 at 1:34 PM Ben Harkins wrote: > Congrats, Raúl!! > > On Mon, Nov 13, 2023 at 4:30 PM Bryce Mecum wrote: > > > Congrats, Raúl! > > > > On Mon, Nov 13, 2023 at 10:28 AM Andrew Lamb > > wrote: > > > > > > The Project Management Committee (PMC) for

Re: [DISCUSS][Format] C data interface for Utf8View

2023-11-07 Thread Weston Pace
+1 for the original proposal as well. --- The (minor) problem I see with flags is that there isn't much point to this feature if you are gating on a flag. I'm assuming the goal is what Dewey originally mentioned which is making buffer calculations easier. However, if you're gating the feature

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Weston Pace
Is this buffer lengths buffer only present if the array type is Utf8View? Or are you suggesting that other types might want to adopt this as well? On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington wrote: > > I expect C code to not be much longer then this :-) > > nanoarrow's

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Weston Pace
Congratulations Xuwei! On Mon, Oct 23, 2023 at 3:38 AM wish maple wrote: > Thanks kou and every nice person in arrow community! > > I've learned a lot during learning and contribution to arrow and > parquet. Thanks for everyone's help. > Hope we can bring more fancy features in the future! > >

Re: Apache Arrow file format

2023-10-21 Thread Weston Pace
> Of course, what I'm really asking for is to see how Lance would compare ;-) > P.S. The second paper [2] also talks about ML workloads (in Section 5.8) > and GPU performance (in Section 5.9). It also cites Lance as one of the > future formats (in Section 5.6.2). Disclaimer: I work for LanceDb

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-15 Thread Weston Pace
Congratulations Jon! On Sun, Oct 15, 2023, 1:51 PM Neal Richardson wrote: > Congratulations! > > On Sun, Oct 15, 2023 at 1:35 PM Bryce Mecum wrote: > > > Congratulations, Jon! > > > > On Sat, Oct 14, 2023 at 9:24 AM Andrew Lamb > wrote: > > > > > > The Project Management Committee (PMC) for

Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread Weston Pace
Congratulations! On Sun, Oct 15, 2023, 8:51 AM Gang Wu wrote: > Congrats! > > On Sun, Oct 15, 2023 at 10:49 PM David Li wrote: > > > Congrats & welcome Curt! > > > > On Sun, Oct 15, 2023, at 09:03, wish maple wrote: > > > Congratulations! > > > > > > Raúl Cumplido wrote on Sun, Oct 15, 2023 at 20:48: > >

Re: [DISCUSS][C++] Raw pointer string views

2023-10-06 Thread Weston Pace
> I feel the broader question here is what is Arrow's intended use case - interchange or execution The line between interchange and execution is not always clear. For example, I think we would like Arrow to be considered as a standard for UDF libraries. On Fri, Oct 6, 2023 at 7:34 AM Mark

Re: [Discuss][C++] A framework for contextual/implicit/ambient vars

2023-08-24 Thread Weston Pace
In other languages I have seen this called "async local"[1][2][3]. I'm not sure of any C++ implementations. Folly's fibers claim to have fiber-local variables[4] but I can't find the actual code to use them. I can't seem to find reference to the concept in boost's asio or cppcoro. I've also

Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread Weston Pace
+1 Thanks to all for the discussion and thanks to Ben for all of the great work. On Mon, Aug 21, 2023 at 9:16 AM wish maple wrote: > +1 (non-binding) > > It would help a lot when processing UTF-8 related data! > > Xuwei > > Andrew Lamb wrote on Tue, Aug 22, 2023 at 00:11: > > > +1 > > > > This is a great

Re: Acero and Substrait: How to select struct field from a struct column?

2023-08-07 Thread Weston Pace
> But I can't figure out how to express "select struct field 0 from field 2 > of the original table where field 2 is a struct column" > > Any idea how the substrait message should look like for the above? I believe it would be: ``` "expression": { "selection": { "direct_reference": {

Re: [DISCUSS] Canonical alternative layout proposal

2023-08-02 Thread Weston Pace
> I would welcome a draft PR showcasing the changes necessary in the IPC > format definition, and in the C Data Interface specification (no need to > actually implement them for now :-)). I've proposed something at [1]. > One sketch of an idea: define sets of types that we can call “kinds”** >

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-31 Thread Weston Pace
8/16, The system works fine. CPU is about > > 100%. like 2.1.1 > > 2.2.2 for bucket_size to 32, the bug comes back. CPU halts at 550%. > > > > 2.3 io_thread_count to 8 > > 2.3.1 for bucket_size to 16, it fails somehow. After transferring > > done, the memory accu

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-28 Thread Weston Pace
well, to 800%. > 1. Sometimes, the writing queue can overcome, CPU will go down after > the memory accumulated. The writing speed recovered and memory back to > normal. > 2. Sometimes, it can't. IOBPS goes down sharply, and CPU never goes > down after that. > > How many io th

Re: dataset write stucks on ThrottledAsyncTaskSchedulerImpl

2023-07-27 Thread Weston Pace
You'll need to measure more but generally the bottleneck for writes is usually going to be the disk itself. Unfortunately, standard OS buffered I/O has some pretty negative behaviors in this case. First I'll describe what I generally see happen (the last time I profiled this was a while back but

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-26 Thread Weston Pace
ery helpful explanation. > > On Tue, Jul 25, 2023 at 6:41 PM Weston Pace wrote: > > > 1) As a rule of thumb I would probably prefer `async_scheduler`. It's > more > > feature rich and simpler to use and is meant to handle "long running" > tasks > > (e.g.

Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
above it is probably ok to assume an implicit ordering in many cases). On Wed, Jul 26, 2023 at 8:18 AM Weston Pace wrote: > > I think the key problem is that the input stream is unordered. The > > input stream is a ArrowArrayStream imported from python side, and

Re: how to make acero output order by batch index

2023-07-26 Thread Weston Pace
like to have a discussion on the dataset scanner: does it produce a > > stable sequence of record batches (as an implicit ordering) when the > > underlying storage is not changed? For my situation, the downstream > > executor may crash, then it would request to continue from a > > intermediate

Re: scheduler() and aync_scheduler() on QueryContext

2023-07-25 Thread Weston Pace
1) As a rule of thumb I would probably prefer `async_scheduler`. It's more feature rich and simpler to use and is meant to handle "long running" tasks (e.g. 10s-100s of ms or more). The scheduler is a bit more complex and is intended for very fine-grained scheduling. It's currently only used in

Re: how to make acero output order by batch index

2023-07-25 Thread Weston Pace
> Reading the source code of exec_plan.cc, DeclarationToReader called > DeclarationToRecordBatchGenerator, which ignores the sequence_output > parameter in SinkNodeOptions, also, it calls validate which should > fail if the SinkNodeOptions honors the sequence_output. Then it seems > that

Re: hashing Arrow structures

2023-07-24 Thread Weston Pace
> Also, I don't understand why there are two versions of the hash table > ("hashing32" and "hashing64" apparently). What's the rationale? How is > the user meant to choose between them? Say a Substrait plan is being > executed: which hashing variant is chosen and why? It's not user-configurable.

Re: hashing Arrow structures

2023-07-21 Thread Weston Pace
Yes, those are the two main approaches to hashing in the code base that I am aware of as well. I haven't seen any real concrete comparison and benchmarks between the two. If collisions between NA and 0 are a problem it would probably be ok to tweak the hash value of NA to something unique. I
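To illustrate the NA-versus-0 collision and the suggested tweak, here is a small standard-library sketch. It is not how either hashing approach in the code base works. The hash function (BLAKE2b), the `NULL_HASH` constant and the function names are all assumptions chosen for the example.

```python
import hashlib
import struct

# Arbitrary constant reserved for nulls in this sketch.
NULL_HASH = 0x9E3779B97F4A7C15


def _hash_bytes(data: bytes) -> int:
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def naive_hash(value):
    # Hashes the slot contents only; a null slot is treated like the
    # zero it is commonly backed by, so NA and 0 collide.
    return _hash_bytes(struct.pack("<q", 0 if value is None else value))


def hash_int64(value):
    # Give nulls their own hash value instead of hashing the slot.
    if value is None:
        return NULL_HASH
    return _hash_bytes(struct.pack("<q", value))
```

The tweak only changes the null case, so hashes of non-null values stay the same.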

Re: Need help on ArrayaSpan and writing C++ udf

2023-07-17 Thread Weston Pace
> I may be missing something, but why copy to *out_values++ instead of > *out_values and add 32 to out_values afterwards? Otherwise I agree this is > the way to go. I agree with Jin. You should probably be incrementing `out` by 32 each time `VisitValue` is called. On Mon, Jul 17, 2023 at 6:38 
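The pointer-advance point can be shown with a plain Python buffer: each call writes one 32-byte value and then moves the write position forward by 32, not by one. This is a sketch only. `hashlib.sha256` stands in for whatever 32-byte result the UDF produces, and `write_fixed_width` is a hypothetical name.

```python
import hashlib

WIDTH = 32  # bytes per output value


def write_fixed_width(values):
    """Write one 32-byte result per input value into a flat buffer."""
    out = bytearray(WIDTH * len(values))
    pos = 0
    for value in values:
        # Stand-in for the UDF's real computation; yields 32 bytes.
        result = hashlib.sha256(value).digest()
        out[pos:pos + WIDTH] = result
        # Advance by a whole value so the next write does not overlap.
        pos += WIDTH
    return bytes(out)


buffer = write_fixed_width([b"a", b"b"])
```

Advancing by one byte instead would make each write overlap the previous result.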

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-11 Thread Weston Pace
at this sort of interoperability is what makes Arrow so > compelling and something we should work very hard to preserve. This is > the crux of my concern with standardising alternative layouts. I > definitely hope that with time Arrow will penetrate deeper into these > engines, perhaps in a si

Re: Confusion on substrait AggregateRel::groupings and Arrow consumer

2023-07-10 Thread Weston Pace
Yes, that is correct. What Substrait calls "groupings" is what is often referred to in SQL as "grouping sets". These allow you to compute the same aggregates but group by different criteria. Two very common ways of creating grouping sets are "group by cube" and "group by rollup". Snowflake's
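As a small illustration of how "group by rollup" and "group by cube" expand into grouping sets, here is a standard-library sketch. The function names are made up for the example and do not correspond to Substrait or Acero APIs.

```python
from itertools import combinations


def rollup(columns):
    """Grouping sets for ROLLUP: every prefix, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns):
    """Grouping sets for CUBE: every subset of the columns."""
    return [
        combo
        for n in range(len(columns), -1, -1)
        for combo in combinations(columns, n)
    ]


rollup_sets = rollup(["a", "b"])
cube_sets = cube(["a", "b"])
```

Each tuple is one grouping set; the same aggregates are computed once per set, and the empty tuple is the grand total.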

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-07-10 Thread Weston Pace
s on to my major concern with this proposal, that it adds > >> complexity and cognitive load to the specification and implementations, > >> whilst not meaningfully improving the performance of the operators that > I > >> commonly encounter as performance bottle

Re: Do we need CODEOWNERS ?

2023-07-04 Thread Weston Pace
I agree the experiment isn't working very well. I've been meaning to change my listing from `compute` to `acero` for a while. I'd be +1 for just removing it though. On Tue, Jul 4, 2023, 6:44 AM Dewey Dunnington wrote: > Just a note that for me, the main problem is that I get automatic >

Re: [ANNOUNCE] New Arrow committer: Kevin Gurney

2023-07-03 Thread Weston Pace
Congratulations Kevin! On Mon, Jul 3, 2023 at 5:18 PM Sutou Kouhei wrote: > On behalf of the Arrow PMC, I'm happy to announce that Kevin Gurney > has accepted an invitation to become a committer on Apache > Arrow. Welcome, and thank you for your contributions! > > -- > kou >

Re: Question about large exec batch in acero

2023-07-03 Thread Weston Pace
> is this overflow considered a bug? Or is large exec batch something that should be avoided? This is not a bug and it is something that should be avoided. Some of the hash-join internals expect small batches. I actually thought the limit was 32Ki and not 64Ki because I think there may be some
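One way to avoid large exec batches is to slice the input before it reaches the plan. The sketch below computes `(offset, length)` slices in plain Python. The 32Ki limit is the value guessed at in the message and is treated here as an assumption, and `split_batch` is a hypothetical helper.

```python
MAX_ROWS = 32 * 1024  # assumed limit, taken from the discussion above


def split_batch(num_rows, max_rows=MAX_ROWS):
    """Yield (offset, length) slices that each hold at most `max_rows` rows."""
    for offset in range(0, num_rows, max_rows):
        yield offset, min(max_rows, num_rows - offset)


slices = list(split_batch(70_000))
```

Each pair could then be used to take a zero-copy slice of the original batch.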

Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-29 Thread Weston Pace
Is your use case to operate on a batch of graphs? For example, do you have hundreds or thousands of graphs that you need to run these algorithms on at once? Or is your use case to operate on a single large graph? If it's the single-graph case then how many nodes do you have? If it's one graph

Re: [C++] Dealing with third party method that raises exception

2023-06-29 Thread Weston Pace
We do this quite a bit in the Arrow<->Parquet bridge if IIUC. There are macros defined like this: ``` #define BEGIN_PARQUET_CATCH_EXCEPTIONS try { #define END_PARQUET_CATCH_EXCEPTIONS \ }\ catch (const

Re: Question about nested columnar validity

2023-06-29 Thread Weston Pace
>> 2. For StringView and ArrayView, if the parent has `validity = false`. >> If they have `validity = true`, there offset might point to a invalid >> position >I have no idea, but I hope not. Ben Kietzman might want to answer more >precisely here. I think, for view arrays, the offsets

Re: Question about nested columnar validity

2023-06-28 Thread Weston Pace
I agree with Antoine but I get easily confused by "valid, as in structurally correct" and "valid, as in not null" so I want to make sure I understand: > The child of a nested > array should be valid itself, independently of the parent's validity bitmap. A child must be "structurally correct"

Re: Enabling apache/arrow GitHub dependency graph with vcpkg

2023-06-28 Thread Weston Pace
Thanks for reaching out. This sounds like a useful tool and I'm happy to hear about more development around establishing supply chain awareness. However, Arrow is an Apache Software Project and, as such, we don't manage all of the details of our Github repository. Some of these (including, I

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Weston Pace
> The trouble is that Dataset was not designed to serve as a > general-purpose unmaterialized dataframe. For example, the PyArrow > Dataset constructor [5] exposes options for specifying a list of > source files and a partitioning scheme, which are irrelevant for many > of the applications that

Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Weston Pace
Congrats Dewey! On Fri, Jun 23, 2023 at 9:00 AM Antoine Pitrou wrote: > > Welcome to the PMC Dewey! > > > On 23/06/2023 at 16:59, Joris Van den Bossche wrote: > > Congrats Dewey! > > > > On Fri, 23 Jun 2023 at 16:54, Jacob Wujciak-Jens > > wrote: > >> > >> Well deserved! Congratulations

Re: [DISCUSS][Format][Flight] Result set expiration support

2023-06-23 Thread Weston Pace
One small difference seems to be that Close is idempotent and Cancel is not. > void cancel() > throws SQLException > > Cancels this Statement object if both the DBMS and driver support aborting an SQL statement. This method can be used by one thread to cancel a statement that is being

Re: Question about `minibatch`

2023-06-20 Thread Weston Pace
Those goals are somewhat compatible. Sasha can probably correct me if I get this wrong but my understanding is that the minibatch is just large enough to ensure reliable vectorized execution. It is used in some innermost critical sections to both keep the working set small (fit in L1) and

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-20 Thread Weston Pace
Before I say anything else I'll say that I am in favor of this new layout. There is some existing literature on the idea (e.g. umbra) and your benchmarks show some nice improvements. Compared to some of the other layouts we've discussed recently (REE, list view) I do think this layout is more

Re: [ANNOUNCE] New Arrow PMC member: Ben Baumgold,

2023-06-20 Thread Weston Pace
Congratulations Ben! On Tue, Jun 20, 2023 at 7:38 AM Jacob Quinn wrote: > Yay! Congrats Ben! Love to see more Julia folks here! > > -Jacob > > On Tue, Jun 20, 2023 at 4:15 AM Andrew Lamb wrote: > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Ben Baumgold, to

Re: pyarrow Table.from_pylist doesn;t release memory

2023-06-15 Thread Weston Pace
Note that you can ask pyarrow how much memory it thinks it is using with the pyarrow.total_allocated_bytes[1] function. This can be very useful for tracking memory leaks. I see that memory-profiler now has support for different backends. Sadly, it doesn't look like you can register a custom

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
and adds an extra buffer containing sizes. For symmetry > >> with the List and LargeList types (FixedSizeList not included), I'm > >> going to propose we add a LargeListView. That is not part of the > >> draft implementation yet, but seems like an obvious thing to have >

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Weston Pace
> >> On Sat, May 27, 2023 at 7:44 PM Micah Kornfield > >> wrote: > >>> > >>> This sounds reasonable to me but my main concern is, I'm not sure there > >> is > >>> a great mechanism to enforce canonical layouts don't somehow become > >> de

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Weston Pace
Are you looking for something in C++ or python? We have a thing called the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which (if memory serves) is the heart of the functionality in C++. It would be nice to add some python bindings for this functionality as this ask comes
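For readers who want the general idea without the C++ class, here is a pure-Python sketch of what a grouper does across a stream of batches. `SketchGrouper` and `group_rows` are hypothetical names and do not mirror the `arrow::compute::Grouper` interface.

```python
from collections import defaultdict


class SketchGrouper:
    """Hypothetical stand-in: maps each distinct key to a dense group id."""

    def __init__(self):
        self._ids = {}

    def consume(self, keys):
        # New keys get the next unused id; repeated keys reuse their id,
        # even when they show up again in a later batch.
        return [self._ids.setdefault(key, len(self._ids)) for key in keys]

    @property
    def num_groups(self):
        return len(self._ids)


def group_rows(batches):
    """Collect rows from a stream of (keys, rows) batches by group id."""
    grouper = SketchGrouper()
    groups = defaultdict(list)
    for keys, rows in batches:
        for group_id, row in zip(grouper.consume(keys), rows):
            groups[group_id].append(row)
    return dict(groups)


grouped = group_rows([(["a", "b", "a"], [1, 2, 3]), (["c", "b"], [4, 5])])
```

The grouper keeps its key-to-id mapping between batches, which is what lets rows from different batches land in the same group.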

Re: [ANNOUNCE] New Arrow PMC member: Jie Wen (jakevin / jackwener)

2023-06-13 Thread Weston Pace
Congratulations On Tue, Jun 13, 2023, 1:28 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Congratulations! > > On Mon, 12 Jun 2023 at 22:00, Raúl Cumplido > wrote: > > > > Congratulations Jie!!! > > > > On Mon, Jun 12, 2023, 20:35, Matt Topol > wrote: > > > > > Congrats

Re: [Python] Dataset scanner fragment skip options.

2023-06-12 Thread Weston Pace
> I would like to know if it is possible to skip the specific set of batches, > for example, the first 10 batches and read from the 11th Batch. This sort of API does not exist today. You can skip files by making a smaller dataset with fewer files (and I think, with parquet, there may even be a

Re: [ANNOUNCE] New Arrow committer: Mehmet Ozan Kabak

2023-06-08 Thread Weston Pace
Congratulations! On Thu, Jun 8, 2023, 5:36 PM Mehmet Ozan Kabak wrote: > Thanks everybody. Looking to collaborate further! > > > On Jun 8, 2023, at 9:52 AM, Matt Topol wrote: > > > > Congrats! Welcome Ozan! > > > > On Thu, Jun 8, 2023 at 8:53 AM Raúl Cumplido > wrote: > > > >> Congratulations

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
uch details should be discussed in a separate thread, but I > raise this here just to point out that it implies an expansion in the > scope of what Arrow interfaces can do. > > On Tue, Jun 6, 2023 at 6:17 PM Weston Pace wrote: > > > > From Micah: > > > > >

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Weston Pace
I think it might be worth rethinking binding > a > > layout into the schema versus having a different concept of encoding (and > > changing some of the corresponding data structures). > > > > > > On Mon, May 22, 2023 at 10:37 AM Weston Pace > wrote: > > >

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
same time the page index was read. That's how I'm implementing > with Lance, and how I plan to implement with Delta Lake. But if you can't > do that, then filtering with an anti-join makes sense. You wouldn't want to > include those in a plan. > > On Fri, Jun 2, 2023 at 7:38 AM Weston

Re: Add limit and offset to ScannerOption

2023-06-02 Thread Weston Pace
The simplest way to do this sort of paging today would be to create multiple files and then you could read as few or as many files as you want. This approach also works regardless of format. With parquet/orc you can create multiple row groups / stripes within a single file, and then partition

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
Also, for clarity, I do agree with Gang that these are both valuable features in their own right. A mask makes a lot of sense for page indices. On Fri, Jun 2, 2023 at 7:36 AM Weston Pace wrote: > > then I think the incremental cost of adding the > > positional deletes to the mask

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-02 Thread Weston Pace
mplementation. Table formats (e.g. Apache Iceberg and > > Delta) require the knowledge of row index to finalize row deletion. It > > would be trivial to natively support row index from the file reader. > > > > Best, > > Gang > > > > On Fri, Jun 2, 2023 at

Re: [DISCUSS] Acero's ScanNode and Row Indexing across Scans

2023-06-01 Thread Weston Pace
I agree that having a row_index is a good approach. I'm not sure a mask would be the ideal solution for Iceberg (though it is a reasonable feature in its own right) because I think position-based deletes, in Iceberg, are still done using an anti-join and not a filter. That being said, we
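The two ways of applying position-based deletes mentioned above can be contrasted in a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, and it does not show how Iceberg or Acero implement either approach.

```python
def apply_delete_mask(rows, first_row_index, deleted_positions):
    """Mask approach: drop rows inside the scan using their file position."""
    deleted = set(deleted_positions)
    return [
        row
        for i, row in enumerate(rows)
        if first_row_index + i not in deleted
    ]


def anti_join_deletes(indexed_rows, deleted_positions):
    """Anti-join approach: the scan emits (row_index, row) pairs and a
    later step removes the pairs whose index appears in the delete set."""
    deleted = set(deleted_positions)
    return [(idx, row) for idx, row in indexed_rows if idx not in deleted]


rows = ["a", "b", "c", "d"]
masked = apply_delete_mask(rows, 100, [101, 103])
joined = anti_join_deletes(list(enumerate(rows, start=100)), [101, 103])
```

The anti-join variant only needs the scan to expose a row index column, while the mask variant needs the scan itself to accept the delete positions.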

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-22 Thread Weston Pace
ow > > > >>> settled > > > >>>>>> upon, I'm not yet convinced it is sufficiently better to > > incentivise > > > >>>>>> broad ecosystem adoption. > > > >>>>>> > > > >>>

[DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread Weston Pace
Regrettably, 12.0.0 had a significant performance regression (I'll take the blame for not thinking through all the use cases), most easily exposed when writing datasets from pandas / numpy data, which is being addressed in [1]. I believe this to be a fairly common use case and it may warrant a

Re: [ANNOUNCE] New Arrow committer: Gang Wu

2023-05-15 Thread Weston Pace
Congratulations! On Mon, May 15, 2023 at 6:34 AM Rok Mihevc wrote: > Congrats Gang! > > Rok > > On Mon, May 15, 2023 at 3:33 PM Sutou Kouhei wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Gang > > Wu has accepted an invitation to become a committer on > > Apache Arrow.

Re: Re: Reusing RecordBatch objects and their memory space

2023-05-12 Thread Weston Pace
> > > > > Original message > > From: "Weston Pace" < weston.p...@gmail.com >; > >

Re: Reusing RecordBatch objects and their memory space

2023-05-12 Thread Weston Pace
I think there are perhaps various things being discussed here: * Reusing large blocks of memory I don't think the memory pools actually provide this kind of reuse (e.g. they aren't like "connection pools" or "thread pools"). I'm pretty sure, when you allocate a new buffer on a pool, it always

Re: Freeing memory when working with static crt in windows.

2023-05-12 Thread Weston Pace
You're right that the default is delete/free. However, the important bit is that it needs to be the correct delete/free. The error you described originates from the fact that the final application has two copies of the CRT and thus two copies of delete/free. Since shared_ptr/unique_ptr picks

Re: Freeing memory when working with static crt in windows.

2023-05-12 Thread Weston Pace
I'm not very familiar with Windows. However, I read through [1] and that matches your description. I suppose I thought that a shared_ptr / unique_ptr would not have this problem. I believe these smart pointers store / template a deleter as part of their implementation. This seems to be
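To make the point about smart pointers and deleters concrete, here is a minimal sketch using only the standard library. `LibraryFree`, `MakeSharedInt`, `MakeUniqueInt`, and the counter are hypothetical names invented for illustration; they are not part of Arrow. The sketch only shows that a `shared_ptr` stores the deleter it was constructed with and that a `unique_ptr` carries it in its type. It does not by itself demonstrate how two copies of the CRT behave on Windows.

```cpp
#include <cstdlib>
#include <memory>

// Hypothetical counter, only here so the example can observe which
// deallocation routine actually ran.
int g_library_frees = 0;

// Stand-in for "the free that belongs to the allocating module".
void LibraryFree(int* p) {
  ++g_library_frees;
  std::free(p);
}

// Hypothetical factory that lives in the same module as LibraryFree.
std::shared_ptr<int> MakeSharedInt(int value) {
  int* raw = static_cast<int*>(std::malloc(sizeof(int)));
  if (raw == nullptr) return nullptr;
  *raw = value;
  // The deleter is type-erased and stored in the control block, so it
  // travels with the pointer wherever the last reference is dropped.
  return std::shared_ptr<int>(raw, &LibraryFree);
}

// unique_ptr instead makes the deleter part of the pointer's type.
using UniqueInt = std::unique_ptr<int, void (*)(int*)>;

UniqueInt MakeUniqueInt(int value) {
  int* raw = static_cast<int*>(std::malloc(sizeof(int)));
  if (raw != nullptr) *raw = value;
  // A null unique_ptr never invokes its deleter, so this is safe on failure.
  return UniqueInt(raw, &LibraryFree);
}
```

A caller that resets or destroys either pointer ends up in `LibraryFree`, not in whatever `free` the caller's own module would pick. Whether that is enough in a given build still depends on where the deleter's code is actually compiled.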

Re: [ANNOUNCE] New Arrow committer: Marco Neumann

2023-05-11 Thread Weston Pace
Congratulations! On Thu, May 11, 2023 at 4:28 AM vin jake wrote: > Congratulations Marco! > > On Thu, May 11, 2023 at 7:18 AM Andrew Lamb wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Marco Neumann > > has accepted an invitation to become a committer on Apache > > Arrow.

[Format] Is it legal to have a struct array with a shorter length than its children?

2023-05-05 Thread Weston Pace
We allow arrays to have a shorter length than their buffers. Is it also legal for a struct array to have a shorter length than its child arrays? For example, in C++, I can create this today by slicing a struct array: ``` std::shared_ptr<arrow::StructArray> my_array = std::dynamic_pointer_cast<arrow::StructArray>(array);

Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Weston Pace
Congratulations! On Wed, May 3, 2023 at 10:47 AM Raúl Cumplido wrote: > Congratulations Matt! > > On Wed, May 3, 2023, 19:44, vin jake wrote: > > > Congratulations, Matt! > > > > On Thu, May 4, 2023 at 01:42, Felipe Oliveira Carvalho wrote: > > > > > Congratulations, Matt! > > > > > > On Wed, 3 May 2023

Re: [Python] Casting struct to map

2023-05-03 Thread Weston Pace
No, struct array is not naturally castable to map. It's not something that can be done zero-copy and I don't think anyone has encountered this need before. Let me make sure I understand. The goal is to go from a type of STRUCT, where every key in the struct has the same type, to a MAP, where
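As a rough illustration of why this cannot be zero-copy, here is a sketch in plain C++ with no Arrow dependency. `StructColumns`, `MapColumns`, and `StructToMap` are hypothetical, simplified stand-ins for the two layouts (no validity bitmaps, 64-bit integer values only), not Arrow's API. A struct keeps one column per field, while a map keeps per-row offsets into flattened key and item columns, so the values have to be interleaved row by row.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified struct layout: one named child column per field,
// all children assumed to have the same length.
struct StructColumns {
  std::vector<std::pair<std::string, std::vector<int64_t>>> fields;
};

// Simplified map layout: row i owns entries [offsets[i], offsets[i + 1]).
struct MapColumns {
  std::vector<int32_t> offsets;
  std::vector<std::string> keys;
  std::vector<int64_t> items;
};

// Copies every value: field names become keys, and the per-field columns
// are interleaved into a single items column.
MapColumns StructToMap(const StructColumns& s) {
  MapColumns out;
  const std::size_t num_rows =
      s.fields.empty() ? 0 : s.fields.front().second.size();
  out.offsets.reserve(num_rows + 1);
  out.offsets.push_back(0);
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (const auto& field : s.fields) {
      out.keys.push_back(field.first);
      out.items.push_back(field.second[row]);
    }
    out.offsets.push_back(static_cast<int32_t>(out.keys.size()));
  }
  return out;
}
```

With fields `a = [1, 2]` and `b = [10, 20]`, the items column comes out in row order (`a` then `b` for the first row, then again for the second), which is a different memory order from either input column.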

Re: [DISCUSS][Format][Flight] Ordered data support

2023-04-27 Thread Weston Pace
Thank you both for the extra information. Acero couldn't actually merge the streams today, I was thinking more of datafusion and velox which would often want to keep the streams separate, especially if there was some kind of filtering or transformation that could be applied before applying a

Re: [DISCUSS][Format][Flight] Ordered data support

2023-04-27 Thread Weston Pace
So this would be a case where multiple "endpoints" are acting as a single "stream of batches"? Or am I misunderstanding? What're some scenarios where that would be done? When would it be preferred for the client to merge the endpoints instead of the client's user? On Thu, Apr 27, 2023, 3:22 PM

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Weston Pace
ssing much more efficient. Is this understanding correct? > > > > [1] > > > https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout > > [2] > > > https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Weston Pace
For context, there was some discussion on this back in [1]. At that time this was called "sequence view" but I do not like that name. However, array-view array is a little confusing. Given this is similar to list can we go with list-view array? > Thanks for the introduction. I'd be interested

Re: [DISCUSS] Acero roadmap / philosophy

2023-04-11 Thread Weston Pace
d) or > >>> if additional support is going to be "as-needed". Note that I have a > >>> minimal understanding of how "large" substrait is and what proportion > of > >> it > >>> is already supported by > >>> Acero.

Re: [DISCUSSION] C-Data API for Non-CPU Use Cases

2023-04-10 Thread Weston Pace
Sorry, I meant: I am *now* a solid +1 On Mon, Apr 10, 2023 at 1:26 PM Weston Pace wrote: > I am not a solid +1 and I can see the usefulness. Matt and I spoke on > this externally and I think Matt has written a great summary. There were a > few more points that came up in the d

Re: [DISCUSSION] C-Data API for Non-CPU Use Cases

2023-04-10 Thread Weston Pace
pposed to just referring to the dlpack enum and treating this as > an opaque integer if that would be preferable. I definitely agree with the > difficulties in vendoring/repeating the dlpack enum values here and > ensuring it stays up to date. Does anyone else have strong feelings one way >
